regex match keywords that are not in quotes

How can I look for keywords that are not inside a string?

For example if I have the text:

Hello this text is an example.

bla bla bla "this text is inside a string"

"random string" more text bla bla bla "foo"

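I would like to match every occurrence of the word text that is not inside " ". In other words, I would like to match the occurrence in the first line and the one after "random string" in the last line.

Note that I do not want to match the occurrence in the second line, because it is inside a string.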


Possible solution:

I have been working on it, and this is what I have so far:

(?s)((?<q>")|text)(?(q).*?"|)

Note that the regex conditional has the form: (?(predicate) true alternative|false alternative)

So the regex reads:

Find " or text. If you found ", keep selecting until you find " again (.*?"); if you found text, do nothing more.

When I run that regex, though, it matches the whole string. I am asking this question in order to learn; I know I could remove all the strings first and then look for what I need.
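
For reference, the same conditional idea can be sketched with Python's re module, which has a similar (?(name)...) construct. This is only an illustration in a different engine, not a statement about how .NET treats the pattern above; the sample text and the helper name unquoted_hits are made up for the sketch. The pattern also matches every quoted string, so the caller has to discard the matches in which group q took part.

```python
import re

SAMPLE = (
    'Hello this text is an example.\n'
    '\n'
    'bla bla bla "this text is inside a string"\n'
    '\n'
    '"random string" more text bla bla bla "foo"'
)

# Python spells the named group (?P<q>...). The conditional (?(q)...) only
# consumes up to the closing quote when group q took part in the match.
PATTERN = re.compile(r'(?s)((?P<q>")|text)(?(q).*?")')

def unquoted_hits(s):
    """Start offsets of matches that are the keyword, not a quoted string."""
    return [m.start() for m in PATTERN.finditer(s) if m.group('q') is None]

hits = unquoted_hits(SAMPLE)
```

Filtering on the group is what separates keyword matches from the quoted spans that the pattern consumes along the way.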

Answers


Here is one answer:

(?<=^([^"]|"[^"]*")*)text

This means:

(?<=       # preceded by...
^          # the start of the string, then
([^"]      # either not a quote character
|"[^"]*"   # or a full string
)*         # as many times as you want
)
text       # then the text
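
The lookbehind above is variable-width, which .NET accepts but Python's built-in re module does not. As a rough sketch of the same logic in Python, the prefix check can be done explicitly: find each candidate, then test whether everything before it consists only of non-quote characters and complete strings. The names BALANCED_PREFIX and find_outside_quotes are invented for this example.

```python
import re

# Prefix grammar taken from the lookbehind body: any run of non-quote
# characters and complete "..." strings.
BALANCED_PREFIX = re.compile(r'(?:[^"]|"[^"]*")*')
KEYWORD = re.compile(r'text')

def find_outside_quotes(s):
    """Start offsets of keyword matches whose prefix has only closed strings."""
    return [m.start() for m in KEYWORD.finditer(s)
            # fullmatch(s, 0, end) checks the slice s[0:end] without copying it
            if BALANCED_PREFIX.fullmatch(s, 0, m.start())]

hits = find_outside_quotes('bla text "this text is inside" more text')
```

Each candidate rescans its whole prefix, so this sketch is quadratic in the worst case; it is meant to show the logic, not to be fast.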

You can easily extend this to handle strings containing escapes as well.

In C# code:

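using System.Text.RegularExpressions;
// Match.Success is false here: the only "text" is inside the quoted string.
Match m = Regex.Match("bla bla bla \"this text is inside a string\"",
                      "(?<=^([^\"]|\"[^\"]*\")*)text", RegexOptions.ExplicitCapture);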

Added from comment discussion - extended version (match on a per-line basis and handle escapes). Use RegexOptions.Multiline for this:

(?<=^([^"\r\n]|"([^"\\\r\n]|\\.)*")*)text

In a C# string this looks like:

"(?<=^([^\"\r\n]|\"([^\"\\\\\r\n]|\\\\.)*\")*)text"
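
To see what the escape handling buys, here is a small Python sketch of the per-line check. Python's re cannot run the variable-width lookbehind itself, so the prefix of the current line is tested explicitly against the same grammar; LINE_PREFIX and find_outside_quotes are hypothetical names, and lines are assumed to end in \n or \r\n.

```python
import re

# Per-line prefix grammar: ordinary characters, or complete strings in which
# a backslash escapes the following character.
LINE_PREFIX = re.compile(r'(?:[^"\r\n]|"(?:[^"\\\r\n]|\\.)*")*')
KEYWORD = re.compile(r'text')

def find_outside_quotes(s):
    hits = []
    for m in KEYWORD.finditer(s):
        # Start of the current line: one past the previous \n, or 0.
        line_start = s.rfind('\n', 0, m.start()) + 1
        if LINE_PREFIX.fullmatch(s, line_start, m.start()):
            hits.append(m.start())
    return hits

# The escaped quote \" does not end the string, so the first "text" is inside.
line = r'"a \" text" text'
hits = find_outside_quotes(line)
```

Without the \\. alternative, the escaped quote would be taken as the end of the string and the first occurrence would be reported as outside.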

Since you now want to use ** instead of " here is a version for that:

(?<=^([^*\r\n]|\*(?!\*)|\*\*([^*\\\r\n]|\\.|\*(?!\*))*\*\*)*)text

Explanation:

(?<=       # preceded by
^          # start of line
 (         # either
 [^*\r\n]| #  not a star or line break
 \*(?!\*)| #  or a single star (star not followed by another star)
  \*\*     #  or 2 stars, followed by...
   ([^*\\\r\n] # either: not a star or a backslash or a linebreak
   |\\.        # or an escaped char
   |\*(?!\*)   # or a single star
   )*          # as many times as you want
  \*\*     # ended with 2 stars
 )*        # as many times as you want
)
text      # then the text

Since this version doesn't contain " characters, it's cleaner to use a verbatim string literal:

@"(?<=^([^*\r\n]|\*(?!\*)|\*\*([^*\\\r\n]|\\.|\*(?!\*))*\*\*)*)text"

This can get pretty tricky, but here is one potential method that works by making sure that there is an even number of quotation marks between the matching text and the end of the string:

text(?=[^"]*(?:"[^"]*"[^"]*)*$)

Replace text with the regex that you want to match.

Rubular: http://www.rubular.com/r/cut5SeWxyK

Explanation:

text            # match the literal characters 'text'
(?=             # start lookahead
   [^"]*          # match any number of non-quote characters
   (?:            # start non-capturing group, repeated zero or more times
      "[^"]*"       # one quoted portion of text
      [^"]*         # any number of non-quote characters
   )*             # end non-capturing group
   $              # match end of the string
)               # end lookahead
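
Because this pattern only uses a lookahead, it should carry over to most engines unchanged. A minimal Python sketch on the sample text from the question (the variable names are illustrative):

```python
import re

SAMPLE = (
    'Hello this text is an example.\n'
    '\n'
    'bla bla bla "this text is inside a string"\n'
    '\n'
    '"random string" more text bla bla bla "foo"'
)

# Match 'text' only when an even number of quotes remains up to the end.
# [^"] also matches line breaks, so the check spans the whole input.
OUTSIDE = re.compile(r'text(?=[^"]*(?:"[^"]*"[^"]*)*$)')

hits = [m.start() for m in OUTSIDE.finditer(SAMPLE)]
```

The check relies on the quotes in the rest of the input being balanced; a stray unclosed quote later in the text flips the result for everything before it.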

I would simply greedily match the occurrences of text inside quotes with a non-capturing group to filter them out, and then use a capturing group for the non-quoted ones, like this:

".*(?:text).*"|(text)

You might want to refine this a little for word boundaries etc., but it should get you where you want to go, and it is a clear, readable sample.
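
A minimal Python sketch of this "consume the quoted part, capture the rest" idea follows. It is a variant rather than the exact pattern above: the left branch is "[^"]*" instead of the greedy ".*text.*", because a greedy .* can run from the first quote on a line to the last one and swallow unquoted text in between. The helper name unquoted is made up for the example.

```python
import re

# Left branch consumes a whole quoted string; right branch captures the keyword.
PATTERN = re.compile(r'"[^"]*"|(text)')

def unquoted(s):
    """Start offsets of keyword matches that were captured by group 1."""
    return [m.start(1) for m in PATTERN.finditer(s) if m.group(1) is not None]

line = '"random string" more text bla bla bla "foo"'
hits = unquoted(line)
```

Only matches where group 1 took part are real hits; the others are quoted strings that were matched just to skip over them.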


I have used these answers many times by now and want to share an alternative approach, as sometimes I was not able to implement and use the given answers.

Instead of matching keywords that are outside of something, break the task into two subtasks:

  1. replace everything you do not need to match with an empty string
  2. use an ordinary match

For example, to replace the text in quotes I use:

[dbo].[fn_Utils_RegexReplace] ([TSQLRepresentation_WHERE], '''.*?(?<!\\)''', '')

or, more clearly (without the SQL quote doubling): '.*?(?<!\\)'.

I know that this may look like double work and may have a performance impact on some platforms/languages, so everyone needs to test this, too.
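
The two steps can be sketched in Python as follows. This is only an illustration of the approach, not the T-SQL helper used above; count_outside and the sample input are invented for the example.

```python
import re

# Single-quoted strings; the closing quote must not be preceded by a backslash.
QUOTED = re.compile(r"'.*?(?<!\\)'")

def count_outside(keyword, s):
    """Count keyword occurrences after quoted strings have been removed."""
    stripped = QUOTED.sub('', s)                           # step 1: drop strings
    return len(re.findall(re.escape(keyword), stripped))   # step 2: ordinary match

n = count_outside('text', r"name = 'it\'s text' AND text = 1")
```

Note that removing the strings shifts offsets, so this suits counting or yes/no checks better than locating matches in the original input.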

