Regex to extract everything inside quotes
I am trying to write a regex to match all strings which appear in between enclosing characters (most likely " - double quotes). This is a scenario I commonly encounter while trying to parse a line in a csv file.
So I have a sample line like:
"Smith, John",25,"21/45, North Avenue",IBM
I tried the following regex:
But it returns something like the following:
I am expecting output as follows:
Smith, John 25 21/45, North Avenue IBM
The regex I have written is an attempt to capture what comes between " in my example. However, above is the output I am expecting.
There is an ambiguity, though: I am not looking for a match like: ,25,. This makes me wonder whether a regex is even feasible here.
What is the correct way to write this?
If you really want to roll your own CSV parser, you'll need to teach your regex a few rules:
- A field may be unquoted as long as it doesn't contain quotes, commas or newlines.
- A quoted field may contain any characters; quotes are escaped by doubling.
- Commas are used as separators.
So, to match one CSV field, you can use the following regex:
    (?mx)         # Verbose, multiline mode
    (?<=^|,)      # Assert there is a comma or start of line before the current position.
    (?:           # Start non-capturing group:
     "            # Either match an opening quote, followed by
     (?:          # a non-capturing group:
      ""          # Either an escaped quote
     |            # or
      [^"]+       # any characters except quotes
     )*           # End of inner non-capturing group, repeat as needed.
     "            # Match a closing quote.
    |             # OR
     [^,"\r\n]+   # Match any number of characters except commas, quotes or newlines
    )             # End of outer non-capturing group
    (?=,|$)       # Assert there is a comma or end-of-line after the current position
See it live on regex101.com.
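As a rough illustration, here is how the pattern above might be applied in Python. This is a sketch, not a drop-in port. Python's re module only accepts fixed-width lookbehinds, so (?<=^|,) is rewritten as (?:^|(?<=,)), and the inner [^"]+ is written as [^"] to avoid a quantifier nested inside another quantifier. The names FIELD and split_fields are made up for this example.

```python
import re

# Adapted from the pattern above; FIELD and split_fields are hypothetical names.
FIELD = re.compile(r'''
    (?:^|(?<=,))          # at the start of a line, or right after a comma
    (?:
        "(?:""|[^"])*"    # quoted field; a doubled quote is an escaped quote
      |
        [^,"\r\n]+        # unquoted field: no commas, quotes or newlines
    )
    (?=,|$)               # followed by a comma or the end of the line
''', re.VERBOSE | re.MULTILINE)


def split_fields(line):
    """Return the fields of one CSV line, with outer quotes removed."""
    fields = []
    for match in FIELD.finditer(line):
        text = match.group(0)
        if text.startswith('"'):
            # Strip the enclosing quotes and undo the quote doubling.
            text = text[1:-1].replace('""', '"')
        fields.append(text)
    return fields


print(split_fields('"Smith, John",25,"21/45, North Avenue",IBM'))
# ['Smith, John', '25', '21/45, North Avenue', 'IBM']
```

Because the unquoted branch requires at least one character, empty fields (as in a,,b) are skipped rather than returned as empty strings.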
Please don't use a regex for this; CSV should be handled by a parser.
Here is a ready-to-use parser: http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader
You can also use the OLEDB built-in parser: http://www.switchonthecode.com/tutorials/csharp-tutorial-using-the-built-in-oledb-csv-parser
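The two links above are .NET-specific. To show the same idea in a language-neutral way, here is a minimal sketch using Python's standard csv module instead (a different parser from the ones linked). The helper name parse_line is invented for this example.

```python
import csv


def parse_line(line):
    """Parse a single CSV line with the standard library parser."""
    # csv.reader accepts any iterable of lines, so a one-element list works.
    return next(csv.reader([line]))


print(parse_line('"Smith, John",25,"21/45, North Avenue",IBM'))
# ['Smith, John', '25', '21/45, North Avenue', 'IBM']
```

Unlike a hand-written regex, the parser handles doubled quotes and empty fields without extra work.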
Hope this helps