Matching across a line vs matching words regex
Why is it that when I match across newlines I can no longer identify individual words? For example:
content = "COAL_STORIES\nAUSTRALIA - blah blah blah\nBOTSWANA – blah blah blah \n\nURANIUM_STORIES \nAUSTRALIA – blah\nINDIA - blah\n\nCOPPER_STORIES\nAUSTRALIA - blah blah blah\nAUSTRALIA - blah blah blah\nCHINA - blah blah blah\n\nALUMINIUM_STORIES"
sections = content.scan(/\w.*_.*\b/)
Gives an array:
[  "COAL_STORIES",  "URANIUM_STORIES",  "COPPER_STORIES",  "ALUMINIUM_STORIES" ]
But if I try that using the 'm' flag everything gets matched:
sections = content.scan(/\w.*_.*\b/m)
Gives an array:
[  "COAL_STORIES\nAUSTRALIA - blah blah blah\nBOTSWANA – blah blah blah \n\nURANIUM_STORIES \nAUSTRALIA – blah\nINDIA - blah\n\nCOPPER_STORIES\nAUSTRALIA - blah blah blah\nAUSTRALIA - blah blah blah\nCHINA - blah blah blah\n\nALUMINIUM_STORIES" ]
As far as I can tell I'm still looking for the same word boundaries?
To elaborate on Casimir's comment:
.* is greedy: it will match the longest possible string it can, including newlines if you let it (which you did by adding the m flag; in Ruby, /m makes . match newline characters too).
In your first example .* will not match newlines, so \b is forced to match a word boundary on the same line as where \w matched.
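A minimal sketch of that difference, using a shortened stand-in for the question's input (the variable names text, per_line and across_lines are made up for illustration):

```ruby
# Shortened stand-in for the question's input, with real newlines between lines.
text = "COAL_STORIES\nAUSTRALIA - blah\n\nURANIUM_STORIES"

# Without /m, "." stops at "\n", so every match has to fit on a single line.
per_line = text.scan(/\w.*_.*\b/)

# With /m, "." also matches "\n", so one greedy match can span every line.
across_lines = text.scan(/\w.*_.*\b/m)

p per_line      # => ["COAL_STORIES", "URANIUM_STORIES"]
p across_lines  # => one element: the whole of text
```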
In your second example .* will match across lines, so when \w matches your first character, \b is free to match any word boundary, even many lines away, as long as there's an _ somewhere between the two. Specifically, for you, it looks like:
- \w matched the first character in your input: "C" from "COAL_STORIES"
- .* matched everything after that, up to and including "ALUMINIUM" on the last line
- _ matched "_"
- .* matched "STORIES"
- \b matched the end of "STORIES"
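The breakdown above can be checked by wrapping each piece of the pattern in a named group. This is only a sketch: the group names (first, head, tail) are made up for illustration, and the input is a shortened, ASCII-only copy of the question's string. The last two lines show one possible alternative, assuming section names never contain spaces: \w+ cannot cross whitespace, so the result no longer depends on the m flag.

```ruby
# Shortened, ASCII-only copy of the question's input (an assumption made for brevity).
content = "COAL_STORIES\nAUSTRALIA - blah\n\nURANIUM_STORIES\nINDIA - blah\n\nALUMINIUM_STORIES"

# Same pattern as /\w.*_.*\b/m, with named groups added to expose each piece.
m = content.match(/(?<first>\w)(?<head>.*)_(?<tail>.*)\b/m)

puts m[:first]              # the single character matched by \w
puts m[:head].lines.length  # how many lines the first .* ran across
puts m[:tail]               # what the second .* matched

# Hypothetical alternative: \w+ cannot cross spaces or newlines, so /m is irrelevant here.
names = content.scan(/\w+_\w+/)
p names
```

Because the first .* is greedy, it gives back characters only until the last "_" in the whole string fits, which is why the head group ends at "ALUMINIUM" rather than at "COAL".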