Union in regular expression in R
I'm trying to use regular expressions in R to find one or more phrases within a vector of long sentences (which I'll call x).
So, for example, this works fine for one phrase:
grep("(phrase 1)+",x)
But this doesn't work for two (or more) phrases:
grep("(phrase 1)+(phrase 2)+",x)
At least, not as I would expect. As I read it, this last one should give me every element of x that contains one or more occurrences of phrase 1 and one or more occurrences of phrase 2. But it returns nothing.
You have to tell it to skip over any intervening characters:
grep("(phrase 1)+.*(phrase 2)+",x)
Also note that this only matches phrase 1 followed by phrase 2, not the reverse order, so you might have to add the reversed pattern explicitly. Overall, it might be simpler to search for each phrase separately (especially if there are more than two phrases), and then combine the results with intersect and union, depending on whether you want sentences containing all of the phrases or any of them.
which(grepl("(phrase 1)+",x) & grepl("(phrase 2)+",x))
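Here is a minimal sketch of both suggestions. The vector x and its sentences are made up for illustration, and "phrase 1" and "phrase 2" stand in for whatever you are really searching for:

```r
# Hypothetical data, invented for illustration
x <- c("phrase 1 comes before phrase 2",
       "phrase 2 comes before phrase 1",
       "only phrase 1 here",
       "nothing relevant")

# Single pattern that accepts either order by spelling out both
both_orders <- grep("phrase 1.*phrase 2|phrase 2.*phrase 1", x)

# Separate searches, combined afterwards
has1 <- grep("phrase 1", x)
has2 <- grep("phrase 2", x)
all_phrases <- intersect(has1, has2)  # sentences containing both phrases
any_phrase  <- union(has1, has2)      # sentences containing at least one
```

With more than two phrases the explicit either-order pattern grows quickly, which is why the separate searches tend to be easier to maintain.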
Full examples (e.g. with, you know, data ...) are always good.
The main key for regexps in R is to remember that there are three (!!) different engines. I tend to like the Perl regexps.
Next, it is important to remember that there are metacharacters, so if you want to match literal parentheses, you need to escape them.
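A small sketch of the difference, using an invented two-element vector (the data and variable names are only for illustration):

```r
# Hypothetical data, invented for illustration
txt <- c("The (grey) fox jumped", "The grey cat slept")

# Unescaped parentheses form a group, so this just searches for "grey"
grouped <- grep("(grey)", txt)

# Escaped parentheses match a literal "(" and ")". The backslash is doubled
# because the R string literal consumes one level of escaping.
escaped <- grep("\\(grey\\)", txt)

# fixed = TRUE treats the whole pattern as a literal string, no escaping needed
literal <- grep("(grey)", txt, fixed = TRUE)
```

Here grouped picks up both sentences, while escaped and literal only pick up the one that really contains "(grey)".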
With that, here is an example:
> txt <- c("The grey fox jumped", "The blue cat slept", "The sky was falling")
> grep("blue", txt)                      # finds sentence two
[1] 2
> grep("(grey|blue)", txt, perl=TRUE)    # finds one and two
[1] 1 2
> grep("(red|blue)", txt, perl=TRUE)     # finds only two (as it should)
[1] 2
>
So with Perl regexps, you list alternatives inside parentheses, separated by a pipe symbol.
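If the phrases live in a vector, the alternation can be assembled with paste rather than typed by hand. This is a sketch that assumes the phrases contain no regex metacharacters; the phrases vector is a made-up name:

```r
txt <- c("The grey fox jumped", "The blue cat slept", "The sky was falling")
phrases <- c("grey", "blue")   # hypothetical list of search phrases

# Join the phrases with "|" and wrap them in a group: "(grey|blue)"
pattern <- paste0("(", paste(phrases, collapse = "|"), ")")

hits    <- grep(pattern, txt, perl = TRUE)                # indices of matches
matched <- grep(pattern, txt, perl = TRUE, value = TRUE)  # matching sentences
```

This gives the union: any sentence containing at least one of the phrases.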
There's a way to do it with a single regex using lookaheads, though most regex engines will execute it pretty slowly:
> txt <- c("The grey fox jumped", "The blue cat slept", "The fox is grey", "The cat is grey")
> grep("(?=.*fox)(?=.*grey)", txt, perl=TRUE)
[1] 1 3
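For an arbitrary number of phrases, the lookahead pattern can be generated, and the result checked against the plain one-grepl-per-phrase approach. This is only a sketch: the phrases vector is a made-up name, and the phrases are assumed to contain no regex metacharacters.

```r
txt <- c("The grey fox jumped", "The blue cat slept",
         "The fox is grey", "The cat is grey")
phrases <- c("fox", "grey")   # hypothetical; assumed free of metacharacters

# One lookahead per phrase, glued together: "(?=.*fox)(?=.*grey)"
pattern <- paste0("(?=.*", phrases, ")", collapse = "")
via_lookahead <- grep(pattern, txt, perl = TRUE)

# Same result without lookaheads: one grepl() per phrase, combined with &
via_grepl <- which(Reduce(`&`, lapply(phrases, grepl, x = txt, fixed = TRUE)))
```

Both give the sentences that contain every phrase, in any order. If the lookahead version turns out to be slow on long sentences, the grepl version is a drop-in alternative.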