Regex - Match an end HTML tag if start tag is not present
I want to match an ending HTML tag like </EM> only if there is no starting <EM> tag anywhere before it, i.e. before any preceding tags or text. My sample string is:
ddd d<STRONG>dfdsdsd dsdsddd<EM>ss</EM>r and</EM>and strong</STRONG>
In this string the output should be </EM>, and specifically the second </EM>, because it lacks a starting <EM>. I have tried a regex for this,
but it doesn't seem to work. Please help, thanks.
I am not sure regex is best suited for this kind of task, since tags can always be nested.
Anyhow, a C# regex like:
would match only the second </EM> tag.
- ?! is a negative lookahead, which explains why both </EM> tags are found. (?!=<EM>.*)xxx actually means "match xxx only if the text at that position does not start with =<EM>.*". A closing tag never starts with =, so the assertion always succeeds. I am not sure you wanted to include an = in there.
- ?<! is a negative lookbehind, which is better suited to what you want to do, but it would not work with the Java regex engine, since this lookbehind does not have an obvious maximum length.
However, with the .NET regex engine, as tested on RETester, it does work.
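The difference between the two assertions can be checked in a few lines. This is only a sketch, and it uses Python's re module instead of C# or Java (an assumption made so that it is easy to run). Python's re also rejects a lookbehind without a fixed width. The lookbehind pattern below is a hypothetical stand-in, not the regex from this answer.

```python
import re

sample = "ddd d<STRONG>dfdsdsd dsdsddd<EM>ss</EM>r and</EM>and strong</STRONG>"

# Negative lookahead: "(?!=<EM>.*)" only asserts that the text starting at the
# current position is not "=<EM>...". A closing tag starts with "<", never "=",
# so the assertion holds at every </EM> and both tags are matched.
lookahead_hits = re.findall(r"(?!=<EM>.*)</EM>", sample)
print(lookahead_hits)


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Hypothetical unbounded lookbehind: Python's re requires a fixed-width
# lookbehind, so this pattern is rejected at compile time.
print(compiles(r"(?<!<EM>.*)</EM>"))

# A fixed-width lookbehind is accepted, but it can only inspect the
# characters immediately before the closing tag.
print(compiles(r"(?<!<EM>)</EM>"))
```

An engine that accepts variable-length lookbehind, such as .NET, is needed for the lookbehind approach to look arbitrarily far back in the string.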
You need a pushdown automaton here. Regular expressions aren't powerful enough to capture this concept, since they are equivalent to finite-state automata, so a regex solution is strictly speaking a no-go.
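For a single tag name, the stack of a pushdown automaton degenerates into a depth counter. A minimal sketch in Python (my choice of language, not the asker's), where the regex is used only to tokenize the <EM> and </EM> tags; the function name unmatched_closing_em is hypothetical:

```python
import re

# Tokenizer only: finds <EM ...> and </EM> tags, case-insensitively.
# The \b keeps it from matching longer names such as <EMBED>.
EM_TAG = re.compile(r"<(/?)EM\b[^>]*>", re.IGNORECASE)


def unmatched_closing_em(text):
    """Return the start offsets of </EM> tags with no open <EM> before them."""
    depth = 0        # number of <EM> tags currently open
    unmatched = []
    for match in EM_TAG.finditer(text):
        if match.group(1):           # "/" captured, so this is a closing tag
            if depth == 0:
                unmatched.append(match.start())
            else:
                depth -= 1
        else:                        # opening tag
            depth += 1
    return unmatched


sample = "ddd d<STRONG>dfdsdsd dsdsddd<EM>ss</EM>r and</EM>and strong</STRONG>"
offsets = unmatched_closing_em(sample)
print(offsets)
```

On the sample string this reports only the second </EM>, since the first one closes the preceding <EM>. Tracking several tag names at once would need a real stack instead of a counter.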
That said, .NET regular expressions do have a pushdown automaton behind them so they can theoretically cope with such cases. If you really feel you need to do this with regular expressions rather than a formal HTML parser, take a glimpse here.
You should see the top answer to this other Stack Overflow question, because it gives the perfect answer. In short, don't use regular expressions to try to parse HTML - it's a really bad idea.
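As an illustration of the parser route, here is a rough sketch using Python's standard html.parser module (an assumption; the question does not say which language or parser is available). The class name StrayEndTagFinder is hypothetical. The parser does the tokenizing, and the handlers only keep a depth count for the tag of interest.

```python
from html.parser import HTMLParser


class StrayEndTagFinder(HTMLParser):
    """Collect positions of end tags that have no matching start tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()   # HTMLParser reports tag names in lowercase
        self.depth = 0           # currently open start tags of that name
        self.stray = []          # (line, column) of each unmatched end tag

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag != self.tag:
            return
        if self.depth == 0:
            self.stray.append(self.getpos())
        else:
            self.depth -= 1


sample = "ddd d<STRONG>dfdsdsd dsdsddd<EM>ss</EM>r and</EM>and strong</STRONG>"
finder = StrayEndTagFinder("em")
finder.feed(sample)
finder.close()
print(finder.stray)
```

Unlike a hand-written regex, the parser already copes with attributes, mixed case, and comments, so the matching logic stays small.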