.NET Regular Expressions in Infinite Cycle

I'm using .NET Regular Expressions to strip HTML code.

Using something like:

<title>(?<Title>[\w\W]+?)</title>[\w\W]+?<div class="article">(?<Text>[\w\W]+?)</div>

This works 99% of the time, but sometimes, when calling...

Regex.IsMatch(HTML, Pattern)

The call just blocks, and execution stays on this line of code for several minutes or indefinitely.

What's going on?


Your regex will work just fine when your HTML string actually contains HTML that fits the pattern. But when your HTML does not fit the pattern, e.g. if the last tag is missing, your regex will exhibit what I call "catastrophic backtracking": before it can report failure, the engine has to try every possible way of splitting the text among the three lazy quantifiers. The article on that topic has a section titled "Quickly Matching a Complete HTML File" that describes your problem exactly. Also, [\w\W]+? is a complicated way of saying .+? with RegexOptions.Singleline.
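As a rough sketch of one way to limit the backtracking, each part of the pattern can be wrapped in an atomic group, (?>...), so that once a part has matched the engine will not go back into it to try other positions. The class and method names here (ArticlePattern, Find) are made up for illustration, and the pattern assumes the same tags as in the question:

```csharp
using System;
using System.Text.RegularExpressions;

static class ArticlePattern
{
    // Each (?>...) group is atomic: once it has matched, the engine will not
    // backtrack into it to try a later </title>, <div class="article"> or </div>.
    private const string Pattern =
        @"<title>(?>(?<Title>.*?)</title>)" +
        @"(?>.*?<div class=""article"">)" +
        @"(?>(?<Text>.*?)</div>)";

    // Singleline makes '.' match newlines as well, which replaces [\w\W].
    private static readonly Regex ArticleRegex =
        new Regex(Pattern, RegexOptions.Singleline);

    // Hypothetical helper: returns the Match so the caller can check Success
    // and read Groups["Title"] and Groups["Text"].
    public static Match Find(string html) => ArticleRegex.Match(html);
}
```

When the input does fit, this should capture the same Title and Text as the original pattern; when the closing </div> is missing, each attempt fails after a single pass over the remaining text instead of retrying every combination.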

With some effort, you can make regex work on HTML. However, have you looked at the HTML Agility Pack? It makes it much easier to work with HTML as a DOM, with support for XPath-type queries etc. (e.g. "//div[@class='article']").
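A minimal sketch of what that could look like, assuming the HtmlAgilityPack NuGet package is referenced (it is a third-party library, not part of .NET). ArticleExtractor and GetArticleText are hypothetical names; the XPath is the one mentioned above:

```csharp
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

static class ArticleExtractor
{
    // Returns the text of the first <div class="article">, or null if the
    // document contains no such element.
    public static string GetArticleText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches the XPath.
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[@class='article']");
        return node?.InnerText;
    }
}
```

Unlike the regex, this copes with nested div elements inside the article, and a missing element simply yields null rather than a long-running search.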

You're asking your regex to do a lot there. After every character, it has to look ahead to see if the next bit of text can be matched with the next part of the pattern.
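Because the amount of work is hard to predict for arbitrary input, one possible safety net is a match timeout, which the Regex class supports from .NET Framework 4.5 onwards. This does not make the pattern any faster; it only stops a runaway match. SafeMatcher and TryIsMatch are illustrative names, not part of any library:

```csharp
using System;
using System.Text.RegularExpressions;

static class SafeMatcher
{
    // Returns true or false for a completed match attempt, or null if the
    // attempt was abandoned because it ran longer than the given timeout.
    public static bool? TryIsMatch(string input, string pattern, TimeSpan timeout)
    {
        try
        {
            return Regex.IsMatch(input, pattern, RegexOptions.None, timeout);
        }
        catch (RegexMatchTimeoutException)
        {
            // The engine gave up; the caller decides how to treat this input.
            return null;
        }
    }
}
```

A call such as SafeMatcher.TryIsMatch(HTML, Pattern, TimeSpan.FromSeconds(2)) would then return null for the problem pages instead of blocking indefinitely.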

Regex is a pattern-matching tool. Whilst you can use it for simple parsing, you'd be better off using a dedicated parser (such as the HTML Agility Pack, as mentioned by Marc).
