Why is this regex returning errors when I use it to fish img src's from HTML?

I'm writing a function that fishes out the src from the first image tag it finds in an HTML file. Following the instructions in another thread here, I got something that seemed to be working:

preg_match_all('#<img[^>]*>#i', $content, $match); 

foreach ($match as $value) {
    $img = $value[0];
}

$stuff = simplexml_load_string($img);
$stuff = $stuff[src];
return $stuff;

But after a few minutes of using the function, it started returning errors like this:

warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 1: parser error : Premature end of data in tag img line 1 in path/to/script on line 42.

and

warning: simplexml_load_string() [function.simplexml-load-string]: tp://feeds.feedburner.com/~f/ChicagobusinesscomBreakingNews?i=KiStN" border="0"> in path/to/script on line 42.

I'm kind of new to PHP but it seems like my regex is chopping up the HTML incorrectly. How can I make it more "airtight"?

Answers


These two lines of PHP code should give you a list of all the values of the src attribute in all img tags in an HTML file:

preg_match_all('/<img\s+[^<>]*src=["\']?([^"\'<>\s]+)["\']?/i', $content, $result, PREG_PATTERN_ORDER);
$result = $result[1];

To keep the regex simple, I'm not allowing file names to have spaces in them. If you want to allow this, you need to use separate alternatives for quoted attribute values (which can have spaces), and unquoted attribute values (which can't have spaces).
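A rough sketch of what those separate alternatives could look like, assuming PCRE's branch-reset group `(?|...)` so that the value always lands in capture group 1. The function name `extractImgSrcs` is made up for illustration, and this is still a regex over HTML, so it will not cope with every malformed tag:

```php
<?php
// Hypothetical helper: collects src values from all <img> tags.
// Three alternatives: "double quoted", 'single quoted', and unquoted.
function extractImgSrcs(string $content): array
{
    $pattern = '/<img\b[^<>]*?\ssrc\s*=\s*(?|"([^"<>]*)"|\'([^\'<>]*)\'|([^"\'<>\s]+))/i';

    // The branch reset (?|...) numbers the group in every alternative as 1,
    // so $result[1] holds the value regardless of the quoting style.
    if (!preg_match_all($pattern, $content, $result, PREG_PATTERN_ORDER)) {
        return [];
    }
    return $result[1];
}
```

Quoted values may now contain spaces, while unquoted values still stop at the first whitespace character.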


Most likely it's because the "XML" picked up by the regex isn't well-formed XML. I would probably use a more complicated regex that pulls out the src attribute directly, instead of using SimpleXML to get it. This regex might be close to what you need:

<img[^>]*\ssrc\s*=\s*['"]?([^'">\s]+)['"]?[^>]*>

You could also use a real HTML parsing library, but I'm not sure which options exist in PHP.
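One parser that does ship with PHP is the DOM extension, whose `DOMDocument::loadHTML()` is built to tolerate real-world HTML. A minimal sketch, assuming the DOM extension is enabled (it usually is by default); the function name `firstImgSrc` is hypothetical:

```php
<?php
// Hypothetical helper: returns the src of the first <img>, or null if there is none.
function firstImgSrc(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them,
    // since most HTML in the wild is not strictly valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $images = $doc->getElementsByTagName('img');
    if ($images->length === 0) {
        return null;
    }

    $src = $images->item(0)->getAttribute('src');
    return $src === '' ? null : $src;
}
```

Because the parser handles unclosed tags and unquoted attributes itself, no regex is needed to cut the tag out of the document first.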


An ampersand by itself in an attribute is invalid XML (it should be encoded as “&amp;”), but some people still write it that way in URLs on HTML pages (and all browsers accept it). That may be your problem.

If that is the case, you can sanitize your string before parsing it, replacing “&(?!amp;)” with “&amp;”.
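As a sketch, that replacement could be done with `preg_replace()` before handing the tag to SimpleXML. The helper name and the sample URL are made up for illustration, and the sample tag is deliberately self-closed, since an unclosed `<img>` is not well-formed XML either:

```php
<?php
// Hypothetical helper: escapes every "&" that does not already start "&amp;".
function escapeBareAmpersands(string $markup): string
{
    // preg_replace() returns null on a regex error; fall back to the input then.
    return preg_replace('/&(?!amp;)/', '&amp;', $markup) ?? $markup;
}

// Sample tag (invented for this example) with one bare and one encoded ampersand.
$tag = '<img src="http://example.com/feed?a=1&b=2&amp;c=3" border="0" />';

$clean = escapeBareAmpersands($tag);
$xml = simplexml_load_string($clean);

// SimpleXML decodes the entities again when the attribute is read as a string.
$src = $xml === false ? null : (string) $xml['src'];
```

Note that this pattern only recognizes “&amp;”, so other entities that are already encoded, such as “&lt;”, would be escaped a second time.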


On a different subject:

foreach ($match as $value) {
    $img = $value[0];
}

can be replaced with

$img = $match[count($match) - 1][0];

To grab only the first image, something like this:

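if (preg_match('#<img\s[^>]*>#i', $content, $match)) {
    $img = $match[0]; // first image in file only
    $stuff = simplexml_load_string($img); // false if the tag is not well-formed XML
    if ($stuff === false) {
        return null; // tag could not be parsed
    }
    return (string) $stuff['src']; // quote the key; a bare src is read as a constant
} else {
    return null; // no match found
}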
