PHP's DOMXPath is stripping out my tags inside the matched text

I asked this question yesterday (Parse HTML with PHP's HTML DOMDocument), and at the time the answer was just what I needed, but while working with some live data I discovered that it wasn't quite doing what I expected.

It gets the data from the HTML page, but it also strips out all the HTML tags inside the captured block of text, which isn't what I want. (I might want to take some of the tags out, but not all, and this can be done later.)

Answers


That's a common problem with DOM: you have to do a bit more work if you want to get the content of a tag together with the content of all its children.

Basically, you have to loop over the child nodes of the one you've matched with your XPath query, to get their contents.

There is a solution proposed in one of the user notes on the manual page of the DOMElement class; see this note.

Integrating this solution into the code you already have should give you something like the following. First, the declaration of the HTML string, with sub-tags:

$html = <<<HTML
<div class="main">
    <div class="text">
        <p>
            Capture this <strong>text</strong> <em>1</em>
        </p>
        <p>
            And some other <strong>text</strong>
        </p>
    </div>
</div>
HTML;

And, to extract the data from that HTML string, you can use something like this:

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

$tags = $xpath->query('//div[@class="main"]/div[@class="text"]');
foreach ($tags as $tag) {
    $innerHTML = '';

    // see http://fr.php.net/manual/en/class.domelement.php#86803
    $children = $tag->childNodes;
    foreach ($children as $child) {
        $tmp_doc = new DOMDocument();
        $tmp_doc->appendChild($tmp_doc->importNode($child, true));
        $innerHTML .= $tmp_doc->saveHTML();
    }

    var_dump(trim($innerHTML));
}

The only thing that has changed is the body of the foreach loop: instead of just using $tag->nodeValue, you have to iterate over the child nodes.

This gives me the following output:

string '<p>
            Capture this <strong>text</strong> <em>1</em>
        </p>


<p>
            And some other <strong>text</strong>
        </p>' (length=150)

This is the full content of the <div> tag that was matched, including all its children and their tags.
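For comparison, here is a minimal sketch of what $tag->nodeValue returns for this kind of markup. The shortened $html string and the $textOnly variable are assumptions made for illustration, not the asker's real data.

```php
<?php
// Hypothetical sample markup, shortened from the example above.
$html = '<div class="main"><div class="text">'
      . '<p>Capture this <strong>text</strong> <em>1</em></p>'
      . '</div></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$tag = $xpath->query('//div[@class="main"]/div[@class="text"]')->item(0);

// nodeValue concatenates the text of all descendant nodes,
// so the <p>, <strong> and <em> tags do not appear in it.
$textOnly = $tag->nodeValue;

var_dump($textOnly);
```

This is why the tags seem to be "stripped": nodeValue only ever holds text, so the markup has to be rebuilt by serializing the child nodes.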

Note: there are often interesting ideas and solutions in the user notes of the manual ;-)


Pascal MARTIN's answer is great, but I found it can be simplified:

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

$tags = $xpath->query('//div[@class="main"]/div[@class="text"]');
foreach ($tags as $tag) {
    $innerHTML = '';

    $children = $tag->childNodes;
    foreach ($children as $child) {
        $innerHTML .= $dom->saveHTML($child);
    }

    var_dump(trim($innerHTML));
}

This way appears to produce the same result, but doesn't require new DOMDocument objects to be created inside the foreach loop. Note that passing a node to saveHTML() requires PHP 5.3.6 or later.
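If you need this in more than one place, the loop can be wrapped in a small function. This is only a sketch: innerHTML() is a hypothetical helper name, not something the DOM extension provides, the sample markup is made up, and the return type declaration assumes PHP 7 or later.

```php
<?php
/**
 * Hypothetical helper (not part of the DOM extension): returns the
 * serialized markup of all children of $element, i.e. its "inner HTML".
 */
function innerHTML(DOMElement $element): string
{
    $html = '';
    foreach ($element->childNodes as $child) {
        // ownerDocument is the DOMDocument the element belongs to.
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Hypothetical sample markup for illustration only.
$dom = new DOMDocument();
$dom->loadHTML('<div class="text"><p>Some <strong>text</strong></p></div>');

$xpath = new DOMXPath($dom);
$results = [];
foreach ($xpath->query('//div[@class="text"]') as $tag) {
    $results[] = trim(innerHTML($tag));
}

var_dump($results);
```

Using $element->ownerDocument means the function does not need the $dom variable passed in separately.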

EDIT:

So, after further experimentation, you can actually reduce the above to the following, although saveHTML($tag) also includes the matched <div> itself in the output:

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

$tags = $xpath->query('//div[@class="main"]/div[@class="text"]');
foreach ($tags as $tag) {
    var_dump(trim($dom->saveHTML($tag)));
}
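One caveat worth checking before switching to the short form: saveHTML($tag) serializes the matched element itself, so the result is the outer HTML, whereas the loop over childNodes yields only the inner HTML. A minimal sketch of the difference, using a shortened hypothetical $html string:

```php
<?php
// Hypothetical sample markup for illustration only.
$html = '<div class="main"><div class="text">'
      . '<p>Some <strong>text</strong></p>'
      . '</div></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$tag = $xpath->query('//div[@class="main"]/div[@class="text"]')->item(0);

// Outer HTML: includes the matched <div class="text"> wrapper.
$outer = $dom->saveHTML($tag);

// Inner HTML: only the children of the matched element.
$inner = '';
foreach ($tag->childNodes as $child) {
    $inner .= $dom->saveHTML($child);
}

var_dump($outer);
var_dump($inner);
```

If the wrapper element is unwanted, the child loop from the previous version is the one to keep.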
