I am having trouble removing certain <tr> elements from HTML retrieved from a remote page. The main problem is that the HTML is invalid or broken. My code works when I test it on valid, well-formed HTML, but it fails on the markup of the remote page. After some experiments, I found that this may be because the remote page's HTML is invalid.

Here is my code:

<?php
    //Get the url
    $url = "http://lsh.streamhunter.eu/static/section0.html";
    $html = file_get_contents($url);
    $doc = new DOMDocument(); // create DOMDocument
    @$doc->loadHTML($html); // load HTML you can add $html
    $xpath = new DOMXpath($doc);
    $elements = $xpath->query("//td[contains(., 'desktop')]"); // search td's that contain 'desktop'

    foreach($elements as $el){
        $parent = $el->parentNode;
        $parent->parentNode->removeChild($parent); // remove TR
        //$parent->removeChild($el); // remove TD
    }

    echo $doc->saveHTML(); // save new HTML
?>

It always gives me a 500 Internal Server Error, although it works when I test it on well-formed HTML.

Is there anything I am missing in the code above? Any suggestions for dealing with this problem?

1 Answer

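The problem is that once you remove a TR, any other matching TD in that row is orphaned, and you are probably getting that error because the parentNode property then refers to a node that no longer exists.

Do this instead: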

$toRemove = array();

// gather a list of TRs to remove
foreach($elements as $el)
  if(!in_array($el->parentNode, $toRemove, true))
    $toRemove[] = $el->parentNode;

// remove them
foreach($toRemove as $tr)
  $tr->parentNode->removeChild($tr);
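
To try the collect-then-remove approach without fetching the remote page, here is a minimal, self-contained sketch. The inline markup is a hypothetical sample (not the real page), chosen so that one row contains two matching TDs, which is the case where removing rows inside the first loop leaves the detached TR with no parent.

```php
<?php
// Hypothetical sample markup standing in for the remote page.
// The first row has two cells that match 'desktop'.
$html = '<table>'
      . '<tr><td>desktop</td><td>desktop app</td></tr>'
      . '<tr><td>mobile</td><td>phone</td></tr>'
      . '</table>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Pass 1: collect each parent TR once, without touching the tree.
$toRemove = array();
foreach ($xpath->query("//td[contains(., 'desktop')]") as $td) {
    $tr = $td->parentNode;
    if (!in_array($tr, $toRemove, true)) {   // strict: compare object identity
        $toRemove[] = $tr;
    }
}

// Pass 2: every collected TR is still attached, so parentNode is valid.
foreach ($toRemove as $tr) {
    $tr->parentNode->removeChild($tr);
}

$remaining = $xpath->query('//tr')->length;  // rows left in the document
echo $doc->saveHTML();
```

Both matching cells resolve to the same TR, so the row is queued and removed only once.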

Also, to suppress the validation warnings, add:

libxml_use_internal_errors(true);

before loading your HTML (and remove the @ operator).
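
As a rough sketch of how that fits together, the snippet below loads deliberately broken markup (a hypothetical sample, not the real page) and reads back what the parser complained about using libxml_get_errors() and libxml_clear_errors(). Which messages appear, if any, depends on the installed libxml2 version.

```php
<?php
// Hypothetical broken markup: unclosed cells and rows, plus an unknown tag.
$html = '<table><tr><td>desktop<tr><td>mobile</table><bogus>';

// Collect parser errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML($html);        // no @ operator needed

$errors = libxml_get_errors();          // array of LibXMLError objects
libxml_clear_errors();                  // empty the internal error buffer
libxml_use_internal_errors($previous);  // restore the previous setting

// Optional: log the problems rather than silently discarding them.
foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}
```

Unlike the @ operator, this hides only libxml's parser messages, so other errors in the script still surface.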

answered 2013-05-12T20:51:30.820