I have a fragment of HTML whose structure is malformed. For example:
<div id='notrequired'>
<div>
<h3>Some examples :-)</h3>
STL is a library, not a framework.
</div>
</p>
</a>
<a target='_blank' href='http://en.wikipedia.org/wiki/Library_%28computing%29'>Read more</a>;
</div>
<a target='_blank' href='http://en.wikipedia.org/wiki/Library_%28computing%29'>Read more</a>";
As you can see, there are unexpected </p> and </a> closing tags.
I tried some code to remove the <div id='notrequired'> element. It works, but the output is not exactly what I need.
Here is the snippet:
function DOMRemove(DOMNode $from) {
    $from->parentNode->removeChild($from);
}

$dom = new DOMDocument();
@$dom->loadHTML($text); //$text contains the above mentioned HTML
$selection = $dom->getElementById('notrequired');
if($selection == NULL){
    $text = $dom->saveXML();
}else{
    $refine = DOMRemove($selection);
    $text = $dom->saveXML($refine);
}
The problem is that $dom->saveXML serializes the content as a complete HTML document:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<a target="_blank" href="http://en.wikipedia.org/wiki/Library_%28computing%29">Read more</a>
</body>
</html>
I only need:
<a target='_blank' href='http://en.wikipedia.org/wiki/Library_%28computing%29'>Read more</a>
without the <HTML> and <BODY> tags.
What am I missing? Is there a better way to do this?
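For reference, here is a minimal sketch of one possible approach, not a definitive fix. Two things seem to be at play in the snippet above: DOMRemove() returns nothing, so $refine is NULL and saveXML(NULL) serializes the whole document, and loadHTML() wraps a fragment in <html> and <body> on its own. The sketch serializes only the children of <body> with saveHTML($node). The helper name bodyInnerHtml is hypothetical (it is not part of the original code), and the sample $text is just the HTML from this post.

```php
<?php
// Hypothetical helper (not in the original post): returns the markup
// found inside <body>, without the doctype/<html>/<body> wrappers.
function bodyInnerHtml(DOMDocument $dom): string {
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $html = '';
    foreach ($body->childNodes as $child) {
        // saveHTML($node) serializes only the given node.
        $html .= $dom->saveHTML($child);
    }
    return $html;
}

// Sample input: the malformed HTML from the post.
$text = "<div id='notrequired'>
<div>
<h3>Some examples :-)</h3>
STL is a library, not a framework.
</div>
</p>
</a>
<a target='_blank' href='http://en.wikipedia.org/wiki/Library_%28computing%29'>Read more</a>;
</div>
<a target='_blank' href='http://en.wikipedia.org/wiki/Library_%28computing%29'>Read more</a>";

$dom = new DOMDocument();
// Collect parser warnings about the stray tags instead of using @.
libxml_use_internal_errors(true);
$dom->loadHTML($text);
libxml_clear_errors();

// Remove the unwanted element if it exists.
$selection = $dom->getElementById('notrequired');
if ($selection !== null) {
    $selection->parentNode->removeChild($selection);
}

// Serialize only what is left inside <body>.
$text = trim(bodyInnerHtml($dom));
echo $text, "\n";
```

With this input, the result should contain only the remaining "Read more" link, with no <html> or <body> wrapper. Note that the serializer normalizes attribute quoting, so the output will likely use double quotes rather than the single quotes of the input.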