6

每个人都知道我们应该始终使用 DOM 技术而不是正则表达式来从 HTML 中提取内容,但我感觉我永远无法信任 SimpleXML 扩展或类似扩展。

我现在正在编写 OpenID 实现,我尝试使用 SimpleXML 进行 HTML 发现 - 但我的第一个测试(使用 alixaxel.myopenid.com)产生了很多错误:

Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 27: parser error : Opening and ending tag mismatch: link line 11 and head in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: </head> in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 64: parser error : Entity 'copy' not defined in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: &copy; 2008 <a href="http://janrain.com/">JanRain, Inc.</a> in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 66: parser error : Entity 'trade' not defined in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: myOpenID&trade; and the myOpenID&trade; website are in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 66: parser error : Entity 'trade' not defined in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: myOpenID&trade; and the myOpenID&trade; website are in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 77: parser error : Opening and ending tag mismatch: link line 8 and html in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: </html> in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 78: parser error : Premature end of data in tag head line 3 in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 78: parser error : Premature end of data in tag html line 2 in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6

我记得有一种方法可以让 SimpleXML 始终解析文件,无论文档是否包含错误 - 我不记得具体的实现,但我认为它涉及使用 DOMDocument。确保 SimpleXML 始终解析任何给定文档的最佳方法是什么?

请不要建议使用 Tidy,我认为该扩展程序很慢,并且在许多系统上都不可用。

4

2 回答 2

11

您可以使用DOM 的 loadHTML加载 HTML,然后将结果导入 SimpleXML。

IIRC,它仍然会在某些东西上窒息,但它会接受现实世界中存在的几乎所有损坏网站的东西。

$html = '<html><head><body><div>stuff & stuff</body></html>';

// disable PHP errors
$old = libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($html);

// restore the old behaviour
libxml_use_internal_errors($old);

$sxe = simplexml_import_dom($dom);
die($sxe->asXML());
于 2010-07-22T19:12:58.830 回答
0

您总是可以尝试使用 SAX 解析器...对错误更稳健一点。

在大型 XML 上可能效率不高。

于 2010-07-22T19:00:03.890 回答