html - JTidy fails to parse HTML - options

Translated from: https://stackoverflow.com/questions/16227009 2013-04-26T00:44:15.370

675 views

I am evaluating several HTML parsers and tried JTidy. Parsing this URL:

http://htmlcleaner.sourceforge.net/doc/org/htmlcleaner/TagNode.html

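produces these errors:

line 1 column 56,258 - Error: missing ">" at end of tag

line 1 column 56,258 - Error: is not recognized!

It reports line 1 because the page is read in as a single line, but this is the line JTidy chokes on: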

      <li>//div[last() >= 4]//./div[position() = last()])[position() > 22]//li[2]//a</li>
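Because the whole page arrives as a single line, the reported column is the only clue to where the failure occurs. The following is a minimal sketch of one way to pull out the text around a reported position. `ColumnContext` and `contextAt` are hypothetical names, not part of JTidy, and the sketch assumes the reported line and column are 1-based character offsets, which may not hold exactly (for example if the parser expands tabs).

```java
class ColumnContext {
    // Hypothetical helper (not part of JTidy): returns up to `radius`
    // characters on each side of a position in the input.
    // Assumption: `line` and `column` are 1-based character offsets.
    static String contextAt(String html, int line, int column, int radius) {
        String[] lines = html.split("\r?\n", -1);
        if (line < 1 || line > lines.length) {
            return ""; // reported line does not exist in this input
        }
        String text = lines[line - 1];
        // Clamp the column so an out-of-range value cannot throw.
        int index = Math.min(Math.max(column - 1, 0), text.length());
        int start = Math.max(0, index - radius);
        int end = Math.min(text.length(), index + radius);
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        // Stand-in input; in practice this would be the downloaded page.
        String html = "<ul><li>//div[last() >= 4]//a</li></ul>";
        System.out.println(contextAt(html, 1, 22, 10));
    }
}
```

For the error above, the call would be `contextAt(html, 1, 56258, 80)` on the downloaded page, which should print the markup surrounding the reported column if the offset assumption holds.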

My code is simple:

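import java.io.ByteArrayInputStream;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

// Assumption: tidy is a default-configured instance; the excerpt does not show how it is created.
Tidy tidy = new Tidy();
// Parse the page from memory; passing null means no tidied output is written.
Document document = tidy.parseDOM(new ByteArrayInputStream(this.getHtml().getBytes()), null);
NodeList anchorTags = document.getElementsByTagName("A");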

Is this just a bug in JTidy, or am I doing something wrong? So far I have evaluated about six other parsers, and none of them had a problem with this page.
