java - Possible to parse a HTML document and build a DOM tree(java)

Question

Is it possible to parse an HTML document, either from a string or from a file, and construct a DOM tree that a developer can walk through some API? If so, what tools could be used?

For example:

DomRoot = parse("myhtml.html");

for (tags : DomRoot) {
}

Note: this is an HTML document, not XHTML.

score 4 · Accepted Answer

You can use TagSoup - it is a SAX-compliant parser that cleans up malformed content, such as the HTML found on typical web pages, into well-formed XML.

This is <B>bold, <I>bold italic, </b>italic, </i>normal text

gets correctly rewritten as:

This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.
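Since TagSoup exposes a SAX interface, its events still have to be collected into a DOM tree. One way to do that, sketched below, is an identity transform from a SAXSource into a DOMResult using only JDK classes. The class and method names (SaxToDom, toDom, jdkReader) are made up for illustration, and the demo uses the JDK's own XMLReader on well-formed input as a stand-in; the assumption is that TagSoup's org.ccil.cowan.tagsoup.Parser, which implements XMLReader, could be passed in its place.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class SaxToDom {
    /** Runs a SAX XMLReader over the markup and collects its events into a DOM Document. */
    static Document toDom(XMLReader reader, String markup) {
        try {
            // A Transformer created without a stylesheet copies its input unchanged.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            DOMResult result = new DOMResult();
            InputSource input = new InputSource(new StringReader(markup));
            identity.transform(new SAXSource(reader, input), result);
            return (Document) result.getNode();
        } catch (Exception e) {
            throw new IllegalStateException("could not build a DOM tree", e);
        }
    }

    /** Stand-in reader from the JDK; it only accepts well-formed XML. */
    static XMLReader jdkReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("no SAX parser available", e);
        }
    }

    public static void main(String[] args) {
        // Assumption: with TagSoup on the classpath, replace jdkReader() with
        // new org.ccil.cowan.tagsoup.Parser() to accept malformed HTML.
        Document doc = toDom(jdkReader(), "<p>This is <b>bold</b> text</p>");
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

Keeping the reader as a parameter means the DOM-building step does not care which SAX parser produced the events.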

score 2 · Accepted Answer

JTidy should let you do what you want.

Usage is fairly straightforward, and the parsing is configurable. For example:

InputStream in = ...;
Tidy tidy = new Tidy();
// configure Tidy instance as required
...
...
Document doc = tidy.parseDOM(in, null);
Element root = doc.getDocumentElement();

The JavaDoc is hosted here.
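Once a root Element is in hand, something close to the loop in the question can be written with the standard org.w3c.dom API. The sketch below is only an illustration: AllElements, elementsOf and parseRoot are hypothetical names, and the demo root comes from the JDK parser on well-formed input rather than from JTidy.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AllElements {
    /** Returns the root followed by all of its descendant elements, in document order. */
    static List<Element> elementsOf(Element root) {
        List<Element> out = new ArrayList<>();
        out.add(root);
        // "*" matches every descendant element; the root itself is not included.
        NodeList descendants = root.getElementsByTagName("*");
        for (int i = 0; i < descendants.getLength(); i++) {
            out.add((Element) descendants.item(i));
        }
        return out;
    }

    /** Demo helper only: parses well-formed markup with the JDK parser. */
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        Element root = parseRoot("<html><body><p>hi <b>there</b></p></body></html>");
        for (Element tag : elementsOf(root)) {
            System.out.println(tag.getTagName());
        }
    }
}
```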

score 1 · Accepted Answer

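You could look at NekoHTML, a Java library that makes a best-effort attempt at cleaning up and tag-balancing your document. It is an easy way to parse malformed HTML (or invalid XML) files.

It is distributed under the Apache 2.0 license.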

score 0 · Accepted Answer

HTML Parser appears to support conversion from HTML to XML. You can then build the DOM tree with the usual Java toolchain.
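As a rough sketch of that second step, assuming the converter has already produced well-formed XML as a string, the JDK's DocumentBuilder can build the tree and a short recursive method can walk it. The names DomWalk, parse, walk and outline are invented for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomWalk {
    /** Parses well-formed XML (for example the output of an HTML-to-XML converter). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    /** Depth-first walk: one line per element or non-blank text node, indented by depth. */
    static void walk(Node node, int depth, StringBuilder out) {
        String label = null;
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            label = node.getNodeName();
        } else if (node.getNodeType() == Node.TEXT_NODE
                && !node.getNodeValue().trim().isEmpty()) {
            label = "\"" + node.getNodeValue().trim() + "\"";
        }
        if (label != null) {
            for (int i = 0; i < depth; i++) {
                out.append("  ");
            }
            out.append(label).append('\n');
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, depth + 1, out);
        }
    }

    /** Convenience wrapper: parse, then walk from the root element. */
    static String outline(String xml) {
        StringBuilder out = new StringBuilder();
        walk(parse(xml).getDocumentElement(), 0, out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(outline("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```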

score 0 · Accepted Answer

There are several open-source tools for parsing HTML from Java.

Check http://java-source.net/open-source/html-parsers

You can also look at the answers to this question: Reading HTML file to DOM tree using Java. It is almost the same...
