html-parsing - How do I convert a Jsoup Document to a W3C Document?

Question

I built a Jsoup Document by parsing an internal HTML page:

public Document newDocument(String path) throws IOException {
    Document doc = Jsoup.connect(path).timeout(0).get();
    return new HtmlDocument<Document>(doc);
}

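I want to convert the Jsoup Document to an org.w3c.dom.Document. I used an available library, DOMBuilder, for this, but when parsing I get null for the org.w3c.dom.Document. I can't work out what the problem is; I searched but couldn't find any answer.

Code that generates the W3C DOM document: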

Document jsoupDoc = factory.newDocument("http:localhost/testcases/test_2.html");
org.w3c.dom.Document docu = DOMBuilder.jsoup2DOM(jsoupDoc);

Can anyone help me solve this?

score 20 · Accepted Answer

Alternatively, Jsoup provides the W3CDom class, which has a fromJsoup method. This method converts a Jsoup Document into a W3C Document.

Document jsoupDoc = ...
W3CDom w3cDom = new W3CDom();
org.w3c.dom.Document w3cDoc = w3cDom.fromJsoup(jsoupDoc);

Update:

As of 1.10.3, W3CDom is no longer experimental.
Up to and including Jsoup 1.10.2, the W3CDom class was still experimental.
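
The conversion described in this answer can be sketched end to end. This is a minimal sketch, assuming jsoup is on the classpath; the class name JsoupToW3cDemo, the helper convert, and the inline HTML string are made up for illustration, and the types are fully qualified because both libraries define a class named Document.

```java
import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;

public class JsoupToW3cDemo {

    // Hypothetical helper: parse an HTML string with jsoup, then convert it to a W3C DOM.
    static org.w3c.dom.Document convert(String html) {
        org.jsoup.nodes.Document jsoupDoc = Jsoup.parse(html);
        return new W3CDom().fromJsoup(jsoupDoc);
    }

    public static void main(String[] args) {
        // Illustrative input; a document fetched with Jsoup.connect(...).get() works the same way.
        org.w3c.dom.Document w3cDoc =
                convert("<html><head><title>Test</title></head><body><p>Hello</p></body></html>");

        // Walk the converted document with the standard org.w3c.dom API.
        org.w3c.dom.Element root = w3cDoc.getDocumentElement();
        System.out.println(root.getNodeName());
        System.out.println(w3cDoc.getElementsByTagName("p").item(0).getTextContent());
    }
}
```

Using the fully qualified names avoids an import clash between org.jsoup.nodes.Document and org.w3c.dom.Document.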

score 6 · Accepted Answer

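To retrieve a jsoup document over HTTP, call Jsoup.connect(...).get(). To load a jsoup document from a local file, call Jsoup.parse(new File("..."), "UTF-8").

Your call to DOMBuilder is correct.

When you say,

I used an available library, DOMBuilder, for this, but when parsing I get null for the org.w3c.dom.Document.

I think you mean, "I used an available library, DOMBuilder, for this, but when printing the result I get [#document: null]." At least, that is what I see when I try to print the w3cDoc object, but it does not mean the object is null. I was able to traverse the document by calling getDocumentElement and getChildNodes.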

public static void main(String[] args) {
    // This Document is org.jsoup.nodes.Document
    Document jsoupDoc = null;

    try {
        jsoupDoc = Jsoup.connect("http://stackoverflow.com/questions/17802445").get();
    } catch (IOException e) {
        e.printStackTrace();
    }

    // Convert to a W3C DOM and walk it; Element, NodeList and Node are org.w3c.dom types
    org.w3c.dom.Document w3cDoc = DOMBuilder.jsoup2DOM(jsoupDoc);
    Element e = w3cDoc.getDocumentElement();
    NodeList childNodes = e.getChildNodes();
    Node n = childNodes.item(2);
    System.out.println(n.getNodeName());
}
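
The [#document: null] output can be reproduced without jsoup or DOMBuilder. The sketch below uses only the JDK's built-in XML parser; the class name DocumentToStringDemo and the helper buildSample are made up for illustration. With the JDK's default DOM implementation, printing a Document shows its node name (#document) and its node value (always null for a document node), which says nothing about whether the document has content.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DocumentToStringDemo {

    // Hypothetical helper: build a small, non-empty W3C document by hand.
    static Document buildSample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            html.appendChild(doc.createElement("head"));
            html.appendChild(doc.createElement("body"));
            doc.appendChild(html);
            return doc;
        } catch (ParserConfigurationException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        Document doc = buildSample();

        // toString() reports node name and node value, not the document's content.
        System.out.println(doc);

        // The reference itself is not null, and the tree can be traversed.
        System.out.println(doc != null);
        System.out.println(doc.getDocumentElement().getNodeName());
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}
```

To check whether a conversion really failed, compare the reference against null or inspect getDocumentElement() rather than relying on the printed form.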
