java - Ignoring org.xml.sax.SAXParseExceptions when converting an XML string to org.w3c.dom.Document?

Question

I have many HTML pages (that is, their source code) held as a java.util.List of Strings in Java. I need to convert each of them to a Document object (from the org.w3c.dom package).

I do this with DocumentBuilderFactory and Document, like so:

public static org.w3c.dom.Document inputStream2Document(InputStream inputStream) throws IOException, SAXException, ParserConfigurationException {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    dbf.setValidating(false);
    org.w3c.dom.Document parse = dbf.newDocumentBuilder().parse(inputStream);
    return parse;
}

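Some pages convert correctly, but others contain badly written attributes that make them invalid (an attribute with no ="" part), so the markup looks like

<a href="somepage.html" someattr>

(the badly written attribute here is "someattr"). In those cases I get exceptions such as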

Nested exception: org.xml.sax.SAXParseException; lineNumber: 7558; columnNumber: 71; Element type "a" must be followed by either attribute specifications, ">" or "/>".

or

Nested exception: org.xml.sax.SAXParseException; lineNumber: 109; columnNumber: 32; The string "--" is not permitted within comments.

Is there a way to tell DocumentBuilderFactory to ignore these exceptions? I want to convert these pages to Documents as well, and I don't mind that they are invalid.

score 1 · Accepted Answer

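<a href="somepage.html" someattr> is not XML, so an XML parser will never be able to parse it. It does look like reasonable HTML, though, so you could try an HTML parser such as NekoHTML instead of an XML parser. The usage page for NekoHTML has good examples showing how to parse both complete documents and HTML fragments into DOM nodes.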

import org.cyberneko.html.parsers.DOMParser;
import org.xml.sax.InputSource;
import org.w3c.dom.Document;
import java.io.StringReader;

DOMParser parser = new DOMParser();
InputSource in = new InputSource(new StringReader(theHtmlString));
parser.parse(in);
Document doc = parser.getDocument();

score 0 · Accepted Answer

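An XML parser can only parse well-formed XML (or, equally, XHTML). The pages that produce the errors are not well-formed, that is, they are not XML, so an XML parser is simply the wrong tool.

However, if the only problem is attributes without values like this one, you could try preprocessing the input to remove those attributes with a regular expression.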
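
A minimal sketch of that preprocessing step, using only the JDK. The class name BareAttributeStripper, its method names, and both regular expressions are made up for illustration and are not part of any library. It assumes that attribute values do not contain unbalanced quotes, and it does nothing about other well-formedness errors such as "--" inside comments, so treat it as a starting point rather than a general HTML fixer.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: removes attributes that have no "=value" part from start tags.
public class BareAttributeStripper {
    // A start tag: "<", a letter, then quoted strings or any characters except quotes and ">".
    private static final Pattern START_TAG =
            Pattern.compile("<[A-Za-z](?:\"[^\"]*\"|'[^']*'|[^\"'>])*>");

    // One attribute inside a start tag; group 1 is the optional "=value" part.
    private static final Pattern ATTRIBUTE =
            Pattern.compile("\\s+[^\\s=\"'/<>]+(\\s*=\\s*(?:\"[^\"]*\"|'[^']*'|[^\\s\"'>]+))?");

    // Rewrites every start tag in the input, leaving all other text untouched.
    public static String stripBareAttributes(String html) {
        Matcher tag = START_TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (tag.find()) {
            tag.appendReplacement(out, Matcher.quoteReplacement(cleanTag(tag.group())));
        }
        tag.appendTail(out);
        return out.toString();
    }

    // Drops attributes without a value; attributes that have a value are copied unchanged.
    private static String cleanTag(String tagText) {
        Matcher attr = ATTRIBUTE.matcher(tagText);
        StringBuffer out = new StringBuffer();
        while (attr.find()) {
            String replacement = attr.group(1) == null ? "" : attr.group();
            attr.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        attr.appendTail(out);
        return out.toString();
    }

    // Parses a string with the standard XML parser; checked exceptions are wrapped for brevity.
    public static Document parseXml(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setValidating(false);
            return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is still not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<a href=\"somepage.html\" someattr>link</a>";
        String cleaned = stripBareAttributes(page);
        System.out.println(cleaned);
        Document doc = parseXml(cleaned);
        System.out.println(doc.getDocumentElement().getAttribute("href"));
    }
}
```

Note that this removes the valueless attribute entirely, as the answer suggests. If the attribute needs to survive in the resulting Document, the replacement could instead append ="" to it. Pages that are broken in other ways will still fail in parseXml, which is where an HTML parser is the more robust choice.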
