java - JDOM 1.1: hyphen is not a valid comment character

Question

I am using TagSoup to clean up some HTML that I scrape from the internet, and I get the following error when parsing pages that contain comments:

The data "- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - " is not legal for a JDOM comment: Comment data cannot start with a hyphen.

I am using JDOM 1.1, and this is the code that does the actual cleanup:

    SAXBuilder builder = new org.jdom.input.SAXBuilder("org.ccil.cowan.tagsoup.Parser"); // build
    // Don't check the doctype! At our usage rate, we'll get 503 responses
    // from the w3.
    builder.setEntityResolver(dummyEntityResolver);
    Reader in = new StringReader(str);
    org.jdom.Document doc = builder.build(in);
    String cleanXmlDoc = new org.jdom.output.XMLOutputter().outputString(doc);

Any idea what is going wrong, or how to fix it? I need to be able to parse pages that contain long comment strings.

score 1 · Accepted Answer

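An XML/HTML/SGML comment starts with "--", ends with "--", and does not contain "--". A comment declaration contains zero or more comments.

Your example string can be reformatted as: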

    <!----
      ----
      - data
      ----
      ----
      ---->

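As you can see, "- data" is not a valid comment, so the document is not valid HTML. In your particular case, you can fix it by replacing every match of the regular expression /<?!--.*?-->/ with an empty string, but be aware that this change may also break some valid documents.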
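A minimal sketch of that workaround in Java, run on the raw string before it is handed to SAXBuilder. The class and method names (CommentStripper, stripComments) are hypothetical. The pattern is a simplified reading of the regex above, and the DOTALL flag is an added assumption so that comments spanning several lines are removed as well.

```java
import java.util.regex.Pattern;

public class CommentStripper {
    // Hypothetical helper, not part of JDOM or TagSoup.
    // DOTALL lets '.' match line breaks, so multi-line comments are covered.
    // The reluctant quantifier '.*?' stops at the first "-->" it finds.
    private static final Pattern COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    /** Returns the input with every "<!-- ... -->" span removed. */
    public static String stripComments(String html) {
        return COMMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>before</p><!-- - - - - - - --><p>after</p>";
        // prints "<p>before</p><p>after</p>"
        System.out.println(stripComments(html));
    }
}
```

In the question's code this would amount to passing stripComments(str) to the StringReader instead of str. As the answer warns, a textual replacement like this is blunt: it also deletes comment-like text inside scripts and any conditional comments the page relies on.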
