I'm using TagSoup to clean up some HTML that I scrape from the internet, and I get the following error when parsing pages that contain comments:
The data "- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - " is not legal for a JDOM comment: Comment data cannot start with a hyphen.
I'm using JDOM 1.1, and this is the code that does the actual cleaning:
// Use TagSoup as the underlying SAX parser so that malformed HTML can be read.
SAXBuilder builder = new org.jdom.input.SAXBuilder("org.ccil.cowan.tagsoup.Parser");
// Don't check the doctype! At our usage rate, we'll get 503 responses
// from the w3.
builder.setEntityResolver(dummyEntityResolver);
Reader in = new StringReader(str);
org.jdom.Document doc = builder.build(in);
String cleanXmlDoc = new org.jdom.output.XMLOutputter().outputString(doc);
Any idea what is going wrong, or how to fix it? I need to be able to parse pages that have long comment strings such as <!--------- data ------------>
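One workaround I'm considering is to normalize the comments in the raw string before it ever reaches the parser, so that no comment body starts or ends with a hyphen or contains "--" (XML does not allow "--" inside a comment, and the JDOM error above rejects a leading hyphen). The sketch below is only an assumption about what might help: `CommentSanitizer` and `sanitizeComments` are hypothetical names of my own, not part of TagSoup or JDOM, and a regex pass like this would also touch any `<!-- ... -->` that appears inside script or style blocks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of TagSoup or JDOM): rewrites HTML comments
// so their bodies contain no "--" and no leading or trailing hyphens.
final class CommentSanitizer {

    // Lazily match everything between "<!--" and the first following "-->".
    private static final Pattern COMMENT =
            Pattern.compile("<!--(.*?)-->", Pattern.DOTALL);

    private CommentSanitizer() {
    }

    static String sanitizeComments(String html) {
        Matcher m = COMMENT.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1)
                    .replaceAll("-{2,}", " ")    // collapse runs of hyphens
                    .replaceAll("^[\\s-]+", "")  // drop leading hyphens/whitespace
                    .replaceAll("[\\s-]+$", ""); // drop trailing hyphens/whitespace
            // quoteReplacement keeps '$' and '\' in the body from being
            // interpreted as group references.
            m.appendReplacement(out,
                    Matcher.quoteReplacement("<!-- " + body + " -->"));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

If this approach is reasonable, I would presumably call it right before building the reader, for example `Reader in = new StringReader(CommentSanitizer.sanitizeComments(str));`. I have not verified how TagSoup itself reports the cleaned comments to JDOM.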