c++ - Xerces-C: Parsing JavaScript in HTML

Question

I want to parse the meta tags of a website. For this I am using Xerces-C.

shared_ptr<SAX2XMLReader> parser(XMLReaderFactory::createXMLReader());

//Create and set callback handler with the given callback functions
Handler handler(startElement,endElement,characters);
parser->setContentHandler(&handler);
parser->setErrorHandler(&handler);

//Parse the file with the given callback handler
parser->parse(filename.c_str());

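Some websites contain JavaScript. Inside the script tags, the JavaScript uses the operator && for logical AND.

Xerces-C interprets this as an entity reference and throws an exception, because it does not know an entity reference named &&.

Is there a way to have it read correctly as plain text?

Or, if not, is there a way to ignore all characters inside script tags? I don't need them anyway. I only want to parse the meta tags.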

score 2 · Accepted Answer

Basically, HTML is not necessarily well-formed XML, but you can, for example, preprocess it with tidy before feeding it to the XML parser.
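
If only the script contents are in the way, a lighter pre-processing step than tidy may be enough. The following is a minimal sketch, not the tidy approach itself: it removes every script element from the HTML text before the text is given to the parser. The helper names toLowerAscii and stripScriptElements are hypothetical and not part of Xerces-C. The scan is a naive text search, so it assumes that each script element has a closing tag and that the literal text "</script" does not occur inside the script body.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not part of Xerces-C): lower-cased copy of the input,
// used so that tag names can be matched case-insensitively.
static std::string toLowerAscii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Hypothetical helper: returns a copy of `html` with every
// <script ...>...</script> element removed.
// This is a naive text scan, not an HTML parser. An unterminated
// script element causes the rest of the input to be dropped.
std::string stripScriptElements(const std::string& html) {
    const std::string lower = toLowerAscii(html);  // same offsets as html
    const std::string openTag = "<script";
    const std::string closeTag = "</script";

    std::string out;
    std::size_t pos = 0;
    while (pos < html.size()) {
        const std::size_t open = lower.find(openTag, pos);
        if (open == std::string::npos) break;

        // Require a delimiter after the tag name so that a tag such as
        // <scripted> is not mistaken for a script element.
        const std::size_t afterName = open + openTag.size();
        const bool isScriptTag =
            afterName < lower.size() &&
            (lower[afterName] == '>' || lower[afterName] == '/' ||
             std::isspace(static_cast<unsigned char>(lower[afterName])));
        if (!isScriptTag) {
            out.append(html, pos, afterName - pos);
            pos = afterName;
            continue;
        }

        // Keep everything before the script element, then skip the element.
        out.append(html, pos, open - pos);
        const std::size_t close = lower.find(closeTag, afterName);
        if (close == std::string::npos) {
            pos = html.size();
            break;
        }
        const std::size_t closeEnd = lower.find('>', close);
        pos = (closeEnd == std::string::npos) ? html.size() : closeEnd + 1;
    }
    out.append(html, pos, std::string::npos);  // remaining tail, if any
    return out;
}
```

The cleaned string could then be handed to the parser from memory, for instance by wrapping it in a Xerces-C MemBufInputSource and passing that to parse() instead of a file name. Note that this only removes the && problem inside scripts. Other HTML constructs that are not well-formed XML (unclosed tags, unquoted attributes, undeclared entities) would still make the parser fail, which is why running the page through tidy first remains the more robust option.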
