html - How to find the offending line when using XmlSlurper

Question

I am using XmlSlurper to parse a dirty HTML page, and I get the following error:

ERROR org.xml.sax.SAXParseException: Element type "scr" must be followed by either attribute specifications, ">" or "/>".
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        ...
[Fatal Error] :1157:22: Element type "scr" must be followed by either attribute specifications, ">" or "/>".

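Now, I have the HTML that I feed to the parser, and I print it before doing so. If I open that output and go to line 1157, the line mentioned in the error, there is no "scr" there (although the file contains hundreds of such strings). So my guess is that something extra gets inserted (probably <script> or something like that) and shifts the line numbers.

Is there a good way to find the exact offending line or HTML fragment?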

score 0 · Accepted Answer

Which SAXParser are you using? HTML is not strict XML, so using XmlSlurper with the default parser will probably keep producing errors.
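To illustrate why a strict XML parser rejects real-world HTML, here is a minimal Java sketch that uses only the JDK's built-in SAX parser, which should behave like the default parser behind XmlSlurper. The sample page is hypothetical: it assumes the page contains inline JavaScript that writes a split script tag, which is one common way to end up with the "scr" message, but the asker's page may differ. The class, field, and method names are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class StrictParseDemo {

    // Hypothetical fragment: to an XML parser, the "<scr" inside the script
    // body looks like the start of a new element named "scr".
    static final String PAGE =
        "<html><body><script>document.write(\"<scr\" + \"ipt>\");</script></body></html>";

    /** Returns the parser's fatal-error message, or null if the markup is well-formed XML. */
    static String strictParseError(String markup) {
        try {
            // DefaultHandler rethrows fatal errors as SAXParseException.
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        } catch (Exception e) {
            // Parser configuration or I/O problems are not the point of this sketch.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(strictParseError(PAGE));
    }
}
```

If the page really does contain such a script, no amount of line hunting will make it valid XML, which is the argument for switching to an HTML-tolerant parser.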

A cursory Google search for "Groovy html slurper" led me to "HTML Scraping With Groovy", which points to a SAX parser called TagSoup.

Give it a try and see whether it parses the dirty page.

score 0 · Accepted Answer

You can add an attribute named _srmLineNum to every element and then use that attribute to look up line numbers.

// On Groovy 4 and later XmlSlurper is no longer auto-imported;
// add `import groovy.xml.XmlSlurper` there.
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.ext.Attributes2Impl;
import javax.xml.parsers.ParserConfigurationException;

class MySlurper extends XmlSlurper {
    // Name of the synthetic attribute that carries the line number.
    public static final String LINE_NUM_ATTR = "_srmLineNum"
    Locator locator

    public MySlurper() throws ParserConfigurationException, SAXException {
        super();
    }

    // The SAX parser hands over a Locator that tracks the current position.
    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    // Copy the element's attributes and append the current line number.
    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) throws SAXException {
        Attributes2Impl newAttrs = new Attributes2Impl(attrs);
        newAttrs.addAttribute(uri, LINE_NUM_ATTR, LINE_NUM_ATTR, "ENTITY", "" + locator.getLineNumber());
        super.startElement(uri, localName, qName, newAttrs);
    }
}

def text = '''
<root>
  <a>one!</a>
  <a>two!</a>
</root>'''

def root = new MySlurper().parseText(text)

// Print the line number recorded for each <a> element.
root.a.each { println it.@_srmLineNum }

The code above adds the line-number attribute. You might also try setting your own error handler, which can read the line number from the locator.
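As a rough sketch of that idea, the Java snippet below catches the SAXParseException, reads its line and column, and looks the line up in the exact string that was handed to the parser, so the printed text cannot drift from what the parser saw. It uses only JDK SAX classes. The class name, the helper `findOffendingLine`, and the sample page are hypothetical, and the column reported by the parser may point slightly past the offending character.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class OffendingLineFinder {

    // Hypothetical five-line page with a split script tag on line 3.
    static final String PAGE =
        "<html>\n<body>\n<script>document.write(\"<scr\" + \"ipt>\");</script>\n</body>\n</html>";

    /** Returns "line:column: text" for the first fatal error, or null if the markup parses. */
    static String findOffendingLine(String markup) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            // Split on the same line terminators an XML parser recognises.
            String[] lines = markup.split("\r\n|\r|\n", -1);
            int line = e.getLineNumber(); // 1-based, or -1 if unknown
            String text = (line >= 1 && line <= lines.length) ? lines[line - 1] : "";
            return line + ":" + e.getColumnNumber() + ": " + text;
        } catch (Exception e) {
            // Parser configuration or I/O problems are outside this sketch.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(findOffendingLine(PAGE));
    }
}
```

The same try/catch can be wrapped around a `parseText` call in Groovy, since XmlSlurper lets the SAXParseException propagate, as the stack trace in the question shows.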
