java - Parsing HTML with cyberneko to find a 'div' tag

Question

I need a specific 'div' tag (identified by its 'id') from an HTML site. I am using cyberneko to parse the page.

    def doc = new XmlParser( new org.cyberneko.html.parsers.SAXParser() ).parse(htmlFile)
    divTag = doc.depthFirst().DIV.find{ it['@id'] == tagId  }

So far so good, but in the end I don't need XML; I need the raw content of the whole 'div' tag. Unfortunately, I can't figure out how to do that...

score 1 · Accepted Answer

Edit: in response to the first comment.

This works:

    def html = """
      <body>
            <div id="breadcrumbs">
                <b>
                crumb1
                </b>
            </div>
    </body>
    """

    def doc = new XmlSlurper(new org.cyberneko.html.parsers.SAXParser()).parseText(html)
    divTag = doc.BODY.DIV.find { it.@id == 'breadcrumbs'  }
    println "" << new groovy.xml.StreamingMarkupBuilder().bind {xml -> xml.mkp.yield divTag}

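It looks like cyberneko returns a well-formed HTML document regardless of whether the original markup was. That is, the root of the document will be an HTML element, and there will also be a HEAD element. Neat.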
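
To see that structure for yourself, a minimal sketch like the following prints the root element and its direct children. The `@Grab` coordinates and version are an assumption about how the library is pulled in, and the unqualified `XmlSlurper` assumes Groovy 2.x or 3.x as in the snippet above; adjust both to your setup.

```groovy
// Assumption: nekohtml is fetched via Grape; any build tool that puts it on the classpath works too.
@Grab('net.sourceforge.nekohtml:nekohtml:1.9.22')
import org.cyberneko.html.parsers.SAXParser

// Deliberately incomplete markup: no <html> and no <head>.
def html = '<body><div id="breadcrumbs"><b>crumb1</b></div></body>'

// On Groovy 4, XmlSlurper needs an explicit "import groovy.xml.XmlSlurper".
def doc = new XmlSlurper(new SAXParser()).parseText(html)

// Name of the root element cyberneko built around the fragment.
def rootName = doc.name()

// Names of the root's direct children.
def childNames = doc.children().collect { it.name() }

println "root: $rootName"
println "children: $childNames"
```

Because the parsed root is the HTML element, any path into the document has to start from there, which is why the lookup above goes through `doc.BODY.DIV`.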

score 0 · Accepted Answer

Here is a simple test based on Noah's answer - unfortunately it doesn't work (yet) :(

    def html = """
      <body>
            <div id="breadcrumbs">
                <b>
                crumb1
                </b>
            </div>
    </body>
    """

    def doc = new XmlSlurper( new org.cyberneko.html.parsers.SAXParser() ).parseText(html)
    println "document: $doc"
    def htmlTag = doc.DIV.find {
        println "-> $it"
        it['@id'] == "breadcrumbs"
    }
    println htmlTag
    assert htmlTag
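
One likely reason the test finds nothing is that `doc` is the HTML root, so `doc.DIV` only looks at its direct children, while the div sits under BODY. A sketch that avoids hard-coding the path searches depth-first instead. The `@Grab` coordinates are an assumption, the unqualified `XmlSlurper` assumes Groovy 2.x or 3.x, and the result is a re-serialization of the parsed element (with cyberneko's upper-cased element names), not the original source text byte for byte.

```groovy
// Assumption: nekohtml is fetched via Grape; adjust the version to your setup.
@Grab('net.sourceforge.nekohtml:nekohtml:1.9.22')
import org.cyberneko.html.parsers.SAXParser
import groovy.xml.StreamingMarkupBuilder

def html = '''
<body>
    <div id="breadcrumbs">
        <b>crumb1</b>
    </div>
</body>
'''

// On Groovy 4, XmlSlurper needs an explicit "import groovy.xml.XmlSlurper".
def doc = new XmlSlurper(new SAXParser()).parseText(html)

// '**' is GPath shorthand for depthFirst(): visit every element, wherever it is nested.
def divTag = doc.'**'.find { it.name() == 'DIV' && it.@id == 'breadcrumbs' }
assert divTag

// Serialize the matched element, child markup included, back into a string.
def markup = new StreamingMarkupBuilder().bind { xml -> xml.mkp.yield divTag }.toString()
println markup
```

The depth-first search costs a full traversal, but it keeps working if the div moves deeper into the page.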
