html - Extracting part of an HTML document with Groovy

Question

I need to extract part of the HTML from a given HTML page. So far I parse the page using XmlSlurper with TagSoup, and then try to get the part I need using StreamingMarkupBuilder:

import groovy.xml.StreamingMarkupBuilder
def html = "<html><body>a <b>test</b></body></html>"
def dom = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parseText(html)
println new StreamingMarkupBuilder().bindNode(dom.body)

However, the result I get is:

<html:body xmlns:html='http://www.w3.org/1999/xhtml'>a <html:b>test</html:b></html:body>

The content is right, but I would like to get it without the html namespace prefix.

How can I avoid the namespace?

score 7 · Accepted Answer

Turn off the namespaces feature on the TagSoup parser. For example:

import groovy.xml.StreamingMarkupBuilder
def html = "<html><body>a <b>test</b></body></html>"
def parser = new org.ccil.cowan.tagsoup.Parser()
// Disable namespace processing so elements are not bound to the XHTML namespace
parser.setFeature(parser.namespacesFeature, false)
def dom = new XmlSlurper(parser).parseText(html)
println new StreamingMarkupBuilder().bindNode(dom.body)
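
The same idea can be written with the standard SAX feature URI instead of the TagSoup constant, capturing the fragment as a String rather than printing it. This is only a sketch under some assumptions: the `@Grab` coordinates and version for TagSoup are assumed rather than taken from the question, and `XmlSlurper` is left unimported as in the code above (on Groovy 4 you would likely need `import groovy.xml.XmlSlurper`).

```groovy
// Assumption: TagSoup is fetched via Grape; adjust the version to whatever you use.
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import groovy.xml.StreamingMarkupBuilder
import org.ccil.cowan.tagsoup.Parser

def html = "<html><body>a <b>test</b></body></html>"

def parser = new Parser()
// Standard SAX2 feature URI; this is the value behind Parser.namespacesFeature
parser.setFeature('http://xml.org/sax/features/namespaces', false)

def dom = new XmlSlurper(parser).parseText(html)

// bindNode returns a Writable; toString() renders it into a String
def fragment = new StreamingMarkupBuilder().bindNode(dom.body).toString()
println fragment
```

With namespace processing off, the serialized fragment should no longer carry the `html:` prefix or the `xmlns:html` declaration.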
