groovy - Parsing XML with namespaces and entities in Groovy

Question

Parsing XML in Groovy should be a piece of cake, but I keep running into problems.

I want to parse a string like this:

<html>
<p>
This&nbsp;is a <span>test</span> with <b>some</b> formattings.<br />
And this has a <ac:special>special</ac:special> formatting.
</p>
</html>

When I do this the standard way, with new XmlSlurper().parseText(body), the parser complains about the &nbsp; entity. My secret weapon in such cases is TagSoup:

def parser = new org.ccil.cowan.tagsoup.Parser()
def page = new XmlSlurper(parser).parseText(body)

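But now the <ac:special> tag is closed immediately by the parser: the text "special" does not end up inside this tag in the resulting DOM. This happens even when I disable the namespaces feature: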

def parser = new org.ccil.cowan.tagsoup.Parser()
parser.setFeature(parser.namespacesFeature,false)
def page = new XmlSlurper(parser).parseText(body)

Another approach is to use the standard parser and add a DOCTYPE like this one:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

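This seems to work for most of my files, but the parser takes ages to fetch the DTD and process it.

Any good idea how to solve this?

PS: here is some sample code to play with: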

// Recursively serialize a slurped node back into markup.
@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='0.9.7')
def processNode(node) {
    def out = new StringBuilder("")
    // children() returns a mixed list of text (String) and child nodes
    node.children().each {
        if (it instanceof String) {
            out << it
        } else {
            out << "<${it.name()}>${processNode(it)}</${it.name()}>"
        }
    }
    return out.toString()
}

def body = """<html>
<p>
This&nbsp;is a <span>test</span> with <b>some</b> formattings.<br />
And this has a <ac:special>special</ac:special> formatting.
</p>
</html>"""

def parser = new org.ccil.cowan.tagsoup.Parser()
parser.setFeature(parser.namespacesFeature,false)
def page = new XmlSlurper(parser).parseText(body)
def out = new StringBuilder("")
page.childNodes().each {
    out << processNode(it)
}
println out.toString()
""

score 2 · Accepted Answer

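You will have to decide whether you want standards-compliant parsing, which means going the DTD route, or whether you just want to accept anything with a permissive parser.

In my experience TagSoup is good for the latter, and it rarely causes any problems, so I was surprised to read your comment about how it handles "special". A quick test also showed that I cannot reproduce it. When running this command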

  java net.sf.saxon.Query -x:org.ccil.cowan.tagsoup.Parser -s:- -qs:. !encoding=ASCII !indent=yes

on your sample, I got this result

<?xml version="1.0" encoding="ASCII"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:html="http://www.w3.org/1999/xhtml">
   <body>
      <p>
    This&#xa0;is a <span>test</span> with <b>some</b> formattings.<br clear="none"/>
    And this has a <ac:special xmlns:ac="urn:x-prefix:ac">special</ac:special> formatting.
  </p>

   </body>
</html>

from both TagSoup 1.2 and 1.2.1. So for me it behaves as expected, with the text "special" inside the "ac:special" tag.
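The same check can be sketched in Groovy. This is only an illustration under a few assumptions: that TagSoup 1.2.1 can be fetched with @Grab (the question's sample uses 0.9.7), and that Groovy 3 or later is in use, where XmlSlurper lives in groovy.xml (on Groovy 2.x, drop the import). The variable names are mine.

```groovy
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import groovy.xml.XmlSlurper

def body = '''<html>
<p>
This&nbsp;is a <span>test</span> with <b>some</b> formattings.<br />
And this has a <ac:special>special</ac:special> formatting.
</p>
</html>'''

// TagSoup is a SAX XMLReader, so it can be handed straight to XmlSlurper.
def page = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parseText(body)

// Depth-first search; accept both the local name and the prefixed name,
// since which one is reported depends on the parser's namespace handling.
def special = page.'**'.find { it.name() in ['special', 'ac:special'] }
println "ac:special contains: '${special?.text()}'"
```

If the printed text is empty with the older TagSoup release but not with this one, the version is the likely culprit.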

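As for the DTD variant, you could look into resolving the DTD through a caching proxy, referring to a local copy, or even reducing the DTD to the bare minimum you need. The following should be enough to get you past the &nbsp; entity: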

<!DOCTYPE DOC[<!ENTITY nbsp "&#160;">]>
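A minimal Groovy sketch of this internal-subset approach follows. It rests on several assumptions of mine: Groovy 3 or later (XmlSlurper in groovy.xml), the three-argument XmlSlurper constructor whose last flag allows DOCTYPE declarations (recent Groovy versions reject them by default), and a non-namespace-aware parser so that the undeclared ac: prefix does not cause an error. The DOCTYPE name html is used here instead of DOC; for a non-validating parse the name should not matter.

```groovy
import groovy.xml.XmlSlurper   // Groovy 3+; on Groovy 2.x drop this import

def body = '''<html>
<p>
This&nbsp;is a <span>test</span> with <b>some</b> formattings.<br />
And this has a <ac:special>special</ac:special> formatting.
</p>
</html>'''

// Internal DTD subset that declares only the entity the document uses,
// so nothing has to be fetched over the network.
def doctype = '<!DOCTYPE html [<!ENTITY nbsp "&#160;">]>\n'

// validating = false, namespaceAware = false (the ac: prefix is undeclared),
// allowDocTypeDeclaration = true (otherwise the DOCTYPE is rejected).
def slurper = new XmlSlurper(false, false, true)
def page = slurper.parseText(doctype + body)

// Without namespace awareness the element keeps its prefixed name.
def special = page.'**'.find { it.name() == 'ac:special' }
println "ac:special contains: '${special?.text()}'"
println "nbsp expanded: ${page.p.text().contains('\u00a0')}"
```

If the documents use more entities than &nbsp;, each one would need its own ENTITY declaration in the subset; at that point a local copy of the XHTML DTD may be less work.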
