xml - Recursively sort the elements of an arbitrary XML document

Question

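I am trying to sort and canonicalize some XML documents. The desired end result is:

1. The children of every element are in alphabetical order
2. The attributes of every element are in alphabetical order
3. Comments are removed
4. All elements are properly spaced (i.e. "pretty print")

I have achieved all of these goals except #1.

I have been using this answer as my template. Here is what I have so far: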

import javax.xml.transform.stream.StreamResult
import javax.xml.transform.stream.StreamSource
import javax.xml.transform.TransformerFactory
import org.apache.xml.security.c14n.Canonicalizer

// Initialize the security library
org.apache.xml.security.Init.init()

// Create some variables

// Get arguments

// Make sure required arguments have been provided

if(!error) {
    // Create some variables
    def ext = fileInName.tokenize('.').last()
    fileOutName = fileOutName ?: "${fileInName.lastIndexOf('.').with {it != -1 ? fileInName[0..<it] : fileInName}}_CANONICALIZED_AND_SORTED.${ext}"
    def fileIn = new File(fileInName)
    def fileOut = new File(fileOutName)
    def xsltFile = new File(xsltName)
    def temp1 = new File("./temp1")
    def temp2 = new File("./temp2")
    def os
    def is

    // Sort the XML attributes, remove comments, and remove extra whitespace
    println "Canonicalizing..."
    Canonicalizer c = Canonicalizer.getInstance(Canonicalizer.ALGO_ID_C14N_OMIT_COMMENTS)
    os = temp1.newOutputStream()
    c.setWriter(os)
    c.canonicalize(fileIn.getBytes())
    os.close()

    // Sort the XML elements
    println "Sorting..."
    def factory = TransformerFactory.newInstance()
    is = xsltFile.newInputStream()
    def transformer = factory.newTransformer(new StreamSource(is))
    is.close()
    is = temp1.newInputStream()
    os = temp2.newOutputStream()
    transformer.transform(new StreamSource(is), new StreamResult(os))
    is.close()
    os.close()

    // Write the XML output in "pretty print"
    println "Beautifying..."
    def parser = new XmlParser()
    def printer = new XmlNodePrinter(new IndentPrinter(fileOut.newPrintWriter(), "    ", true))
    printer.print parser.parseText(temp2.getText())

    // Cleanup
    temp1.delete()
    temp2.delete()

    println "Done!"
}

The full script is here.

XSLT:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="foo">
    <foo>
      <xsl:apply-templates>
        <xsl:sort select="name()"/>
      </xsl:apply-templates>
    </foo>
  </xsl:template>
</xsl:stylesheet>

Sample input XML:

<foo b="b" a="a" c="c">
    <qwer>
    <zxcv c="c" b="b"/>
    <vcxz c="c" b="b"/>
    </qwer>
    <baz e="e" d="d"/>
    <bar>
    <fdsa g="g" f="f"/>
    <asdf g="g" f="f"/>
    </bar>
</foo>

Desired output XML:

<foo a="a" b="b" c="c">
    <bar>
        <asdf f="f" g="g"/>
        <fdsa f="f" g="g"/>
    </bar>
    <baz d="d" e="e"/>
    <qwer>
        <vcxz b="b" c="c"/>
        <zxcv b="b" c="c"/>
    </qwer>
</foo>

How can I make the transformation apply to all elements, so that the children of every element are sorted alphabetically?

score 8 · Accepted Answer

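If you want the transformation to apply to all elements, you need a template that matches every element, rather than a template that only matches the specific "foo" element: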

<xsl:template match="*">

Note that you then have to change the current template that matches "node()" so that it excludes elements:

 <xsl:template match="node()[not(self::*)]|@*">

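In this template you will also need code to select the attributes, because at the moment your "foo" template ignores them (<xsl:apply-templates /> does not select attributes).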

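Actually, looking at your requirements, items 1 to 3 can all be done with a single XSLT. For example, to remove comments, you can simply exclude them from the template that currently matches node():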

<xsl:template match="node()[not(self::comment())][not(self::*)]|@*">

Try the following XSLT, which should achieve points 1 to 3:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="node()[not(self::comment())][not(self::*)]|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="*">
    <xsl:copy>
      <xsl:apply-templates select="@*">
        <xsl:sort select="name()"/>
      </xsl:apply-templates>
      <xsl:apply-templates>
        <xsl:sort select="name()"/>
      </xsl:apply-templates>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
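
To check what this stylesheet does without the rest of the pipeline, it can be run on its own through the JDK's built-in XSLT 1.0 processor. The following is only a minimal sketch: the class name `SortXmlDemo` and the `transform` helper are made up for illustration, the stylesheet is inlined as a string, and the sample input is the one from the question with a comment added.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class: applies the stylesheet above using javax.xml.transform.
class SortXmlDemo {
    // The stylesheet from the answer, inlined (single quotes avoid Java escaping).
    static final String XSLT = String.join("\n",
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>",
        "  <xsl:output method='xml' indent='yes'/>",
        "  <xsl:strip-space elements='*'/>",
        "  <xsl:template match='node()[not(self::comment())][not(self::*)]|@*'>",
        "    <xsl:copy><xsl:apply-templates select='node()|@*'/></xsl:copy>",
        "  </xsl:template>",
        "  <xsl:template match='*'>",
        "    <xsl:copy>",
        "      <xsl:apply-templates select='@*'><xsl:sort select='name()'/></xsl:apply-templates>",
        "      <xsl:apply-templates><xsl:sort select='name()'/></xsl:apply-templates>",
        "    </xsl:copy>",
        "  </xsl:template>",
        "</xsl:stylesheet>");

    // Runs the stylesheet over the given XML text and returns the result as a string.
    static String transform(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String input = String.join("\n",
            "<foo b='b' a='a' c='c'>",
            "    <!-- A comment -->",
            "    <qwer><zxcv c='c' b='b'/><vcxz c='c' b='b'/></qwer>",
            "    <baz e='e' d='d'/>",
            "    <bar><fdsa g='g' f='f'/><asdf g='g' f='f'/></bar>",
            "</foo>");
        System.out.println(transform(input));
    }
}
```

The exact indentation of the printed result depends on the XSLT processor, so item 4 (pretty printing) may still need the separate printing step from the question.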

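Edit: The template <xsl:template match="node()[not(self::comment())][not(self::*)]|@*"> can actually be replaced with <xsl:template match="processing-instruction()|@*">, which may be more readable. This is because "node()" matches elements, text nodes, comments and processing instructions. In your XSLT, elements are picked up by the other template, text nodes are picked up by the built-in templates, and comments are what you want to ignore, which leaves only processing instructions.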
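
The reasoning about node types can be checked with a small input that contains one node of each kind. This is a sketch under the same assumptions as any standalone test harness: the class name `NodeTypeDemo`, the `transform` helper and the tiny input document are invented for illustration, and only the simplified `processing-instruction()|@*` match comes from the edit above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class: shows which node kinds survive the simplified stylesheet.
class NodeTypeDemo {
    static final String XSLT = String.join("\n",
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>",
        "  <xsl:output method='xml' indent='yes'/>",
        "  <xsl:strip-space elements='*'/>",
        // Simplified match: only processing instructions and attributes are copied here.
        "  <xsl:template match='processing-instruction()|@*'>",
        "    <xsl:copy><xsl:apply-templates select='node()|@*'/></xsl:copy>",
        "  </xsl:template>",
        // Elements: copy, then process attributes and children sorted by name.
        "  <xsl:template match='*'>",
        "    <xsl:copy>",
        "      <xsl:apply-templates select='@*'><xsl:sort select='name()'/></xsl:apply-templates>",
        "      <xsl:apply-templates><xsl:sort select='name()'/></xsl:apply-templates>",
        "    </xsl:copy>",
        "  </xsl:template>",
        "</xsl:stylesheet>");

    static String transform(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // One processing instruction, one comment, one element with text, one empty element.
        String input = "<root><?keep me?><!-- drop me --><b>text</b><a/></root>";
        // Expectation: the comment disappears (built-in rule for comments emits nothing),
        // the text is emitted by the built-in text rule, and the processing instruction is copied.
        System.out.println(transform(input));
    }
}
```

One side effect worth knowing: because the children are sorted by name(), a processing instruction is ordered by its target name along with the elements, so it does not necessarily keep its original position.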

score 3 · Accepted Answer

For fun, you can also do this programmatically:

def x = '''<foo b="b" a="a" c="c">
    <qwer>
    <!-- A comment -->
    <zxcv c="c" b="b"/>
    <vcxz c="c" b="b"/>
    </qwer>
    <baz e="e" d="d"/>
    <bar>
    <fdsa g="g" f="f"/>
    <asdf g="g" f="f"/>
    </bar>
</foo>'''

def order( node ) {
    [ *:node.attributes() ].sort().with { attr ->
        node.attributes().clear()
        attr.each { node.attributes() << it }
    }
    node.children().sort { it.name() }
                   .each { order( it ) }
    node
}

def doc = new XmlParser().parseText( x )

println groovy.xml.XmlUtil.serialize( order( doc ) )

If your nodes have text content, then you need to change it to:

def x = '''<foo b="b" a="a" c="c">
    <qwer>
    <!-- A comment -->
    <zxcv c="c" b="b">Some Text</zxcv>
    <vcxz c="c" b="b"/>
    </qwer>
    <baz e="e" d="d">Woo</baz>
    <bar>
    <fdsa g="g" f="f"/>
    <asdf g="g" f="f"/>
    </bar>
</foo>'''

def order( node ) {
    [ *:node.attributes() ].sort().with { attr ->
        node.attributes().clear()
        attr.each { node.attributes() << it }
    }
    // Sort by element name. Text children are Strings and have no name(),
    // so they get an empty key, which keeps them ahead of the elements.
    node.children().sort { it instanceof Node ? it.name().toString() : '' }
                   .grep( Node )
                   .each { order( it ) }
    node
}

def doc = new XmlParser().parseText( x )

println groovy.xml.XmlUtil.serialize( order( doc ) )

Which then gives:

<?xml version="1.0" encoding="UTF-8"?><foo a="a" b="b" c="c">
  <bar>
    <asdf f="f" g="g"/>
    <fdsa f="f" g="g"/>
  </bar>
  <baz d="d" e="e">Woo</baz>
  <qwer>
    <vcxz b="b" c="c"/>
    <zxcv b="b" c="c">Some Text</zxcv>
  </qwer>
</foo>
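
For readers who are not using Groovy, the same recursive idea can be sketched in plain Java with the DOM API. This is an illustrative sketch rather than a drop-in replacement: the class `DomSorter` and its helper methods are hypothetical names, it only reorders child elements and removes comments, and it does not try to control attribute order, which the DOM does not guarantee.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class: recursively sorts child elements by tag name.
class DomSorter {
    // Parses XML text into a DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Removes comment children, then re-appends element children in name order.
    // Text children are left in place, so they end up ahead of the elements.
    static void order(Element parent) {
        List<Element> elements = new ArrayList<>();
        List<Node> comments = new ArrayList<>();
        NodeList kids = parent.getChildNodes();
        // Collect first, because the NodeList is live and changes as nodes are moved.
        for (int i = 0; i < kids.getLength(); i++) {
            Node kid = kids.item(i);
            if (kid.getNodeType() == Node.ELEMENT_NODE) {
                elements.add((Element) kid);
            } else if (kid.getNodeType() == Node.COMMENT_NODE) {
                comments.add(kid);
            }
        }
        for (Node comment : comments) {
            parent.removeChild(comment);
        }
        elements.sort(Comparator.comparing(Element::getTagName));
        for (Element element : elements) {
            parent.appendChild(element); // appendChild moves an existing child to the end
            order(element);
        }
    }

    // Lists the tag names of the direct child elements, in document order.
    static List<String> childNames(Element parent) {
        List<String> names = new ArrayList<>();
        for (Node kid = parent.getFirstChild(); kid != null; kid = kid.getNextSibling()) {
            if (kid.getNodeType() == Node.ELEMENT_NODE) {
                names.add(kid.getNodeName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Document doc = parse("<foo><qwer><!-- A comment --><zxcv>Some Text</zxcv><vcxz/></qwer>"
            + "<baz>Woo</baz><bar><fdsa/><asdf/></bar></foo>");
        order(doc.getDocumentElement());
        System.out.println(childNames(doc.getDocumentElement()));
    }
}
```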
