xml - Replacing newlines in XML attributes with XSLT

Question

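I need some XSLT (or something else, see below) that replaces the newlines in all attributes with an alternative character.

I have to process legacy XML that stores all of its data as attributes and uses newlines to express cardinality. For example: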

<sample>
    <p att="John
    Paul
    Ringo"></p>
</sample>

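When I parse the file in Java, these newlines are replaced with spaces (as the XML specification requires). I want to treat the values as a list, however, so this behaviour isn't particularly useful.

My "solution" was to use XSLT to replace all newlines in all attributes with some other delimiter, but I have zero knowledge of XSLT. All the examples I've seen so far are either very specific or replace node content rather than attribute values.

I've dabbled with XSLT 2.0's replace(), but I'm struggling to put everything together.

Is XSLT even the right solution? With the XSLT below: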

<xsl:template match="sample/*">
    <xsl:for-each select="@*">
        <xsl:value-of select="replace(current(), '\n', '|')"/>
    </xsl:for-each>
</xsl:template>

applied to the sample XML, Saxon outputs the following:

John Paul Ringo

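Obviously this format isn't what I'm after (it was only meant as an experiment with replace()), but have the newlines already been normalised by the time we get to XSLT processing? If so, is there any other way to parse these values as written using a Java parser? I've only used JAXB so far.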

score 2 · Accepted Answer

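This seems hard to do. As I found in "Are line breaks in XML attribute values allowed?", newlines in attributes are valid, but the XML parser normalises them (https://stackoverflow.com/a/8188290/1324394), so they are probably lost before processing (and therefore before any replacement).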
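The normalisation described above can be observed with the JDK's built-in DOM parser. The following is a minimal sketch, not part of the original answer; the class and method names are hypothetical, and the sample document is a cut-down version of the one in the question with the indentation removed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: shows what a conforming parser reports for an
// attribute value that contains literal line breaks.
class AttributeNormalizationDemo {

    /** Returns the named attribute of the document element, as the parser reports it. */
    static String parsedAttribute(String xml, String attribute) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getAttribute(attribute);
    }

    public static void main(String[] args) throws Exception {
        // Literal newlines inside the attribute value, as in the legacy XML
        String xml = "<p att=\"John\nPaul\nRingo\"/>";
        String value = parsedAttribute(xml, "att");

        System.out.println(value);                // prints John Paul Ringo
        System.out.println(value.contains("\n")); // prints false
    }
}
```

Each literal line break has become a single space by the time application code (or an XSLT processor sitting behind the parser) sees the value, so there is nothing left for replace() to match.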

score 1 · Accepted Answer

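I've solved this by pre-processing the XML with JSoup (a nod to @Ian Roberts' comment about parsing XML with a non-XML tool). JSoup is (or was) designed for HTML documents, but it works well in this context.

My code is as follows: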

@Test
public void verifyNewlineEscaping() throws IOException {
    final List<Node> nodes = Parser.parseXmlFragment(FileUtils.readFileToString(sourcePath.toFile(), "UTF-8"), "");

    fixAttributeNewlines(nodes);

    // Reconstruct XML
    StringBuilder output = new StringBuilder();
    for (Node node : nodes) {
        output.append(node.toString());
    }

    // Print cleansed output to stdout
    System.out.println(output);
}

/**
 * Replace newlines and surrounding whitespace in XML attributes with an alternative delimiter in
 * order to avoid whitespace normalisation converting newlines to a single space.
 * 
 * <p>
 * This is useful if newlines which have semantic value have been incorrectly inserted into
 * attribute values.
 * </p>
 * 
 * @param nodes nodes to update
 */
private static void fixAttributeNewlines(final List<Node> nodes) {

    /*
     * Recursively iterate over all attributes in all nodes in the XML document, performing
     * attribute string replacement
     */
    for (final Node node : nodes) {
        final List<Attribute> attributes = node.attributes().asList();

        for (final Attribute attribute : attributes) {

            // JSoup reports whitespace as attributes
            if (!StringUtils.isWhitespace(attribute.getValue())) {
                attribute.setValue(attribute.getValue().replaceAll("\\s*\r?\n\\s*", "|"));
            }
        }

        // Recursively process child nodes
        if (!node.childNodes().isEmpty()) {
            fixAttributeNewlines(node.childNodes());
        }
    }
}

For the sample XML in my question, the output of this method is:

<sample> 
    <p att="John|Paul|Ringo"></p> 
</sample>

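Note that I haven't used &#10;, because JSoup is rather vigilant in its character escaping and escapes everything in attribute values. It also replaces existing numeric entity references with their UTF-8 equivalents, so time will tell whether this is a passable solution.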
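Once the attribute carries a "|" delimiter, an ordinary XML parser can read it and the value can be split into a list. This is a minimal sketch of that follow-up step using the JDK's DOM parser rather than JAXB; the class and method names are hypothetical and it reads only the first matching element.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper: turns a '|'-delimited attribute into a list of values.
class SplitDelimitedAttribute {

    /** Parses the XML and splits the named attribute of the first matching element on '|'. */
    static List<String> readList(String xml, String tag, String attribute) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element element = (Element) doc.getElementsByTagName(tag).item(0);
        // '|' is a regex metacharacter, so it has to be escaped for String.split
        return Arrays.asList(element.getAttribute(attribute).split("\\|"));
    }

    public static void main(String[] args) throws Exception {
        String cleaned = "<sample>\n    <p att=\"John|Paul|Ringo\"></p>\n</sample>";
        System.out.println(readList(cleaned, "p", "att")); // prints [John, Paul, Ringo]
    }
}
```

The delimiter only works if "|" never occurs in the legacy data itself; if it might, a different separator would be needed.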

score 0 · Accepted Answer

XSLT only sees the XML after it has been processed by the XML parser, which will have done attribute value normalization.

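I think some XML parsers have an option to suppress attribute value normalization. If you don't have access to such a parser, I think a textual replacement of (\r?\n) by &#x0A; prior to parsing might be your best escape route. Newlines escaped in this way won't be destroyed by attribute value normalization.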
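A textual pre-pass of this kind might look like the sketch below. It is an illustration under stated assumptions, not the answerer's code: the class and method names are hypothetical, and the scanner is deliberately simplistic (it does not recognise comments, CDATA sections, processing instructions or DOCTYPE declarations). It limits the replacement to quoted values inside tags, because a character reference is not allowed outside the root element, so replacing every line break in the file could make the document ill-formed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical pre-processor: escapes line breaks in attribute values before parsing.
class EscapeAttributeNewlines {

    /** Replaces line breaks inside quoted attribute values with the character reference &#10;. */
    static String escape(String xml) {
        StringBuilder out = new StringBuilder(xml.length() + 16);
        boolean inTag = false; // true between '<' and '>'
        char quote = 0;        // the open quote character, or 0 when outside a value
        for (int i = 0; i < xml.length(); i++) {
            char c = xml.charAt(i);
            if (quote != 0) {
                if (c == quote) {
                    quote = 0;
                    out.append(c);
                } else if (c == '\r') {
                    // Treat "\r\n" and a lone "\r" as one line break
                    if (i + 1 < xml.length() && xml.charAt(i + 1) == '\n') {
                        i++;
                    }
                    out.append("&#10;");
                } else if (c == '\n') {
                    out.append("&#10;");
                } else {
                    out.append(c);
                }
            } else {
                if (c == '<') {
                    inTag = true;
                } else if (c == '>') {
                    inTag = false;
                } else if (inTag && (c == '"' || c == '\'')) {
                    quote = c;
                }
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<sample>\n<p att=\"John\nPaul\nRingo\"></p>\n</sample>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(escape(xml))));
        Element p = (Element) doc.getElementsByTagName("p").item(0);
        // The escaped line breaks survive parsing, so the value can be split on them
        System.out.println(String.join(",", p.getAttribute("att").split("\n"))); // prints John,Paul,Ringo
    }
}
```

Because the parser now delivers real newline characters in the attribute value, an XSLT 2.0 replace(., '\n', '|') applied afterwards would also have something to match.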
