xml - 混合内容和字符串操作清理

Question

我正处于将基于 Word 的文档转换为 XML 的非常痛苦的过程中。我遇到了以下问题：

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <p>
        <element>This one is taken care of.</element> Some more text. „&lt;hi rend="italics">Is this a
            quote</hi>?” (Source). </p>

    <p>
        <element>This one is taken care of.</element> Some more text. „&lt;hi rend="italics">This is a
            quote</hi>” (Source). </p>

    <p>
        <element>This one is taken care of.</element> Some more text. „&lt;hi rend="italics">This is
            definitely a quote</hi>!” (Source). </p>

    <p>
        <element>This one is taken care of.</element> Some more text.„&lt;hi rend="italics">This is a
            first quote</hi>” (Source). „&lt;hi rend="italics">Sometimes there is a second quote as
            well</hi>!?” (Source). </p>

</root>

<p>节点具有混合内容。<element>我已经在之前的迭代中处理过。但现在问题在于部分出现在文本节点内<hi rend= "italics"/>和部分作为文本节点的引用和来源。

如何使用 XSLT 2.0 来：

匹配<hi rend="italics">最后一个字符为“„”的文本节点之前的所有节点？
输出<hi rend="italics">as的内容<quote>...</quote>，去掉引号（“„”和“””），但在<quote/>紧跟在<hi rend="italics">?的同级之后出现的任何问号和感叹号中包括
将节点后面的“（”和“）”之间的文本节点转换<hi rend="italics">为<source>...</source>不带括号的节点。
包括最后的句号。

换句话说，我的输出应该是这样的：

<root>
<p>
<element>This one is taken care of.</element> Some more text. <quote>Is this a quote?</quote> <source>Source</source>.
</p>

<p>
<element>This one is taken care of.</element> Some more text. <quote>This is a quote</hi> <source>Source</source>.
</p>

<p>
<element>This one is taken care of.</element> Some more text. <quote>This is definitely a quote!</hi> <source>Source</source>.
</p>

<p>
<element>This one is taken care of.</element> Some more text. <quote>This is a first quote</quote> <source>Source</source>. <quote>Sometimes there is a second quote as well!?</quote> <source>Source</source>. 
</p>

</root>

我从来没有处理过像这样的混合内容和字符串操作，整个事情真的让我失望。我将非常感谢您的提示。

score 2 · Accepted Answer

这种转变：

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output omit-xml-declaration="yes"/>

 <xsl:template match="node()|@*">
     <xsl:copy>
       <xsl:apply-templates select="node()|@*"/>
     </xsl:copy>
 </xsl:template>

 <xsl:template match=
  "hi[@rend='italics'
     and
      preceding-sibling::node()[1][self::text()[ends-with(., '„')]]
      ]">

  <quote>
    <xsl:value-of select=
     "concat(.,
             if(matches(following-sibling::text()[1], '^[?!]+'))
              then replace(following-sibling::text()[1], '^([?!]+).*$', '$1')
              else()
             )
      "/>
  </quote>
 </xsl:template>

 <xsl:template match="text()[true()]">
  <xsl:variable name="vThis" select="."/>
  <xsl:variable name="vThis2" select="translate($vThis, '„”?!', '')"/>

  <xsl:value-of select="substring-before(concat($vThis2, '('), '(')"/>
  <xsl:if test="contains($vThis2, '(')">
   <source>
    <xsl:value-of select=
      "substring-before(substring-after($vThis2, '('), ')')"/>
   </source>
   <xsl:value-of select="substring-after($vThis2, ')')"/>
  </xsl:if>
 </xsl:template>
</xsl:stylesheet>

应用于提供的 XML 文档时：

<root>
        <p>
            <element>This one is taken care of.</element> Some more text. „&lt;hi rend="italics">Is this a
                quote</hi>?” (Source). </p>

        <p>
            <element>This one is taken care of.</element> Some more text. „&lt;hi rend="italics">This is a
                quote</hi>” (Source). </p>

        <p>
            <element>This one is taken care of.</element> Some more text. „&lt;hi rend="italics">This is
                definitely a quote</hi>!” (Source). </p>

        <p>
            <element>This one is taken care of.</element> Some more text.„&lt;hi rend="italics">This is a
                first quote</hi>” (Source). „&lt;hi rend="italics">Sometimes there is a second quote as
                well</hi>!?” (Source). </p>

</root>

产生想要的正确结果：

<root>
        <p>
            <element>This one is taken care of.</element> Some more text. <quote>Is this a
                quote?</quote> <source>Source</source>. </p>

        <p>
            <element>This one is taken care of.</element> Some more text. <quote>This is a
                quote</quote> <source>Source</source>. </p>

        <p>
            <element>This one is taken care of.</element> Some more text. <quote>This is
                definitely a quote!</quote> <source>Source</source>. </p>

        <p>
            <element>This one is taken care of.</element> Some more text.<quote>This is a
                first quote</quote> <source>Source</source>. <quote>Sometimes there is a second quote as
                well!?</quote> <source>Source</source>. </p>

</root>

score 1 · Accepted Answer

这是一个替代解决方案。它允许更多叙述风格的输入文档（引号内的引号，一个文本节点内的多个（源）片段，'„' 作为数据，当后面没有 hi 元素时）。

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:so="http://stackoverflow.com/questions/12690177"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xsl xs so">
<xsl:output omit-xml-declaration="yes" indent="yes" />
<xsl:strip-space elements="*" />  

<xsl:template match="@*|comment()|processing-instruction()">
  <xsl:copy />
</xsl:template>

<xsl:template match="*">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()" />
  </xsl:copy>
</xsl:template>

<xsl:function name="so:clip-start" as="xs:string">
  <xsl:param name="in-text" as="xs:string" />
  <xsl:value-of select="substring($in-text,1,string-length($in-text)-1)" />
</xsl:function>

<xsl:function name="so:clip-end" as="xs:string">
  <xsl:param name="in-text" as="xs:string" />
  <xsl:value-of select="substring-after($in-text,'”')" />
</xsl:function>

<xsl:function name="so:matches-start" as="xs:boolean">
  <xsl:param name="text-node" as="text()" />
  <xsl:value-of select="$text-node/following-sibling::node()/self::hi[@rend='italics'] and
                        ends-with($text-node, '„')" />
</xsl:function>

<xsl:template match="text()[so:matches-start(.)]"    priority="2">
  <xsl:call-template name="parse-text">
   <xsl:with-param name="text" select="so:clip-start(.)" />
  </xsl:call-template>
</xsl:template>

<xsl:function name="so:matches-end" as="xs:boolean">
  <xsl:param name="text-node" as="text()" />
  <xsl:value-of select="$text-node/preceding-sibling::node()/self::hi[@rend='italics'] and
                        matches($text-node,'^[!?]*”')" />
</xsl:function>

<xsl:template match="text()[so:matches-end(.)]"   priority="2">
  <xsl:call-template name="parse-text">
   <xsl:with-param name="text" select="so:clip-end(.)" />
  </xsl:call-template>
</xsl:template>

<xsl:template match="text()[so:matches-start(.)][so:matches-end(.)]" priority="3">
  <xsl:call-template name="parse-text">
   <xsl:with-param name="text" select="so:clip-end(so:clip-start(.))" />
  </xsl:call-template>
</xsl:template>

<xsl:template match="text()" name="parse-text" priority="1">
  <xsl:param name="text" select="." />
  <xsl:analyze-string select="$text" regex="\(([^)]*)\)">
    <xsl:matching-substring>
      <source>
        <xsl:value-of select="regex-group(1)" />
      </source>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="." />
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>

<xsl:template match="hi[@rend='italics']">
  <quote>
    <xsl:apply-templates select="(@* except @rend) | node()" />
    <xsl:for-each select="following-sibling::node()[1]/self::text()[matches(.,'^[!?]')]">
      <xsl:value-of select="replace(., '^([!?]+).*$', '$1')" />
    </xsl:for-each>   
  </quote>
</xsl:template>

</xsl:stylesheet>

xml - 混合内容和字符串操作清理

2 回答 2

Related

Reference