
I'm a bit stuck on this one. The data is provided in the following format (non-essential content snipped):

<?xml version="1.0" encoding="UTF-8"?>
<Content Type="Statutes">
  <Indexes>
    <!--SNIP-->
    <Index Level="3" HasChildren="0">
      <!--SNIP-->
      <Content>&lt;p&gt; (1)(a)The statutes ... &lt;/p&gt;&lt;p&gt; (b)To ensure public ..: &lt;/p&gt;&lt;p&gt; 
            (I)Shall authorize ...; &lt;/p&gt;&lt;p&gt; (II)May authorize and ...: &lt;/p&gt;&lt;p&gt; (A)Compact disks; 
            &lt;/p&gt;&lt;p&gt; (B)On-line public ...; &lt;/p&gt;&lt;p&gt; (C)Electronic applications for ..; 
            &lt;/p&gt;&lt;p&gt; (D)Electronic books or ... &lt;/p&gt;&lt;p&gt; (E)Other electronic products or formats; 
            &lt;/p&gt;&lt;p&gt; (III)May, pursuant ... &lt;/p&gt;&lt;p&gt; (IV)Recognizes that ... &lt;/p&gt;&lt;p&gt; 
            (2)(a)Any person, ...: &lt;/p&gt;&lt;p&gt; (I)A statement specifying ...; &lt;/p&gt;&lt;p&gt; (II)A statement 
            specifying ...; &lt;/p&gt;&lt;p&gt; (3)A statement 
            specifying ...; &lt;/p&gt;&lt;p&gt; (4)A statement 
            specifying ...; &lt;/p&gt;</Content>
    </Index>
    <!--SNIP-->
  </Indexes>
</Content>

I need to take the text value of the Content element, which contains a semantic hierarchy:

(1)
 +-(a)
    +-(I)
       +-(A)

...and, via an XSLT 2.0 transform, emit it in the final output as parent-child element relationships:

    <?xml version="1.0" encoding="UTF-8"?>
    <law>
       <!--SNIP-->
       <content>
          <section prefix="(1)">
            <section prefix="(a)">The statutes ...</section>
            <section prefix="(b)">To ensure public ..:
              <section prefix="(I)">Shall authorize ...;</section>
              <section prefix="(II)">May authorize and ...:
                <section prefix="(A)">Compact disks;</section>
                <section prefix="(B)">On-line public ...;</section>
                <section prefix="(C)">Electronic applications for ..;</section>
                <section prefix="(D)">Electronic books or ...</section>
                <section prefix="(E)">Other electronic products or formats;</section>
              </section>
              <section prefix="(III)">May, pursuant ...</section>
              <section prefix="(IV)">Recognizes that ...</section>        
            </section>      
          </section>
          <section prefix="(2)">
            <section prefix="(a)">Any person, ...:
              <section prefix="(I)">A statement specifying ...;</section>
              <section prefix="(II)">A statement specifying ...;</section>
            </section>      
          </section>
          <section prefix="(3)">Level 1 node with no children</section>
       </content>
    </law>

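I am able to tokenize the text value of Content on the closing HTML-encoded p tags, but I don't know how to get the dynamically created elements to create child elements conditionally.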

My XSLT 2.0 stylesheet:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

    <xsl:template match="/Content">
        <!-- Work from the lowest index level with no children up -->
        <xsl:apply-templates select=".//Index[@HasChildren=0]"/>
    </xsl:template>  

    <xsl:template match="Index[@HasChildren=0]">
        <law>
            <structure>
                <xsl:apply-templates select="Content"/>
            </structure>
        </law>
    </xsl:template>

    <!-- Template for Content element from original -->
    <xsl:template match="Content">
        <content>
            <!-- Loop through HTML encoded P tag endings -->
            <xsl:for-each select="tokenize(.,'&lt;/p&gt;')">

                <!-- Set Token to a variable and remove P opening tags -->
                <xsl:variable name="sectionText">
                    <xsl:value-of select="normalize-space(replace(current(),'&lt;p&gt;',''))"/>  
                </xsl:variable>    

                <!-- Output -->
                <xsl:if test="string-length($sectionText)!=0">
                    <section>
                        <!-- Set the section element's prefix attribute (if exists) -->
                        <xsl:analyze-string select="$sectionText" regex="^(\(([\w]+)\)){{1,3}}">
                            <xsl:matching-substring >
                                <xsl:attribute name="prefix" select="." />
                            </xsl:matching-substring>
                        </xsl:analyze-string>

                        <!-- Set the section element's value -->
                        <xsl:value-of select="$sectionText"/>
                    </section>
                </xsl:if>

            </xsl:for-each>
        </content>
    </xsl:template>
</xsl:stylesheet> 

...which gets me this far, with no semantic hierarchy among the section elements:

<?xml version="1.0" encoding="UTF-8"?>
<law>
   <structure>
      <content>
         <section prefix="(1)(a)">(1)(a)The statutes ...</section>
         <section prefix="(b)">(b)To ensure public ..:</section>
         <section prefix="(I)">(I)Shall authorize ...;</section>
         <section prefix="(II)">(II)May authorize and ...:</section>
         <section prefix="(A)">(A)Compact disks;</section>
         <section prefix="(B)">(B)On-line public ...;</section>
         <section prefix="(C)">(C)Electronic applications for ..;</section>
         <section prefix="(D)">(D)Electronic books or ...</section>
         <section prefix="(E)">(E)Other electronic products or formats;</section>
         <section prefix="(III)">(III)May, pursuant ...</section>
         <section prefix="(IV)">(IV)Recognizes that ...</section>
         <section prefix="(2)(a)">(2)(a)Any person, ...:</section>
         <section prefix="(I)">(I)A statement specifying ...;</section>
         <section prefix="(II)">(II)A statement specifying ...;</section>
         <section prefix="(3)">(3)Level 1 section with no children ...;</section>
      </content>
   </structure>
</law>

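Since the section elements are created dynamically by the XSLT 2.0 stylesheet by tokenizing on the closing p tags, how do you dynamically establish parent-child relationships from the prefix attribute, given the known semantic hierarchy?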

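Experience with other programming languages points me toward recursion: tokenizing on the prefix and comparing each prefix against the previous one to decide the nesting. I'm struggling to find any information on how to do this with my limited XSLT 2.0 knowledge (I last used v1.0 about 10+ years ago). I know I could parse this with an external Python script and be done with it, but for maintainability I'm trying to stick to an XSLT 2.0 stylesheet solution.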

Any help that gets me on the right track and/or to a solution is appreciated.

2 Answers

You have already solved one tricky stage of the problem by creating intermediate output with elements like this:

<section prefix="(1)(a)">text</section>

My next step would be to compute a level number, so that it looks like this:

<section level="1" prefix="(1)(a)">text</section>

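Computing the level number is just a matter of seeing which of several regular expressions the prefix matches: (1) gives you level 1, (b) gives you level 2, and so on.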
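
The level lookup described above could be sketched roughly as follows. This is only an illustration: the function name mf:level, its namespace, and the four regular expressions are assumptions based on the sample prefixes in the question, and only the leading "(x)" of a combined prefix such as (1)(a) is looked at.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:mf="http://example.com/mf"
    exclude-result-prefixes="xs mf">

  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <!-- Hypothetical helper: map the leading "(x)" of a prefix to a depth.
       Roman numerals are tested before upper-case letters, so a letter
       item (I), (V) or (X) would be classified as level 3. -->
  <xsl:function name="mf:level" as="xs:integer">
    <xsl:param name="prefix" as="xs:string"/>
    <xsl:sequence select="
      if (matches($prefix, '^\([0-9]+\)')) then 1
      else if (matches($prefix, '^\([a-z]+\)')) then 2
      else if (matches($prefix, '^\([IVX]+\)')) then 3
      else if (matches($prefix, '^\([A-Z]+\)')) then 4
      else 0"/>
  </xsl:function>

  <!-- Identity template: copy everything else unchanged -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Add a level attribute next to the existing prefix attribute -->
  <xsl:template match="section[@prefix]">
    <section level="{mf:level(@prefix)}">
      <xsl:apply-templates select="@* | node()"/>
    </section>
  </xsl:template>
</xsl:stylesheet>
```

Run against the flat intermediate output from the question, this should leave the document as it is apart from the added level attribute. A prefix that matches none of the expressions gets level 0, which makes unexpected input easy to spot.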

Once you have the level numbers, you can use recursive positional grouping as described in this paper: http://www.saxonica.com/papers/ideadb-1.1/mhk-paper.xml
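
A minimal sketch of that kind of recursive positional grouping is shown below. It is not taken from the paper. It assumes that each flat section already carries a level attribute like the intermediate form above, that combined prefixes such as (1)(a) have been split into separate sections beforehand, and that levels never skip a step. The function name mf:nest is made up for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:mf="http://example.com/mf"
    exclude-result-prefixes="xs mf">

  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <!-- Hypothetical helper: nest a flat run of sections, starting a new
       group at every section whose level attribute equals $level. -->
  <xsl:function name="mf:nest" as="element(section)*">
    <xsl:param name="sections" as="element(section)*"/>
    <xsl:param name="level" as="xs:integer"/>
    <xsl:for-each-group select="$sections"
                        group-starting-with="section[@level = $level]">
      <!-- The context item is the first section of the current group -->
      <section prefix="{@prefix}">
        <xsl:value-of select="text()"/>
        <!-- The rest of the group is nested one level deeper -->
        <xsl:sequence select="mf:nest(current-group() except ., $level + 1)"/>
      </section>
    </xsl:for-each-group>
  </xsl:function>

  <!-- Replace the flat children of content with the nested result -->
  <xsl:template match="content">
    <content>
      <xsl:sequence select="mf:nest(section, 1)"/>
    </content>
  </xsl:template>

  <!-- Identity template: copy everything else unchanged -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

Each recursive call removes the first section of its group, so the recursion ends once a group has no remaining members. If a level is skipped in the input (for example a level 1 section followed directly by level 3), the nesting produced by this sketch will not be what you want, so that case would need extra handling.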

answered 2013-10-29T09:11:50.600

I experimented with this a bit and came up with the following stylesheet:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet 
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 xmlns:mf="http://example.com/mf"
 xmlns:d="data:,dpc" 
 exclude-result-prefixes="xs d mf">

    <xsl:include href="htmlparse.xml"/>

    <xsl:param name="patterns" as="element(pattern)*" xmlns="">
      <pattern value="^\s*(\([0-9]+\))" group="1" next="1"/>
      <pattern value="^\s*(\([0-9]+\))?\s*(\([a-z]\))" group="2" next="0"/>
      <pattern value="^\s*(\(*(I|II|III|IV|V|VI|VII|VIII|IX|X|XI|XII|XIII)\))" group="1" next="0"/>
      <pattern value="^\s*(\([A-Z]?\))" group="1" next="0"/>
    </xsl:param>

    <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

    <xsl:function name="mf:group" as="element(section)*">
      <xsl:param name="paragraphs" as="element(p)*"/>
      <xsl:param name="patterns" as="element(pattern)*"/>
      <xsl:variable name="pattern1" as="element(pattern)?" select="$patterns[1]"/>
      <xsl:for-each-group select="$paragraphs" group-starting-with="p[matches(., $pattern1/@value)]">
        <xsl:variable name="prefix" as="xs:string?">
          <xsl:analyze-string select="." regex="{$pattern1/@value}">
            <xsl:matching-substring>
              <xsl:sequence select="string(regex-group(xs:integer($pattern1/@group)))"/>
            </xsl:matching-substring>
          </xsl:analyze-string>
        </xsl:variable>
        <section prefix="{$prefix}">
          <xsl:choose>
            <xsl:when test="xs:boolean(xs:integer($pattern1/@next))">
              <xsl:sequence select="mf:group(current-group(), $patterns[position() gt 1])"/>
            </xsl:when>
            <xsl:otherwise>
              <xsl:apply-templates select="node()">
                <xsl:with-param name="pattern" as="element(pattern)" select="$pattern1" tunnel="yes"/>
              </xsl:apply-templates>
              <xsl:sequence select="mf:group(current-group() except ., $patterns[position() gt 1])"/>
            </xsl:otherwise>
          </xsl:choose>
        </section>
      </xsl:for-each-group>
    </xsl:function>

    <xsl:template match="/Content">
        <!-- Work from the lowest index level with no children up -->
        <xsl:apply-templates select=".//Index[@HasChildren=0]"/>
    </xsl:template>  

    <xsl:template match="Index[@HasChildren=0]">
        <law>
            <structure>
                <xsl:apply-templates select="Content"/>
            </structure>
        </law>
    </xsl:template>

    <!-- Template for Content element from original -->
    <xsl:template match="Content">

        <content>
            <xsl:sequence select="mf:group(d:htmlparse(., '', true())/*, $patterns)"/>
        </content>
    </xsl:template>

    <xsl:template match="p/text()[1]">
      <xsl:param name="pattern" as="element(pattern)" tunnel="yes"/>
      <xsl:value-of select="replace(., $pattern/@value, '')"/>
    </xsl:template>
</xsl:stylesheet> 

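It makes use of http://web-xslt.googlecode.com/svn/trunk/htmlparse/htmlparse.xsl, an HTML tag soup parser written in XSLT 2.0, to parse the escaped HTML fragment markup into nodes, and then groups those nodes with the function mf:group in the stylesheet. The grouping is driven by a sequence of regular expression patterns passed in as a parameter.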
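
If pulling in the parser is not an option, the sample input is simple enough (flat p tags only) that a small tokenizing function might stand in for it. The module below is only a sketch under that assumption. The function mf:paragraphs is hypothetical and not part of htmlparse, and any other markup inside the paragraphs would be left as literal text.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:mf="http://example.com/mf"
    exclude-result-prefixes="xs mf">

  <!-- Hypothetical helper, not part of htmlparse: turn escaped, flat
       paragraph markup into p elements. -->
  <xsl:function name="mf:paragraphs" as="element(p)*">
    <xsl:param name="escaped" as="xs:string"/>
    <!-- Build a temporary document so that every p has a parent node
         and can be matched by the p[...] pattern used in mf:group -->
    <xsl:variable name="doc">
      <xsl:for-each select="tokenize($escaped, '&lt;/p&gt;')">
        <!-- Drop the opening tag and collapse whitespace -->
        <xsl:variable name="text"
                      select="normalize-space(replace(., '&lt;p&gt;', ''))"/>
        <xsl:if test="$text ne ''">
          <p><xsl:value-of select="$text"/></p>
        </xsl:if>
      </xsl:for-each>
    </xsl:variable>
    <xsl:sequence select="$doc/p"/>
  </xsl:function>
</xsl:stylesheet>
```

To try it, include this module in place of the parser and, in the template for Content, call mf:group(mf:paragraphs(string(.)), $patterns) instead of passing the result of d:htmlparse. Whitespace in the result will differ slightly because of the normalize-space call.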

When applying that stylesheet with Saxon 9.5 to your input sample, I get the result

<law>
   <structure>
      <content>
         <section prefix="(1)">
            <section prefix="(a)">The statutes ... </section>
            <section prefix="(b)">To ensure public ..: <section prefix="(I)">Shall authorize ...; </section>
               <section prefix="(II)">May authorize and ...: <section prefix="(A)">Compact disks;
            </section>
                  <section prefix="(B)">On-line public ...; </section>
                  <section prefix="(C)">Electronic applications for ..;
            </section>
                  <section prefix="(D)">Electronic books or ... </section>
                  <section prefix="(E)">Other electronic products or formats;
            </section>
               </section>
               <section prefix="(III)">May, pursuant ... </section>
               <section prefix="(IV)">Recognizes that ... </section>
            </section>
         </section>
         <section prefix="(2)">
            <section prefix="(a)">Any person, ...: <section prefix="(I)">A statement specifying ...; </section>
               <section prefix="(II)">A statement
            specifying ...; </section>
            </section>
         </section>
      </content>
   </structure>
</law>

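If there can be more than 13 (XIII) sections, you will need to edit the parameter with the regular expression pattern for Roman numerals to list more numbers, as I have currently only listed those up to and including XIII.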
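
As an alternative to listing every numeral, the third pattern could be written as a general expression. The fragment below is a sketch that should cover I to XXXIX; the upper bound is an arbitrary choice. Note that the braces are doubled because the attributes of a literal result element inside xsl:param are attribute value templates.

```xml
<!-- Assumed replacement for the third pattern: Roman numerals I to XXXIX.
     {{1,3}} and {{0,3}} reach the regex engine as {1,3} and {0,3}. -->
<pattern value="^\s*(\((X{{1,3}}(IX|IV|V?I{{0,3}})|IX|IV|VI{{0,3}}|I{{1,3}})\))" group="1" next="0"/>
```

As with the explicit list, an upper-case letter item such as (I), (V) or (X) is still taken to be a Roman numeral, so that ambiguity remains.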

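Based on the comments and the edited input sample, I have adjusted the stylesheet a bit. The only change is the xsl:when test in mf:group, which now recurses only if the paragraph also matches the second pattern, so that a level 1 paragraph such as (3) that has no (a) keeps its own text: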

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet 
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 xmlns:mf="http://example.com/mf"
 xmlns:d="data:,dpc" 
 exclude-result-prefixes="xs d mf">

    <xsl:include href="htmlparse.xml"/>

    <xsl:param name="patterns" as="element(pattern)*" xmlns="">
      <pattern value="^\s*(\([0-9]+\))" group="1" next="1"/>
      <pattern value="^\s*(\([0-9]+\))?\s*(\([a-z]\))" group="2" next="0"/>
      <pattern value="^\s*(\(*(I|II|III|IV|V|VI|VII|VIII|IX|X|XI|XII|XIII)\))" group="1" next="0"/>
      <pattern value="^\s*(\([A-Z]?\))" group="1" next="0"/>
    </xsl:param>

    <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

    <xsl:function name="mf:group" as="element(section)*">
      <xsl:param name="paragraphs" as="element(p)*"/>
      <xsl:param name="patterns" as="element(pattern)*"/>
      <xsl:variable name="pattern1" as="element(pattern)?" select="$patterns[1]"/>
      <xsl:for-each-group select="$paragraphs" group-starting-with="p[matches(., $pattern1/@value)]">
        <xsl:variable name="prefix" as="xs:string?">
          <xsl:analyze-string select="." regex="{$pattern1/@value}">
            <xsl:matching-substring>
              <xsl:sequence select="string(regex-group(xs:integer($pattern1/@group)))"/>
            </xsl:matching-substring>
          </xsl:analyze-string>
        </xsl:variable>
        <section prefix="{$prefix}">
          <xsl:choose>
            <xsl:when test="xs:boolean(xs:integer($pattern1/@next)) and matches(., $patterns[2]/@value)">
              <xsl:sequence select="mf:group(current-group(), $patterns[position() gt 1])"/>
            </xsl:when>
            <xsl:otherwise>
              <xsl:apply-templates select="node()">
                <xsl:with-param name="pattern" as="element(pattern)" select="$pattern1" tunnel="yes"/>
              </xsl:apply-templates>
              <xsl:sequence select="mf:group(current-group() except ., $patterns[position() gt 1])"/>
            </xsl:otherwise>
          </xsl:choose>
        </section>
      </xsl:for-each-group>
    </xsl:function>

    <xsl:template match="/Content">
        <!-- Work from the lowest index level with no children up -->
        <xsl:apply-templates select=".//Index[@HasChildren=0]"/>
    </xsl:template>  

    <xsl:template match="Index[@HasChildren=0]">
        <law>
            <structure>
                <xsl:apply-templates select="Content"/>
            </structure>
        </law>
    </xsl:template>

    <!-- Template for Content element from original -->
    <xsl:template match="Content">

        <content>
            <xsl:sequence select="mf:group(d:htmlparse(., '', true())/*, $patterns)"/>
        </content>
    </xsl:template>

    <xsl:template match="p/text()[1]">
      <xsl:param name="pattern" as="element(pattern)" tunnel="yes"/>
      <xsl:value-of select="replace(., $pattern/@value, '')"/>
    </xsl:template>
</xsl:stylesheet> 

Now it transforms

<?xml version="1.0" encoding="UTF-8"?>
<Content Type="Statutes">
  <Indexes>
    <!--SNIP-->
    <Index Level="3" HasChildren="0">
      <!--SNIP-->
      <Content>&lt;p&gt; (1)(a)The statutes ... &lt;/p&gt;&lt;p&gt; (b)To ensure public ..: &lt;/p&gt;&lt;p&gt; 
            (I)Shall authorize ...; &lt;/p&gt;&lt;p&gt; (II)May authorize and ...: &lt;/p&gt;&lt;p&gt; (A)Compact disks; 
            &lt;/p&gt;&lt;p&gt; (B)On-line public ...; &lt;/p&gt;&lt;p&gt; (C)Electronic applications for ..; 
            &lt;/p&gt;&lt;p&gt; (D)Electronic books or ... &lt;/p&gt;&lt;p&gt; (E)Other electronic products or formats; 
            &lt;/p&gt;&lt;p&gt; (III)May, pursuant ... &lt;/p&gt;&lt;p&gt; (IV)Recognizes that ... &lt;/p&gt;&lt;p&gt; 
            (2)(a)Any person, ...: &lt;/p&gt;&lt;p&gt; (I)A statement specifying ...; &lt;/p&gt;&lt;p&gt; (II)A statement 
            specifying ...; &lt;/p&gt;&lt;p&gt; (3)A statement 
            specifying ...; &lt;/p&gt;&lt;p&gt; (4)A statement 
            specifying ...; &lt;/p&gt;</Content>
    </Index>
    <!--SNIP-->
  </Indexes>
</Content>

into

<law>
   <structure>
      <content>
         <section prefix="(1)">
            <section prefix="(a)">The statutes ... </section>
            <section prefix="(b)">To ensure public ..: <section prefix="(I)">Shall authorize ...; </section>
               <section prefix="(II)">May authorize and ...: <section prefix="(A)">Compact disks;
            </section>
                  <section prefix="(B)">On-line public ...; </section>
                  <section prefix="(C)">Electronic applications for ..;
            </section>
                  <section prefix="(D)">Electronic books or ... </section>
                  <section prefix="(E)">Other electronic products or formats;
            </section>
               </section>
               <section prefix="(III)">May, pursuant ... </section>
               <section prefix="(IV)">Recognizes that ... </section>
            </section>
         </section>
         <section prefix="(2)">
            <section prefix="(a)">Any person, ...: <section prefix="(I)">A statement specifying ...; </section>
               <section prefix="(II)">A statement
            specifying ...; </section>
            </section>
         </section>
         <section prefix="(3)">A statement
            specifying ...; </section>
         <section prefix="(4)">A statement
            specifying ...; </section>
      </content>
   </structure>
</law>
answered 2013-10-30T14:46:33.807