I'm a bit stuck on this one. The data is provided in the following format (non-important content is snipped):
<?xml version="1.0" encoding="UTF-8"?>
<Content Type="Statutes">
<Indexes>
<!--SNIP-->
<Index Level="3" HasChildren="0">
<!--SNIP-->
<Content>&lt;p&gt; (1)(a)The statutes ... &lt;/p&gt;&lt;p&gt; (b)To ensure public ..: &lt;/p&gt;&lt;p&gt;
(I)Shall authorize ...; &lt;/p&gt;&lt;p&gt; (II)May authorize and ...: &lt;/p&gt;&lt;p&gt; (A)Compact disks;
&lt;/p&gt;&lt;p&gt; (B)On-line public ...; &lt;/p&gt;&lt;p&gt; (C)Electronic applications for ..;
&lt;/p&gt;&lt;p&gt; (D)Electronic books or ... &lt;/p&gt;&lt;p&gt; (E)Other electronic products or formats;
&lt;/p&gt;&lt;p&gt; (III)May, pursuant ... &lt;/p&gt;&lt;p&gt; (IV)Recognizes that ... &lt;/p&gt;&lt;p&gt;
(2)(a)Any person, ...: &lt;/p&gt;&lt;p&gt; (I)A statement specifying ...; &lt;/p&gt;&lt;p&gt; (II)A statement
specifying ...; &lt;/p&gt;&lt;p&gt; (3)A statement
specifying ...; &lt;/p&gt;&lt;p&gt; (4)A statement
specifying ...; &lt;/p&gt;</Content>
</Index>
<!--SNIP-->
</Indexes>
</Content>
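I need to take the text value of the Content element, which contains a semantic hierarchy:
(1)
 +-(a)
   +-(I)
     +-(A)
...and have an XSLT 2.0 transformation place it in the final output as parent-child element relationships: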
<?xml version="1.0" encoding="UTF-8"?>
<law>
<!--SNIP-->
<content>
<section prefix="(1)">
<section prefix="(a)">The statutes ...</section>
<section prefix="(b)">To ensure public ..:
<section prefix="(I)">Shall authorize ...;</section>
<section prefix="(II)">May authorize and ...:
<section prefix="(A)">Compact disks;</section>
<section prefix="(B)">On-line public ...;</section>
<section prefix="(C)">Electronic applications for ..;</section>
<section prefix="(D)">Electronic books or ...</section>
<section prefix="(E)">Other electronic products or formats;</section>
</section>
<section prefix="(III)">May, pursuant ...</section>
<section prefix="(IV)">Recognizes that ...</section>
</section>
</section>
<section prefix="(2)">
<section prefix="(a)">Any person, ...:
<section prefix="(I)">A statement specifying ...;</section>
<section prefix="(II)">A statement specifying ...;</section>
</section>
</section>
<section prefix="(3)">Level 1 node with no children</section>
</content>
</law>
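I was able to tokenize Content's text value on the HTML-encoded closing P tags, but I have no idea how to get the dynamically created elements to conditionally create child elements.
My XSLT 2.0 stylesheet: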
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output method="xml" encoding="UTF-8" indent="yes"/>
<xsl:template match="/Content">
<!-- Work from the lowest index level with no children up -->
<xsl:apply-templates select=".//Index[@HasChildren=0]"/>
</xsl:template>
<xsl:template match="Index[@HasChildren=0]">
<law>
<structure>
<xsl:apply-templates select="Content"/>
</structure>
</law>
</xsl:template>
<!-- Template for Content element from original -->
<xsl:template match="Content">
<content>
<!-- Loop through HTML encoded P tag endings -->
<xsl:for-each select="tokenize(.,'&lt;/p&gt;')">
<!-- Set Token to a variable and remove P opening tags -->
<xsl:variable name="sectionText">
<xsl:value-of select="normalize-space(replace(current(),'&lt;p&gt;',''))"/>
</xsl:variable>
<!-- Output -->
<xsl:if test="string-length($sectionText)!=0">
<section>
<!-- Set the section element's prefix attribute (if exists) -->
<xsl:analyze-string select="$sectionText" regex="^(\(([\w]+)\)){{1,3}}">
<xsl:matching-substring >
<xsl:attribute name="prefix" select="." />
</xsl:matching-substring>
</xsl:analyze-string>
<!-- Set the section element's value -->
<xsl:value-of select="$sectionText"/>
</section>
</xsl:if>
</xsl:for-each>
</content>
</xsl:template>
</xsl:stylesheet>
...which gets me this far, with no semantic hierarchy among the section elements:
<?xml version="1.0" encoding="UTF-8"?>
<law>
<structure>
<content>
<section prefix="(1)(a)">(1)(a)The statutes ...</section>
<section prefix="(b)">(b)To ensure public ..:</section>
<section prefix="(I)">(I)Shall authorize ...;</section>
<section prefix="(II)">(II)May authorize and ...:</section>
<section prefix="(A)">(A)Compact disks;</section>
<section prefix="(B)">(B)On-line public ...;</section>
<section prefix="(C)">(C)Electronic applications for ..;</section>
<section prefix="(D)">(D)Electronic books or ...</section>
<section prefix="(E)">(E)Other electronic products or formats;</section>
<section prefix="(III)">(III)May, pursuant ...</section>
<section prefix="(IV)">(IV)Recognizes that ...</section>
<section prefix="(2)(a)">(2)(a)Any person, ...:</section>
<section prefix="(I)">(I)A statement specifying ...;</section>
<section prefix="(II)">(II)A statement specifying ...;</section>
<section prefix="(3)">(3)Level 1 section with no children ...;</section>
</content>
</structure>
</law>
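Since the section elements are created dynamically by the XSLT 2.0 stylesheet, by tokenizing on the closing P tags, how do you dynamically establish parent-child relationships between them from the prefix attribute and the known semantic hierarchy?
Experience with other programming languages points me toward recursion: tokenizing on the prefix, with logic that compares each prefix to the previously nested prefix. I'm struggling to find any information on how to do this with my limited knowledge of XSLT 2.0 (I last used v1.0 about 10+ years ago). I know I could parse this with an external Python script and be done, but for maintainability I'm trying to stick with an XSLT 2.0 stylesheet solution.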
I'd appreciate any help getting on the right track and/or a solution.
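The recursive idea described above can be sketched in XSLT 2.0 roughly as follows. This is an untested sketch, not a definitive solution, and it rests on several assumptions: the P tags arrive as escaped text inside Content (as the tokenizing stylesheet above treats them); the hierarchy is always digit, lowercase letter, uppercase roman numeral, uppercase letter; and the names `f:level`, `nest`, `item`, and the namespace `urn:example:local` are hypothetical helpers invented for illustration. Pass 1 builds a flat list of items with a guessed level, and pass 2 nests them recursively with `xsl:for-each-group` and `group-starting-with`.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:example:local"
    exclude-result-prefixes="xs f">

  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <!-- Hypothetical helper: guess the depth from the shape of a single prefix.
       Assumption: only I, V and X occur in roman numerals, so a level-4
       letter such as (I) or (V) would be misclassified as level 3. -->
  <xsl:function name="f:level" as="xs:integer">
    <xsl:param name="prefix" as="xs:string"/>
    <xsl:sequence select="
      if (matches($prefix, '^\(\d+\)$')) then 1
      else if (matches($prefix, '^\([a-z]+\)$')) then 2
      else if (matches($prefix, '^\([IVX]+\)$')) then 3
      else 4"/>
  </xsl:function>

  <xsl:template match="/">
    <law>
      <xsl:apply-templates select="//Index[@HasChildren = 0]/Content"/>
    </law>
  </xsl:template>

  <xsl:template match="Index/Content">
    <!-- Pass 1: flat temporary tree of <item level=".." prefix="..">text</item>.
         Paragraphs without a leading prefix are skipped in this sketch. -->
    <xsl:variable name="flat">
      <xsl:for-each select="tokenize(., '&lt;/p&gt;')">
        <xsl:analyze-string
            select="normalize-space(replace(., '&lt;p&gt;', ''))"
            regex="^((\(\w+\))+)(.*)$">
          <xsl:matching-substring>
            <xsl:variable name="body" select="regex-group(3)"/>
            <!-- Split a compound prefix such as (1)(a) into (1) and (a) -->
            <xsl:variable name="prefixes" select="
              for $t in tokenize(regex-group(1), '\)')[. ne '']
              return concat($t, ')')"/>
            <xsl:for-each select="$prefixes">
              <item prefix="{.}" level="{f:level(.)}">
                <!-- Only the innermost prefix carries the paragraph text -->
                <xsl:if test="position() = last()">
                  <xsl:value-of select="$body"/>
                </xsl:if>
              </item>
            </xsl:for-each>
          </xsl:matching-substring>
        </xsl:analyze-string>
      </xsl:for-each>
    </xsl:variable>
    <content>
      <xsl:call-template name="nest">
        <xsl:with-param name="items" select="$flat/item"/>
      </xsl:call-template>
    </content>
  </xsl:template>

  <!-- Pass 2: each group starts at the shallowest level present; the rest of
       the group is nested by recursion, which stops on an empty sequence. -->
  <xsl:template name="nest">
    <xsl:param name="items" as="element(item)*"/>
    <xsl:variable name="top" select="min($items/@level)"/>
    <xsl:for-each-group select="$items"
                        group-starting-with="item[@level = $top]">
      <section prefix="{@prefix}">
        <xsl:value-of select="."/>
        <xsl:call-template name="nest">
          <xsl:with-param name="items"
                          select="current-group()[position() gt 1]"/>
        </xsl:call-template>
      </section>
    </xsl:for-each-group>
  </xsl:template>
</xsl:stylesheet>
```

The sketch needs an XSLT 2.0 processor (Saxon, for example). Classifying a prefix by its shape alone is the weak point: telling a roman numeral from an uppercase letter reliably would probably need the level of the preceding item as context, so the `f:level` rules may have to be adapted to the real data.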