python - Converting a .tei file to a .txt file

Question

I have a .tei file in the following format.

<biblStruct xml:id="b0">
    <analytic>
        <title level="a" type="main">The Semantic Web</title>
        <author>
            <persName xmlns="http://www.tei-c.org/ns/1.0">
                <forename type="first">T</forename>
                <surname>Berners-Lee</surname>
            </persName>
        </author>
        <author>
            <persName xmlns="http://www.tei-c.org/ns/1.0">
                <forename type="first">J</forename>
                <surname>Hendler</surname>
            </persName>
        </author>
        <author>
            <persName xmlns="http://www.tei-c.org/ns/1.0">
                <forename type="first">O</forename>
                <surname>Lassilia</surname>
            </persName>
        </author>
    </analytic>
    <monogr>
        <title level="j">Scientific American</title>
        <imprint>
            <date type="published" when="2001-05" />
        </imprint>
    </monogr>
</biblStruct>

I want to convert the above file to a .txt file that looks like this:

T. Berners-Lee, J. Hendler and O. Lassilia. "The Semantic Web", Scientific American, May 2001

I tried the following code:

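import xml.etree.ElementTree as ET

tree = ET.parse(path)  # path: location of the .tei file
root = tree.getroot()
s = ""
for childs in root:
    for child in childs:
        # child.text is None for elements without direct text
        s = s + (child.text or "")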

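The problem with the above code is that the loops visit the elements in document order, so the resulting string does not come out in the order I need.

Secondly, there may be deeper levels of nesting, which would require more inner loops. Extracting content from those inner levels without checking the structure by hand is also a problem. Please help me with this.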

score 0 · Accepted Answer

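I know you are looking for a Python solution, but because XSLT is a very convenient alternative that is well suited to .xml files, I am posting an XSLT solution anyway.

I think it can be integrated into your Python solution quite easily.
So here is the necessary XSLT: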

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:tei="http://www.tei-c.org/ns/1.0" xmlns:month="http://month.com">
    <xsl:output method="text" />
    <xsl:strip-space elements="*" />

    <month:month>
        <month name="Jan" />
        <month name="Feb" />
        <month name="Mar" />
        <month name="Apr" />
        <month name="May" />
        <month name="Jun" />
        <month name="Jul" />
        <month name="Aug" />
        <month name="Sep" />
        <month name="Oct" />
        <month name="Nov" />
        <month name="Dec" />
    </month:month>

    <xsl:template match="author[position()=1]">
        <xsl:value-of select="concat(tei:persName/tei:forename, '. ',tei:persName/tei:surname)" />
    </xsl:template>    

    <xsl:template match="author">
        <xsl:value-of select="concat(', ',tei:persName/tei:forename, '. ',tei:persName/tei:surname)" />
    </xsl:template>

    <xsl:template match="author[last()]">
        <xsl:value-of select="concat(' and ',tei:persName/tei:forename, '. ',tei:persName/tei:surname)" />
    </xsl:template>

    <xsl:template match="/biblStruct">
        <xsl:apply-templates select="analytic/author" />
        <xsl:variable name="mon" select="number(substring(monogr/imprint/date/@when,6,2))" />
        <xsl:value-of select='concat(" &apos;",analytic/title,"&apos;",", ",monogr/title, ", ")' />   
        <xsl:value-of select="document('')/xsl:stylesheet/month:month/month[$mon]/@name" />
        <xsl:value-of select="concat(' ',substring(monogr/imprint/date/@when,1,4))" />
    </xsl:template>

</xsl:stylesheet>

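You do not need to know much about XSLT to understand this code:
Three templates match the author element: one matches the first occurrence, one matches the last() occurrence, and one matches everything in between. They differ only in how they handle separators such as "," and "and".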
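
For readers more at home in Python, the separator logic of those three templates can be sketched roughly as follows. This is only an analogue, not part of the stylesheet; the function name join_authors is made up for illustration, and a single author is returned without any separator.

```python
def join_authors(names):
    """Join author names as 'A, B and C', mirroring the three author templates."""
    parts = []
    last = len(names) - 1
    for i, name in enumerate(names):
        if i == 0:
            parts.append(name)            # first author: no separator
        elif i == last:
            parts.append(" and " + name)  # last author: ' and '
        else:
            parts.append(", " + name)     # authors in between: ', '
    return "".join(parts)


print(join_authors(["T. Berners-Lee", "J. Hendler", "O. Lassilia"]))
```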

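The last template processes the whole XML document and combines the output of the other three templates. It also converts the numeric month to a string by looking it up in the month:month data island.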
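
If you would rather stay in Python, the same steps can be approximated with the standard library's xml.etree.ElementTree. The sketch below is not the XSLT above; it assumes the structure shown in the question (author, title and date without a namespace, persName in the TEI namespace, and a when attribute of the form YYYY-MM). The names format_bibl, MONTHS and SAMPLE are made up for illustration.

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

SAMPLE = """<biblStruct xml:id="b0">
  <analytic>
    <title level="a" type="main">The Semantic Web</title>
    <author><persName xmlns="http://www.tei-c.org/ns/1.0">
      <forename type="first">T</forename><surname>Berners-Lee</surname>
    </persName></author>
    <author><persName xmlns="http://www.tei-c.org/ns/1.0">
      <forename type="first">J</forename><surname>Hendler</surname>
    </persName></author>
    <author><persName xmlns="http://www.tei-c.org/ns/1.0">
      <forename type="first">O</forename><surname>Lassilia</surname>
    </persName></author>
  </analytic>
  <monogr>
    <title level="j">Scientific American</title>
    <imprint><date type="published" when="2001-05" /></imprint>
  </monogr>
</biblStruct>"""


def format_bibl(bibl):
    """Format one <biblStruct> element as a single citation line."""
    names = []
    for author in bibl.findall("analytic/author"):
        # forename and surname live in the TEI namespace, so use the prefix map
        forename = author.findtext("tei:persName/tei:forename", default="", namespaces=TEI)
        surname = author.findtext("tei:persName/tei:surname", default="", namespaces=TEI)
        names.append(f"{forename}. {surname}")
    if len(names) > 1:
        authors = ", ".join(names[:-1]) + " and " + names[-1]
    else:
        authors = "".join(names)

    title = bibl.findtext("analytic/title", default="")
    journal = bibl.findtext("monogr/title", default="")

    # Assumption: the date element exists and 'when' looks like '2001-05'
    when = bibl.find("monogr/imprint/date").get("when")
    year, month = when[:4], int(when[5:7])
    return f"{authors} '{title}', {journal}, {MONTHS[month - 1]} {year}"


print(format_bibl(ET.fromstring(SAMPLE)))
```

For a file on disk, ET.parse(path).getroot() would take the place of ET.fromstring(SAMPLE), provided the root element is the biblStruct itself.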

You should also look at the namespaces defined on the xsl:stylesheet element:

One for XSL: http://www.w3.org/1999/XSL/Transform
One for TEI: http://www.tei-c.org/ns/1.0
One for month: http://month.com, used for the data island
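
The TEI namespace matters on the Python side as well. As a small illustration (a sketch using the standard library's xml.etree.ElementTree, with a hand-written author fragment modelled on the question's input), an unqualified path does not find the namespaced elements, while a prefix map or the {uri}tag notation does:

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

author = ET.fromstring(
    '<author><persName xmlns="http://www.tei-c.org/ns/1.0">'
    "<surname>Hendler</surname></persName></author>"
)

# Without a namespace the path matches nothing, so find() returns None
unqualified = author.find("persName/surname")

# With a prefix map (the prefix 'tei' is an arbitrary local choice)
qualified = author.find("tei:persName/tei:surname", {"tei": TEI_NS})

# Equivalent lookup using the {uri}tag notation
clark = author.find(f"{{{TEI_NS}}}persName/{{{TEI_NS}}}surname")

print(unqualified, qualified.text, clark.text)
```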

I hope I have made a convincing case for using an XSLT file for the transformation. The xsl:output element specifies the desired text output with method="text".
