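I have been playing with PETL to see whether I can extract multiple XML files and combine them into one.

I have no control over the structure of the XML files. Below are the variations I am seeing, and they are what is giving me trouble.

XML file 1 example: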

<?xml version="1.0" encoding="utf-8"?>
    <Export>
        <Info>
            <Name>John Doe</Name>
            <Date>01/01/2021</Date>
        </Info>
        <App>
            <Description></Description>
            <Type>Two</Type>
            <Details>
                <DetailOne>1</DetailOne>
                <DetailTwo>2</DetailTwo>
            </Details>
            <Details>
                <DetailOne>10</DetailOne>
                <DetailTwo>11</DetailTwo>
            </Details>
        </App>
    </Export>

XML file 2 example:

<?xml version="1.0" encoding="utf-8"?>
    <Export>
        <Info>
            <Name></Name>
            <Date>01/02/2021</Date>
        </Info>
        <App>
            <Description>Sample description here.</Description>
            <Type>One</Type>
            <Details>
                <DetailOne>1</DetailOne>
                <DetailTwo>2</DetailTwo>
                <DetailOne>3</DetailOne>
                <DetailTwo>4</DetailTwo>
            </Details>
            <Details>
                <DetailOne>10</DetailOne>
                <DetailTwo>11</DetailTwo>
            </Details>
        </App>
    </Export>

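My Python code just scans the xmlfiles subfolder and then tries to parse each file with PETL. Based on the structure of the documents, I am loading three tables so far:

1. holds the Info Name and Date
2. holds the App Description and Type
3. collects the Details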

import petl as etl
import os
from lxml import etree

# Build the folder path once instead of concatenating strings on every call
xml_dir = os.path.join(os.getcwd(), 'xmlfiles')

for filename in os.listdir(xml_dir):
    if filename.endswith('.xml'):
        path = os.path.join(xml_dir, filename)

        # Get the Info children
        table1 = etl.fromxml(path, 'Info', {
            'Name': 'Name',
            'Date': 'Date'
        })

        # Get the App children
        table2 = etl.fromxml(path, 'App', {
            'Description': 'Description',
            'Type': 'Type'
        })

        # Get the App/Details children (one row per <Details> element)
        table3 = etl.fromxml(path, 'App/Details', {
            'DetailOne': 'DetailOne',
            'DetailTwo': 'DetailTwo'
        })

        # Cross join so every detail row carries the Info and App values
        c = etl.crossjoin(table1, table2, table3)
        # I want the filename added on
        result = etl.addfield(c, 'FileName', filename)

        print('Results:\n', result)

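I cross join the three tables because I want every detail row to carry the Info and App data. This worked until I got an XML file in which a single Details element contains multiple DetailOne and DetailTwo elements.

The results I get are:

Results: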

 +------------+----------+-------------+------+-----------+-----------+----------+
| Date       | Name     | Description | Type | DetailOne | DetailTwo | FileName |
+============+==========+=============+======+===========+===========+==========+
| 01/01/2021 | John Doe | None        | Two  | 1         | 2         | one.xml  |
+------------+----------+-------------+------+-----------+-----------+----------+
| 01/01/2021 | John Doe | None        | Two  | 10        | 11        | one.xml  |
+------------+----------+-------------+------+-----------+-----------+----------+

Results:

 +------------+------+--------------------------+------+------------+------------+----------+
| Date       | Name | Description              | Type | DetailOne  | DetailTwo  | FileName |
+============+======+==========================+======+============+============+==========+
| 01/02/2021 | None | Sample description here. | One  | ('1', '3') | ('2', '4') | two.xml  |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One  | 10         | 11         | two.xml  |
+------------+------+--------------------------+------+------------+------------+----------+

The second file, which shows DetailOne as ('1', '3') and DetailTwo as ('2', '4'), is not what I want.
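For what it's worth, the same grouping can be reproduced with the standard library alone. This is only a sketch: I am assuming, without having checked petl's source, that fromxml collects every match of a value path under each row element, much like findall does here. XML_TWO is just the second sample file inlined.

```python
import xml.etree.ElementTree as ET

# Second sample file from above, inlined so the snippet runs on its own.
XML_TWO = """<Export>
  <Info><Name></Name><Date>01/02/2021</Date></Info>
  <App>
    <Description>Sample description here.</Description>
    <Type>One</Type>
    <Details>
      <DetailOne>1</DetailOne><DetailTwo>2</DetailTwo>
      <DetailOne>3</DetailOne><DetailTwo>4</DetailTwo>
    </Details>
    <Details>
      <DetailOne>10</DetailOne><DetailTwo>11</DetailTwo>
    </Details>
  </App>
</Export>"""

root = ET.fromstring(XML_TWO)

# One "row" per <Details> element, mirroring the 'App/Details' row selector.
matches_per_row = [
    tuple(e.text for e in details.findall("DetailOne"))
    for details in root.findall("App/Details")
]
print(matches_per_row)  # [('1', '3'), ('10',)]
```

The first Details element yields two DetailOne matches, which lines up with the ('1', '3') cell in the table above.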

What I want is:

+------------+------+--------------------------+------+------------+------------+----------+
| Date       | Name | Description              | Type | DetailOne  | DetailTwo  | FileName |
+============+======+==========================+======+============+============+==========+
| 01/02/2021 | None | Sample description here. | One  | 1          | 2          | two.xml  |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One  | 3          | 4          | two.xml  |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One  | 10         | 11         | two.xml  |
+------------+------+--------------------------+------+------------+------------+----------+

I believe XPath may be the way to go, but after some research:

https://petl.readthedocs.io/en/stable/io.html#xml-files - does not go into depth on lxml and petl

Some light reading here: https://www.w3schools.com/xml/xpath_syntax.asp

More reading: https://lxml.de/tutorial.html

Any help with this is appreciated!

1 Answer

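First, thank you for taking the time to write a good question. I am happy to spend the time answering it.

I have never used PETL, but I did scan its documentation on XML processing. I think your main problem is that the <Details> tag sometimes contains one pair of tags and sometimes several. If only there were a way to extract a flat list of the values, without the enclosing tags getting in the way...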

Fortunately there is. I used https://www.webtoolkitonline.com/xml-xpath-tester.html, and the XPath expression //Details/DetailOne, applied to your sample XML (the second file), returns the list 1,3,10.
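To double-check that result without the online tester, here is a minimal sketch using Python's standard xml.etree.ElementTree rather than PETL. ElementTree supports only a subset of XPath and wants a leading dot when searching from an element, so the expression becomes .//Details/DetailOne. SAMPLE is the App part of your second file, inlined by me.

```python
import xml.etree.ElementTree as ET

# The <App> part of the second sample file, inlined for a self-contained check.
SAMPLE = """<Export>
  <App>
    <Details>
      <DetailOne>1</DetailOne><DetailTwo>2</DetailTwo>
      <DetailOne>3</DetailOne><DetailTwo>4</DetailTwo>
    </Details>
    <Details>
      <DetailOne>10</DetailOne><DetailTwo>11</DetailTwo>
    </Details>
  </App>
</Export>"""

root = ET.fromstring(SAMPLE)

# './/' means "any descendant of this element"; matches come back in document order.
texts = [e.text for e in root.findall(".//Details/DetailOne")]
print(texts)  # ['1', '3', '10']
```

Note that the list is flat: it no longer records which Details element each value came from.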

So I suspect something like this should work:

import petl as etl
import os
from lxml import etree

# Build the folder path once instead of concatenating strings on every call
xml_dir = os.path.join(os.getcwd(), 'xmlfiles')

for filename in os.listdir(xml_dir):
    if filename.endswith('.xml'):
        path = os.path.join(xml_dir, filename)

        # Get the Info children
        table1 = etl.fromxml(path, 'Info', {
            'Name': 'Name',
            'Date': 'Date'
        })

        # Get the App children
        table2 = etl.fromxml(path, 'App', {
            'Description': 'Description',
            'Type': 'Type'
        })

        # Get every DetailOne/DetailTwo below App, at any depth.
        # The paths start with '.' because ElementTree-style findall
        # rejects a path beginning with '/' when called on an element.
        table3 = etl.fromxml(path, 'App', {
            'DetailOne': './/DetailOne',
            'DetailTwo': './/DetailTwo'
        })

        # Cross join so every detail row carries the Info and App values
        c = etl.crossjoin(table1, table2, table3)
        # I want the filename added on
        result = etl.addfield(c, 'FileName', filename)

        print('Results:\n', result)

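The // in the value paths may be redundant. It is XPath syntax for "at any level in the document". I don't know how PETL handles XPath, so I am playing it as safe as I can. By the way, I agree: the documentation is rather light on detail.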
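If the flat lists turn out not to be enough, because each DetailOne still has to be paired with its DetailTwo, one fallback is to walk the <Details> elements yourself and build the rows before handing them to PETL. The sketch below uses only the standard library. detail_rows is a hypothetical helper name of mine, and it assumes the nth DetailOne inside a <Details> block belongs with the nth DetailTwo.

```python
import xml.etree.ElementTree as ET

XML_TWO = """<Export>
  <Info><Name></Name><Date>01/02/2021</Date></Info>
  <App>
    <Description>Sample description here.</Description>
    <Type>One</Type>
    <Details>
      <DetailOne>1</DetailOne><DetailTwo>2</DetailTwo>
      <DetailOne>3</DetailOne><DetailTwo>4</DetailTwo>
    </Details>
    <Details>
      <DetailOne>10</DetailOne><DetailTwo>11</DetailTwo>
    </Details>
  </App>
</Export>"""


def detail_rows(xml_text, filename):
    """Return one dict per DetailOne/DetailTwo pair, repeating Info and App fields."""
    root = ET.fromstring(xml_text)
    # findtext gives '' for an empty element; 'or None' turns that into None.
    base = {
        "Date": root.findtext("Info/Date") or None,
        "Name": root.findtext("Info/Name") or None,
        "Description": root.findtext("App/Description") or None,
        "Type": root.findtext("App/Type") or None,
    }
    rows = []
    for details in root.findall("App/Details"):
        ones = [e.text for e in details.findall("DetailOne")]
        twos = [e.text for e in details.findall("DetailTwo")]
        # Assumption: values pair up by position within each <Details> block.
        for one, two in zip(ones, twos):
            rows.append({**base, "DetailOne": one, "DetailTwo": two,
                         "FileName": filename})
    return rows


rows = detail_rows(XML_TWO, "two.xml")
for row in rows:
    print(row["DetailOne"], row["DetailTwo"], row["FileName"])
```

The resulting list of dicts should be loadable with etl.fromdicts(rows), and the per-file tables could then be stacked with etl.cat, though I have not run that part myself.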

Answered 2021-11-03T22:38:20.600