我一直在玩 PETL,看看是否可以提取多个 xml 文件并将它们合并为一个。
我无法控制 XML 文件的结构,这是我看到的变化,这给我带来了麻烦。
XML 文件 1 示例:
<?xml version="1.0" encoding="utf-8"?>
<Export>
<Info>
<Name>John Doe</Name>
<Date>01/01/2021</Date>
</Info>
<App>
<Description></Description>
<Type>Two</Type>
<Details>
<DetailOne>1</DetailOne>
<DetailTwo>2</DetailTwo>
</Details>
<Details>
<DetailOne>10</DetailOne>
<DetailTwo>11</DetailTwo>
</Details>
</App>
</Export>
XML 文件 2 示例:
<?xml version="1.0" encoding="utf-8"?>
<Export>
<Info>
<Name></Name>
<Date>01/02/2021</Date>
</Info>
<App>
<Description>Sample description here.</Description>
<Type>One</Type>
<Details>
<DetailOne>1</DetailOne>
<DetailTwo>2</DetailTwo>
<DetailOne>3</DetailOne>
<DetailTwo>4</DetailTwo>
</Details>
<Details>
<DetailOne>10</DetailOne>
<DetailTwo>11</DetailTwo>
</Details>
</App>
</Export>
我的 python 代码只是扫描子文件夹 xmlfiles,然后尝试使用 PETL 从那里解析。根据文档的结构,到目前为止,我正在加载三个表:
1 保存信息名称和日期 2 保存描述并键入 3 收集详细信息
import petl as etl
import os
from lxml import etree
for filename in os.listdir(os.getcwd() + '.\\xmlfiles\\'):
if filename.endswith('.xml'):
# Get the info children
table1 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'Info', {
'Name': 'Name',
'Date': 'Date'
})
# Get the App children
table2 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'App', {
'Description': 'Description',
'Type': 'Type'
})
# Get the App Details children
table3 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'App/Details', {
'DetailOne': 'DetailOne',
'DetailTwo': 'DetailTwo'
})
# concat
c = etl.crossjoin(table1, table2, table3)
# I want the filename added on
result = etl.addfield(c, 'FileName', filename)
print('Results:\n', result)
我将这三个表连接起来,因为我希望每一行的 Info 和 App 数据都包含每个细节。这一直有效,直到我得到一个包含多个 DetailOne 和 DetailTwo 元素的 XML 文件。
我得到的结果是:
结果:
+------------+----------+-------------+------+-----------+-----------+----------+
| Date | Name | Description | Type | DetailOne | DetailTwo | FileName |
+============+==========+=============+======+===========+===========+==========+
| 01/01/2021 | John Doe | None | Two | 1 | 2 | one.xml |
+------------+----------+-------------+------+-----------+-----------+----------+
| 01/01/2021 | John Doe | None | Two | 10 | 11 | one.xml |
+------------+----------+-------------+------+-----------+-----------+----------+
结果:
+------------+------+--------------------------+------+------------+------------+----------+
| Date | Name | Description | Type | DetailOne | DetailTwo | FileName |
+============+======+==========================+======+============+============+==========+
| 01/02/2021 | None | Sample description here. | One | ('1', '3') | ('2', '4') | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One | 10 | 11 | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
显示 DetailOne 为 ('1','3') 和 DetailTwo 为 ('2', '4') 的第二个文件不是我想要的。
我想要的是:
+------------+------+--------------------------+------+------------+------------+----------+
| Date | Name | Description | Type | DetailOne | DetailTwo | FileName |
+============+======+==========================+======+============+============+==========+
| 01/02/2021 | None | Sample description here. | One | 1 | 2 | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One | 3 | 4 | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One | 10 | 11 | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
我相信 XPath 可能是要走的路,但经过研究:
https://petl.readthedocs.io/en/stable/io.html#xml-files - 没有深入了解 lxml 和 petl
这里有一些简单的阅读: https ://www.w3schools.com/xml/xpath_syntax.asp
更多阅读: https ://lxml.de/tutorial.html
对此的任何帮助表示赞赏!