python - How to iteratively parse a large XML file in Python?

Question

I need to process a large .XML file of about 8 GB. The (simplified) file structure looks like this:

<TopLevelElement>
    <SomeElementList>
        <Element>zzz</Element>
        ....and so on for thousands of rows
    </SomeElementList>
    <Records>
        <RecordType1>
            <RecordItem id="aaaa">
                <SomeData>
                    <SomeMoreData NameType="xxx">
                        <NameComponent1>zzz</NameComponent1>
                        ....
                        <AnotherNameComponent>zzzz</AnotherNameComponent>
                    </SomeMoreData>
                </SomeData>
            </RecordItem>
        ..... hundreds of thousands of items, some are quite large.
        </RecordType1>
        <RecordType2>
            <RecordItem id="cccc">
            ...hundreds of thousands of RecordType2 elements, slightly different from RecordItems in RecordType1 
            </RecordItem>
        </RecordType2>
    </Records>
</TopLevelElement>

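I need to extract some of the child elements from the RecordType1 and RecordType2 elements. Conditions determine which record items need to be processed and which fields need to be extracted. Individual RecordItems do not exceed 120k (some contain a lot of text data that I don't need).

Here is the code. The function get_all_records takes the following inputs: a) the path to the XML file; b) the record category ("RecordType1" or "RecordType2"); c) which name components to select.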

from xml.etree import cElementTree as ET

def get_all_records(xml_file_path, record_category, name_types, name_components):
    context = ET.iterparse(xml_file_path, events=("start", "end"))
    context = iter(context)
    event, root = next(context)
    all_records = []
    for event, elem in context:
        if event == 'end' and elem.tag == record_category and elem.attrib['action'] != 'del':
            record_contents = get_record(elem, name_types=name_types, name_components=name_components, record_id=elem.attrib['id'])
            if record_contents:
                all_records += record_contents
            root.clear()
    return all_records

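I have experimented with the number of records: the code handles 100k RecordItems nicely in about a minute (Type1 only; getting to Type2 takes too long). Trying to process more records (I took one million) eventually leads to a MemoryError in ElementTree.py. So I guess that no memory is being released despite the root.clear() statement.

An ideal solution would read one RecordItem at a time, process it, and then discard it from memory, but I have no idea how to do that. From an XML point of view, the two extra element layers (TopLevelElement and Records) seem to complicate the task. I am new to XML and the corresponding Python libraries, so a detailed explanation would be much appreciated!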

score 1 · Accepted Answer

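Iterating over a huge XML file is always painful.

I'll go through the whole process from start to finish, suggesting best practices for keeping memory low while maximizing parsing speed.

First, there is no need to store ET.iterparse as a variable. Just iterate over it like this: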

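for event, elem in ET.iterparse(xml_file, events=("start", "end")):

This iterator was made for, well, iteration, so you work with the current tag as it is parsed rather than with a fully loaded document. Also, you don't need root.clear() with this new approach, and you can keep going with huge XML files for as long as your hard disk space allows.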

Your code should look like this:

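import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

def get_all_records(xml_file_path, record_category, name_types, name_components):
    all_records = []
    # Iterate over the parser directly; no separate context variable or root handle.
    for event, elem in ET.iterparse(xml_file_path, events=("start", "end")):
        # An element is only fully parsed once its 'end' event arrives.
        if event == 'end' and elem.tag == record_category and elem.attrib['action'] != 'del':
            # get_record is the asker's own helper (not shown in the question).
            record_contents = get_record(elem, name_types=name_types, name_components=name_components, record_id=elem.attrib['id'])
            if record_contents:
                all_records += record_contents
    return all_records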
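
If memory still grows with the loop above, one thing worth checking is that the standard-library ElementTree documentation describes iterparse as building the tree incrementally, so elements that have already been parsed stay attached to their parents until they are cleared or removed. The following is a minimal sketch, not the answer's own method, of a generator that hands back one RecordItem at a time and then detaches it from its parent. The name iter_record_items, the item_tag parameter, and the small in-memory sample are assumptions made for illustration; the tag names come from the simplified structure in the question.

```python
import io
import xml.etree.ElementTree as ET


def iter_record_items(source, record_category, item_tag="RecordItem"):
    """Yield each item_tag element whose direct parent is record_category."""
    open_elements = []  # stack of elements whose end tag has not been seen yet
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            open_elements.append(elem)
            continue
        open_elements.pop()  # elem itself has just been closed
        if elem.tag != item_tag:
            continue
        parent = open_elements[-1] if open_elements else None
        if parent is not None and parent.tag == record_category:
            yield elem  # the caller must finish with elem before resuming
        if parent is not None:
            parent.remove(elem)  # detach the finished item so it can be freed


# Hypothetical miniature of the structure shown in the question.
sample = (
    b"<TopLevelElement><Records>"
    b"<RecordType1>"
    b"<RecordItem id='aaaa'><SomeData>x</SomeData></RecordItem>"
    b"<RecordItem id='bbbb'><SomeData>y</SomeData></RecordItem>"
    b"</RecordType1>"
    b"<RecordType2><RecordItem id='cccc'/></RecordType2>"
    b"</Records></TopLevelElement>"
)

for item in iter_record_items(io.BytesIO(sample), "RecordType1"):
    print(item.get("id"))
```

With this shape the tree holds only the open ancestors plus whatever items the parser has read ahead but not yet handed over. Elements outside RecordItem, such as the SomeElementList entries, are not pruned by this sketch.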

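Also, think carefully about why you need to store the whole all_records list. If it is only there to write a CSV file at the end of the process, that reason isn't good enough, and it can cause memory problems when scaling to even larger XML files.

Make sure to write each new row to the CSV as soon as it is produced, turning memory issues into non-issues.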
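
One way this advice could look in practice, as a rough sketch under stated assumptions: each RecordItem becomes one CSV row that is written immediately, so no list of records is ever built. The function name records_to_csv, the two columns (id and the item's concatenated text), and the sample data are hypothetical; the real extraction logic would replace the line that builds the row.

```python
import csv
import io
import xml.etree.ElementTree as ET


def records_to_csv(source, out_file, item_tag="RecordItem"):
    """Write one CSV row per item_tag element as soon as it is parsed."""
    writer = csv.writer(out_file)
    writer.writerow(["id", "text"])  # hypothetical column layout
    count = 0
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == item_tag:
            # Placeholder extraction: the id attribute plus all nested text.
            text = "".join(elem.itertext()).strip()
            writer.writerow([elem.get("id"), text])
            count += 1
            elem.clear()  # drop the item's children, attributes and text
    return count


sample = (
    b"<Records>"
    b"<RecordItem id='aaaa'><SomeData>first</SomeData></RecordItem>"
    b"<RecordItem id='bbbb'><SomeData>second</SomeData></RecordItem>"
    b"</Records>"
)
buffer = io.StringIO()
written = records_to_csv(io.BytesIO(sample), buffer)
print(written)
print(buffer.getvalue())
```

For a real file, out_file would typically come from open(path, "w", newline=""), which is how the csv module documentation recommends opening files for writing.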

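P.S.

If you need to store several tags before the main tag is found, so that you can use that earlier information as you go through the XML file, just store it locally in some new variables. This comes in handy whenever later data in the XML file sends you back to a specific tag that you know has already occurred.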
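
As an illustration of keeping earlier tags in local variables, here is a small sketch that remembers which record category is currently open and attaches it to every RecordItem it meets. The function name items_with_category, the default tag names passed as parameters, and the sample document are assumptions based on the simplified structure in the question.

```python
import io
import xml.etree.ElementTree as ET


def items_with_category(source,
                        categories=("RecordType1", "RecordType2"),
                        item_tag="RecordItem"):
    """Yield (category, item id) pairs, remembering the enclosing category."""
    current_category = None  # local state carried from earlier tags
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if elem.tag in categories:
                current_category = elem.tag  # entering a category element
        elif elem.tag in categories:
            current_category = None  # leaving the category element
        elif elem.tag == item_tag:
            yield current_category, elem.get("id")
            elem.clear()  # the item's contents are no longer needed


sample = (
    b"<TopLevelElement><Records>"
    b"<RecordType1><RecordItem id='aaaa'/><RecordItem id='bbbb'/></RecordType1>"
    b"<RecordType2><RecordItem id='cccc'/></RecordType2>"
    b"</Records></TopLevelElement>"
)

for category, item_id in items_with_category(io.BytesIO(sample)):
    print(category, item_id)
```

The same pattern extends to any other context you need later, for example a value read from SomeElementList before the records start.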
