python - lxml etree.parse memory allocation error

Question

I am using lxml's etree.parse to parse a huge XML file (roughly 65MB - 300MB). When I run a standalone Python script containing the function below, I get a memory allocation failure:

Error:

     Memory allocation failed : xmlSAX2Characters, line 5350155, column 16

Part of the function:

def getID():
    try:
        from lxml import etree
        xml = etree.parse(<xml_file>)  # here is where the failure occurs
        for element in xml.iter():
            ...
            result = <formed by concatenating element texts>
        return result
    except Exception as ex:
        <handle exception>

Strangely, when I type the same function into IDLE and test it against the same XML file, I don't get any memory allocation error.

Any ideas about this problem? Thanks in advance.

score 3 · Accepted Answer

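I would use an iterative parser to parse the document, calling .clear() on every element you are done with; that way you never have to hold the whole document in memory at once.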
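As a rough illustration of that idea applied to the function in the question, here is a minimal sketch. It assumes the result is simply the text of every element joined together; the helper name `concat_texts` and the sample document are made up for the example.

```python
import io
from lxml import etree

def concat_texts(source):
    """Join the text of every element without building the full tree.

    `source` may be a file name or a binary file-like object.
    """
    parts = []
    # By default iterparse yields "end" events, so each element is
    # fully parsed (text and children included) when we see it.
    for _, element in etree.iterparse(source):
        if element.text:
            parts.append(element.text)
        # Drop the element's text, attributes and children once used.
        element.clear()
    return "".join(parts)

# Hypothetical in-memory document standing in for the large file.
sample = io.BytesIO(b"<root><id>ab</id><id>cd</id></root>")
result = concat_texts(sample)
```

Note that texts are collected in the order elements are closed, so a parent's own text comes after the text of its children. If document order matters, the logic would need adjusting.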

You can restrict the iterative parser to only the tags you are interested in. If you only want to process <person> tags, tell the parser so:

from lxml import etree

# `input` stands for the XML file name or an open file object
for _, element in etree.iterparse(input, tag='person'):
    # process your person data
    element.clear()

By clearing each element inside the loop, you release its contents from memory.
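One caveat: .clear() empties an element, but the emptied element itself stays attached to its parent, so on a very large file those leftovers can still add up. A commonly used refinement is to also delete the already-processed siblings. The sketch below shows this under the assumption that each <person> has a <name> child; that layout and the function name `iter_person_names` are invented for the example.

```python
import io
from lxml import etree

def iter_person_names(source):
    """Yield the <name> text of each <person>, freeing memory as we go."""
    for _, person in etree.iterparse(source, tag="person"):
        yield person.findtext("name")
        # Release the element's own contents.
        person.clear()
        # Remove earlier, already-processed siblings that the parent
        # still references, so they do not pile up.
        while person.getprevious() is not None:
            del person.getparent()[0]

# Hypothetical sample data standing in for the large file.
sample = io.BytesIO(
    b"<people>"
    b"<person><name>Ann</name></person>"
    b"<person><name>Bob</name></person>"
    b"</people>"
)
names = list(iter_person_names(sample))
```

Writing it as a generator keeps only one <person> subtree alive at a time, provided the caller does not hold on to the elements.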
