python - Parsing a large XML file with the 'xmltodict' module results in OverflowError

Question

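I have a fairly large XML file, about 3 GB, that I want to parse in streaming mode with the 'xmltodict' utility. My code iterates over each item, builds a dictionary entry, appends it to an in-memory structure, and finally dumps the result to a file as JSON.

The following works perfectly on a small XML data set: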

    import xmltodict, json
    import io

    output = []

    def handle(path, item):
       #do stuff
       return

    doc_file = open("affiliate_partner_feeds.xml","r")
    doc = doc_file.read()        
    xmltodict.parse(doc, item_depth=2, item_callback=handle)

    f = open('jbtest.json', 'w')
    json.dump(output,f)

On the large file, I get the following:

    Traceback (most recent call last):
      File "jbparser.py", line 125, in <module>
        xmltodict.parse(doc, item_depth=2, item_callback=handle)
      File "/usr/lib/python2.7/site-packages/xmltodict.py", line 248, in parse
        parser.Parse(xml_input, True)
    OverflowError: size does not fit in an int

The exact location of the exception in xmltodict.py is the parser.Parse(xml_input, True) call near the end of parse():

    def parse(xml_input, encoding=None, expat=expat, process_namespaces=False,
              namespace_separator=':', **kwargs):

        handler = _DictSAXHandler(namespace_separator=namespace_separator,
                                  **kwargs)
        if isinstance(xml_input, _unicode):
            if not encoding:
                encoding = 'utf-8'
            xml_input = xml_input.encode(encoding)
        if not process_namespaces:
            namespace_separator = None
        parser = expat.ParserCreate(
            encoding,
            namespace_separator
        )
        try:
            parser.ordered_attributes = True
        except AttributeError:
            # Jython's expat does not support ordered_attributes
            pass
        parser.StartElementHandler = handler.startElement
        parser.EndElementHandler = handler.endElement
        parser.CharacterDataHandler = handler.characters
        parser.buffer_text = True
        try:
            parser.ParseFile(xml_input)
        except (TypeError, AttributeError):
            parser.Parse(xml_input, True)
        return handler.item

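Is there any way to get around this? AFAIK, the xmlparser object is not exposed, so I cannot get at it to change 'int' to 'long'. More importantly, what is actually going on here? Any leads on this would be much appreciated. Thanks!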

score 0 · Accepted Answer

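Pass the open file object (or another stream, such as a GzipFile) to xmltodict.parse() instead of reading the whole file into memory and parsing it as a single string; the parser can then read the input in chunks. Alternatively, run xmltodict as a command-line script and read the items it writes to standard output with marshal.load(sys.stdin) (or marshal.load(file)) in a second script.

Here is an example: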

    >>> import xmltodict
    >>> from gzip import GzipFile
    >>> def handle_artist(_, artist):
    ...     print(artist['name'])
    ...     return True
    >>> 
    >>> xmltodict.parse(GzipFile('discogs_artists.xml.gz'),
    ...     item_depth=2, item_callback=handle_artist)
    A Perfect Circle
    Fantômas
    King Crimson
    Chris Potter
    ...
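Applied to the code in the question, the same idea might look like the sketch below. It assumes the file names from the question; JsonArrayWriter and convert are hypothetical helper names introduced here for illustration and are not part of xmltodict. Because collecting 3 GB worth of items in a list could itself exhaust memory, the sketch writes each item to the JSON file as it arrives rather than appending to output.

```python
import json


class JsonArrayWriter(object):
    """Hypothetical helper: writes items to a JSON array one at a time."""

    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.count = 0
        self.fileobj.write("[")

    def handle(self, path, item):
        # xmltodict calls this once for every element found at item_depth.
        if self.count:
            self.fileobj.write(",\n")
        json.dump(item, self.fileobj)
        self.count += 1
        return True  # a true return value tells xmltodict to keep parsing

    def close(self):
        self.fileobj.write("]")


def convert(xml_path, json_path):
    # Third-party dependency; imported here so the writer works without it.
    import xmltodict

    with open(xml_path, "rb") as xml_file, open(json_path, "w") as json_file:
        writer = JsonArrayWriter(json_file)
        # Pass the file object itself, not the result of read(), so the
        # parser can pull the document in chunks.
        xmltodict.parse(xml_file, item_depth=2, item_callback=writer.handle)
        writer.close()
```

Calling convert("affiliate_partner_feeds.xml", "jbtest.json") would then replace the open/read/parse/dump sequence from the question.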

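Reading from standard input, when the items are piped in from xmltodict run as a script:

    import sys, marshal
    while True:
        try:
            _, article = marshal.load(sys.stdin)
        except EOFError:
            break  # no more records on stdin
        print(article['title'])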