python - Parsing large combined XML document with Python

Question

I have one large file (400 MB) that contains hundreds of XML documents, each with its own declaration. I am trying to parse each document using ElementTree in Python, but I am having a lot of trouble splitting the file into individual XML documents so that I can parse out the information. Here is an example of what the file looks like:

<?xml version="1.0"?>
<data>
    <more>
       <p></p>
    </more>
</data>
<?xml version="1.0"?>
<different data>
    <etc>
       <p></p>
    </etc>
</different data>
<?xml version="1.0"?>
<continues.....>

Ideally I would like to read through each XML declaration, parse the data, and continue on with the next XML document. Any suggestions will help.

score 2 · Accepted Answer

You need to read the documents separately; here is a generator function that yields complete XML documents from a given file object:

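def xml_documents(fileobj):
    """Yield each XML document in fileobj as a single string."""
    document = []
    for line in fileobj:
        # A new XML declaration marks the start of the next document,
        # so emit the lines collected so far.
        if line.strip().startswith('<?xml') and document:
            yield ''.join(document)
            document = []
        document.append(line)

    # Emit the final document, which has no declaration after it.
    if document:
        yield ''.join(document)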

Then use ElementTree.fromstring() to load and parse each one:

from xml.etree import ElementTree

with open('file_with_multiple_xmldocuments') as fileobj:
    for xml in xml_documents(fileobj):
        # fromstring() returns the root Element of each document
        root = ElementTree.fromstring(xml)
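
To try the approach without the 400 MB file, here is a minimal self-contained sketch. It assumes every XML declaration starts on its own line, which is what the generator relies on. The SAMPLE string, the root_tags helper and the element names in it are made up for illustration and are not part of the original data; io.StringIO stands in for the opened file.

```python
import io
from xml.etree import ElementTree


def xml_documents(fileobj):
    """Yield each concatenated XML document in fileobj as one string."""
    document = []
    for line in fileobj:
        if line.strip().startswith('<?xml') and document:
            yield ''.join(document)
            document = []
        document.append(line)
    if document:
        yield ''.join(document)


# Hypothetical sample standing in for the large combined file.
SAMPLE = (
    '<?xml version="1.0"?>\n'
    '<data>\n'
    '    <more><p>first</p></more>\n'
    '</data>\n'
    '<?xml version="1.0"?>\n'
    '<other>\n'
    '    <etc><p>second</p></etc>\n'
    '</other>\n'
)


def root_tags(fileobj):
    """Parse every document; collect (root tag, text of the first <p>)."""
    results = []
    for xml in xml_documents(fileobj):
        root = ElementTree.fromstring(xml)
        results.append((root.tag, root.findtext('.//p')))
    return results


summary = root_tags(io.StringIO(SAMPLE))
print(summary)
```

Because the generator holds only the lines of the current document, memory use should track the largest single document rather than the whole file. If a declaration can appear in the middle of a line, the line-based split would need to be adapted.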
