I have one large file (400 MB) that contains hundreds of XML documents, each with its own declaration. I am trying to parse each document using ElementTree in Python, but I am having a lot of trouble splitting the file into individual XML documents so that I can parse out the information. Here is an example of what the file looks like:

<?xml version="1.0"?>
<data>
    <more>
       <p></p>
    </more>
</data>
<?xml version="1.0"?>
<different data>
    <etc>
       <p></p>
    </etc>
</different data>
<?xml version="1.0"?>
<continues.....>

Ideally I would like to read through each XML declaration, parse the data, and continue on with the next XML document. Any suggestions will help.

1 Answer

You need to read the documents separately; here is a generator function that yields complete XML documents from a given file object:

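def xml_documents(fileobj):
    """Yield each XML document in fileobj as a single string."""
    document = []
    for line in fileobj:
        # A new XML declaration marks the start of the next document,
        # so emit whatever has been collected so far.
        if line.strip().startswith('<?xml') and document:
            yield ''.join(document)
            document = []
        document.append(line)

    # Emit the final document, which has no declaration after it.
    if document:
        yield ''.join(document)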

Then use ElementTree.fromstring() to load and parse each of them:

from xml.etree import ElementTree

with open('file_with_multiple_xmldocuments') as fileobj:
    for xml in xml_documents(fileobj):
        # fromstring() returns the root Element of each document
        root = ElementTree.fromstring(xml)
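
To try the approach without the real 400 MB file, here is a small self-contained sketch that runs the same generator over an in-memory sample. The `SAMPLE` text and the `root_tags` helper are made up for illustration and are not part of the original question. The sketch also assumes every `<?xml` declaration starts on its own line; a declaration that follows a closing tag on the same line would not be detected by this line-based split.

```python
import io
from xml.etree import ElementTree


def xml_documents(fileobj):
    """Yield each XML document in fileobj as a single string."""
    document = []
    for line in fileobj:
        # A new declaration closes the document collected so far.
        if line.strip().startswith('<?xml') and document:
            yield ''.join(document)
            document = []
        document.append(line)
    if document:
        yield ''.join(document)


# Hypothetical sample standing in for the large multi-document file.
SAMPLE = (
    '<?xml version="1.0"?>\n'
    '<data>\n'
    '    <more><p>first</p></more>\n'
    '</data>\n'
    '<?xml version="1.0"?>\n'
    '<other>\n'
    '    <etc><p>second</p></etc>\n'
    '</other>\n'
)


def root_tags(fileobj):
    """Parse every document in fileobj and return the root tag names."""
    return [ElementTree.fromstring(xml).tag for xml in xml_documents(fileobj)]


tags = root_tags(io.StringIO(SAMPLE))
print(tags)  # ['data', 'other']
```

Because the generator holds only one document's lines at a time, memory use should track the largest single document rather than the whole file.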
answered 2013-03-26T18:52:26.577