python - Parsing large XML data with Python's ElementTree

Question

I am currently learning how to parse XML data with ElementTree. I get this error message: ParseError: not well-formed (invalid token): line 1, column 2.

My code is below, followed by some of the XML data.

import xml.etree.ElementTree as ET

tree = ET.fromstring("C:\pbc.xml")
root = tree.getroot()


for article in root.findall('article'):
    print ' '.join([t.text for t in article.findall('title')])
    for author in article.findall('author'):
        print 'Author name: {}'.format(author.text)
    for journal in article.findall('journal'):  # all venue tags with id attribute
        print 'journal'

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2002-01-03" key="persons/Codd71a">
<author>E. F. Codd</author>
<title>Further Normalization of the Data Base Relational Model.</title>
<journal>IBM Research Report, San Jose, California</journal>
<volume>RJ909</volume>
<month>August</month>
<year>1971</year>
<cdrom>ibmTR/rj909.pdf</cdrom>
<ee>db/labs/ibm/RJ909.html</ee>
</article>

<article mdate="2002-01-03" key="persons/Hall74">
<author>Patrick A. V. Hall</author>
<title>Common Subexpression Identification in General Algebraic Systems.</title>
<journal>Technical Rep. UKSC 0060, IBM United Kingdom Scientific Centre</journal>
<month>November</month>
<year>1974</year>
</article>

score 1 · Accepted Answer

You are using .fromstring() instead of .parse():

import xml.etree.ElementTree as ET

tree = ET.parse(r"C:\pbc.xml")  # raw string so the backslash is taken literally
root = tree.getroot()

.fromstring() expects the XML data itself, passed as a string, not a filename.
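A minimal sketch of the difference, assuming a small throwaway file written to a temporary directory (the file name and its contents are made up for illustration and are not the asker's data). Passing the path to fromstring() makes the parser treat the path text itself as XML, while parse() opens the file.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample data, not the asker's real file.
SAMPLE = b"<dblp><article><title>Example</title></article></dblp>"


def compare_fromstring_and_parse(path):
    """Return (error raised by fromstring(path), root tag found by parse(path))."""
    try:
        ET.fromstring(path)  # the path string is parsed as if it were XML
        error = None
    except ET.ParseError as exc:
        error = exc
    root = ET.parse(path).getroot()  # parse() opens and reads the file
    return error, root.tag


with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.xml")
    with open(sample_path, "wb") as f:
        f.write(SAMPLE)
    error, root_tag = compare_fromstring_and_parse(sample_path)

print(type(error).__name__, root_tag)
```

The exact wording and position in the ParseError message depend on the path text, which is presumably why the question's error points at line 1, column 2.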

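If the document is really large (many megabytes or more), you should use the ET.iterparse() function instead and clear the elements you have already processed: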

for event, article in ET.iterparse('C:\\pbc.xml'):
    # the standard-library iterparse() has no tag argument, so filter by hand
    if article.tag != 'article':
        continue
    for title in article.findall('title'):
        print('Title: {}'.format(title.text))
    for author in article.findall('author'):
        print('Author name: {}'.format(author.text))
    for journal in article.findall('journal'):
        print('journal')

    article.clear()
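The same streaming pattern can be tried without a file on disk. The sketch below assumes an in-memory stand-in for the large document; the sample data and the helper name iter_articles are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical in-memory stand-in for a large file.
SAMPLE = b"""<dblp>
<article><author>E. F. Codd</author><title>First title</title></article>
<article><author>Patrick A. V. Hall</author><title>Second title</title></article>
</dblp>"""


def iter_articles(source):
    """Yield (title, authors) for each <article>, clearing each element afterwards."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "article":
            continue
        # Copy out plain strings before clearing the element.
        title = elem.findtext("title")
        authors = [a.text for a in elem.findall("author")]
        elem.clear()  # drop the element's children, text and attributes
        yield title, authors


articles = list(iter_articles(io.BytesIO(SAMPLE)))
for title, authors in articles:
    print(title, authors)
```

Cleared elements stay attached to the root as empty shells, so memory still grows a little with each article; for most files that overhead is small compared with keeping the full subtrees.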

score 1 · Accepted Answer

import xml.etree.ElementTree as ET

with open(r"C:\pbc.xml", 'rb') as f:
    root = ET.fromstring(f.read().strip())

Unlike ET.parse, ET.fromstring expects a string containing the XML content, not a filename.

Also unlike ET.parse, ET.fromstring returns the root element, not a tree. So you should omit

root = tree.getroot()
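A quick way to see the two return types side by side, as a sketch on a tiny made-up document:

```python
import io
import xml.etree.ElementTree as ET

XML_BYTES = b"<dblp><article/></dblp>"  # hypothetical sample

root_from_string = ET.fromstring(XML_BYTES)         # an Element
tree_from_parse = ET.parse(io.BytesIO(XML_BYTES))   # an ElementTree
root_from_parse = tree_from_parse.getroot()         # the Element inside the tree

print(type(root_from_string).__name__, type(tree_from_parse).__name__)
print(hasattr(root_from_string, "getroot"))
```

Because the object returned by fromstring() is already the root element, it has no getroot() method to call.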

In addition, the XML snippet you posted needs a closing </dblp> before it will parse. I assume your real data has that closing tag...
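The effect of the missing closing tag can be reproduced with a shortened copy of the posted snippet. This is only a sketch; the helper try_parse is a hypothetical name introduced here.

```python
import xml.etree.ElementTree as ET

# Shortened from the snippet in the question, without the final </dblp>.
TRUNCATED = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2002-01-03" key="persons/Codd71a">
<author>E. F. Codd</author>
<year>1971</year>
</article>
"""


def try_parse(data):
    """Return (root, None) on success or (None, error) on a ParseError."""
    try:
        return ET.fromstring(data), None
    except ET.ParseError as exc:
        return None, exc


truncated_root, truncated_error = try_parse(TRUNCATED)
complete_root, complete_error = try_parse(TRUNCATED + b"</dblp>")

print(truncated_error)
print(complete_root.tag, len(complete_root))
```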

The iterparse provided by xml.etree.ElementTree has no tag parameter, although lxml.etree.iterparse does have one.

Try:

import xml.etree.ElementTree as ET
from html.entities import name2codepoint  # Python 3; Python 2 used htmlentitydefs

filename = "test.xml"
# http://stackoverflow.com/a/10792473/190597 (lambacck)
# Map the named HTML entities (e.g. &uuml;) to their characters.
parser = ET.XMLParser()
parser.entity.update((name, chr(cp)) for name, cp in name2codepoint.items())
context = ET.iterparse(filename, events=('end',), parser=parser)
for event, elem in context:
    if elem.tag == 'article':
        for author in elem.findall('author'):
            print('Author name: {}'.format(author.text))
        for journal in elem.findall('journal'):  # every <journal> child of this article
            print(journal.text)
        elem.clear()  # free the processed article

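Note: for iterparse to work, your XML must be well-formed. In particular, the <?xml ...?> declaration has to be the very first thing in the file, so there cannot be a blank line before it.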
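A small sketch of that pitfall, using a made-up two-line document: a leading blank line before the XML declaration is enough to make the parser reject the input, and stripping the data first (as in the fromstring example above) avoids it. The helper parses_cleanly is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
DOC = b'<?xml version="1.0" encoding="ISO-8859-1"?>\n<dblp><article/></dblp>'
WITH_BLANK_LINE = b"\n" + DOC


def parses_cleanly(data):
    """Return True if the data parses, False if it raises ParseError."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


print(parses_cleanly(DOC))
print(parses_cleanly(WITH_BLANK_LINE))
print(parses_cleanly(WITH_BLANK_LINE.strip()))
```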

score 0 · Accepted Answer

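You are better off not feeding the XML file's meta information to the parser. If the tags are properly closed, the parser will do fine. The <?xml line may not be recognized by the parser, so leave out the first two lines and try again. :-)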
