python - Parsing a large XML file with Python - etree.parse error

Question

I am trying to parse the following XML file in Python with the lxml.etree.iterparse function.

"sampleoutput.xml"

<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
</item>

I tried the code from "Parsing Large XML file with Python lxml and Iterparse".

Before the etree.iterparse(MYFILE) call, I did MYFILE = open("/Users/eric/Desktop/wikipedia_map/sampleoutput.xml","r")

But it fails with the following error:

Traceback (most recent call last):
  File "/Users/eric/Documents/Programming/Eclipse_Workspace/wikipedia_mapper/testscraper.py", line 6, in <module>
    for event, elem in context :
  File "iterparse.pxi", line 491, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:98565)
  File "iterparse.pxi", line 543, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:99086)
  File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74712)
lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 5, column 1

Any ideas? Thank you!

score 14 · Accepted Answer

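The problem is that XML is not well-formed unless it has exactly one top-level element. You can fix your sample by wrapping the entire document in <items></items> tags. You also need to rename the <desc> tags to <description> so that they match the query you are using (description).

With your existing code, the following document produces the correct result: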

<items>
  <item>
    <title>Item 1</title>
    <description>Description 1</description>
  </item>
  <item>
    <title>Item 2</title>
    <description>Description 2</description>
  </item>
</items>
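
To check the corrected document, here is a minimal sketch that streams it with the standard library's xml.etree.ElementTree.iterparse; lxml.etree.iterparse accepts the same basic (source, events) call. The iter_items helper and the in-memory FIXED_XML bytes are illustrative assumptions, not the code from the question; in practice you would pass the opened file instead.

```python
import io
import xml.etree.ElementTree as ET

# The corrected document from above, held in memory for the example.
FIXED_XML = b"""<items>
  <item>
    <title>Item 1</title>
    <description>Description 1</description>
  </item>
  <item>
    <title>Item 2</title>
    <description>Description 2</description>
  </item>
</items>"""


def iter_items(source):
    """Yield (title, description) for each <item> in a file-like source.

    Hypothetical helper: it reacts to "end" events so that an element's
    children are complete before they are read.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "item":
            yield elem.findtext("title"), elem.findtext("description")
            # Drop the element's children so memory use stays low
            # when the real file is large.
            elem.clear()


items = list(iter_items(io.BytesIO(FIXED_XML)))
print(items)
```

For a file on disk, something like `with open(path, "rb") as f: ...iter_items(f)` should work the same way.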

score 5 · Accepted Answer

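As far as I know, xml.etree.ElementTree generally expects an XML file to contain a single "root" element, i.e. one XML tag that encloses the whole document structure. Judging from the error message you posted, I think that is the problem here as well:

"line 5" refers to the second <item> tag, so my guess is that Python is complaining that there is more data after the presumed root element (i.e. the first <item> tag) has been closed.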
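
The diagnosis above can be reproduced with a short sketch. If editing a large file to add a root element is impractical, one possible workaround is to feed a synthetic root around the data while parsing. This uses the standard library's xml.etree.ElementTree.XMLPullParser rather than lxml; parse_rootless and the items wrapper name are hypothetical, and the approach assumes the file has no XML declaration or DOCTYPE at the top.

```python
import xml.etree.ElementTree as ET

# The rootless sample from the question.
ROOTLESS_XML = """<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
</item>
"""

# Parsing the fragment as-is fails once the first <item> has closed,
# because the parser treats that first <item> as the root element.
try:
    ET.fromstring(ROOTLESS_XML)
    error_message = None
except ET.ParseError as exc:
    error_message = str(exc)


def parse_rootless(chunks, root_tag="items"):
    """Yield (title, desc) pairs from text chunks that lack a root element.

    Hypothetical helper: it wraps the stream in <root_tag>...</root_tag>
    on the fly instead of modifying the file.
    """
    parser = ET.XMLPullParser(events=("end",))
    parser.feed(f"<{root_tag}>")  # synthetic opening root tag
    for chunk in chunks:
        parser.feed(chunk)
        for _event, elem in parser.read_events():
            if elem.tag == "item":
                yield elem.findtext("title"), elem.findtext("desc")
                elem.clear()  # release children already processed
    parser.feed(f"</{root_tag}>")  # synthetic closing root tag
    parser.close()


results = list(parse_rootless(ROOTLESS_XML.splitlines(keepends=True)))
print(error_message)
print(results)
```

With a real file opened in text mode, iterating over the file object line by line would supply the chunks.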
