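I have to process fairly large XML documents (up to 1 GB) and parse them with Python. I am using the iterparse() function (SAX-style parsing).
My concern is the following. Suppose you have an XML file like this: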
<?xml version="1.0" encoding="UTF-8" ?>
<families>
    <family>
        <name>Simpson</name>
        <members>
            <name>Homer</name>
            <name>Marge</name>
            <name>Bart</name>
        </members>
    </family>
    <family>
        <name>Griffin</name>
        <members>
            <name>Peter</name>
            <name>Brian</name>
            <name>Meg</name>
        </members>
    </family>
</families>
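The problem, of course, is knowing when I am getting a family name (such as Simpson) and when I am getting the name of one of that family's members (for example Homer).
What I have been doing so far is using a "switch" that tells me whether I am inside the "members" tag. The code looks like this: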
import xml.etree.ElementTree as ET

__author__ = 'moriano'

file_path = "test.xml"
# Ask for both events: "start" to see <members> open, "end" to read text
context = ET.iterparse(file_path, events=("start", "end"))

on_members_tag = False  # True while between <members> and </members>
for event, elem in context:
    tag = elem.tag
    if event == 'start':
        if tag == "members":
            on_members_tag = True
    else:
        # elem.text is only guaranteed to be complete on the "end" event
        value = elem.text.strip() if elem.text else None
        if tag == 'name':
            if on_members_tag:
                print("The member of the family is %s" % value)
            else:
                print("The family is %s" % value)
        elif tag == 'members':
            on_members_tag = False
        # free the finished element to keep memory use low
        elem.clear()
This works fine, as the output is:
The family is Simpson
The member of the family is Homer
The member of the family is Marge
The member of the family is Bart
The family is Griffin
The member of the family is Peter
The member of the family is Brian
The member of the family is Meg
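My concern is that even for this (simple) example I have to create an extra variable (on_members_tag) to know which tag I am in. Imagine the real XML documents I have to handle, which have many more nested tags.
Also note that this is a heavily simplified example, so you can assume I may face XML with more tags, more inner tags, and a need to extract different tag names, attributes, and so on.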
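To make the worry concrete, here is a minimal sketch of how the on_members_tag flag might generalize: instead of one boolean per tag, keep a stack of the currently open tag names and match on the tail of that path. This is only an assumption about one possible direction, not an established pattern. The helper names iter_paths and describe are hypothetical, the sample document is inlined so the snippet runs on its own, and it assumes Python 3.

```python
import io
import xml.etree.ElementTree as ET

# Sample document from the question, inlined so the sketch is self-contained.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<families>
    <family>
        <name>Simpson</name>
        <members>
            <name>Homer</name>
            <name>Marge</name>
            <name>Bart</name>
        </members>
    </family>
    <family>
        <name>Griffin</name>
        <members>
            <name>Peter</name>
            <name>Brian</name>
            <name>Meg</name>
        </members>
    </family>
</families>"""


def iter_paths(source):
    """Yield (path, text) for each element as it closes.

    `path` is a tuple of tag names from the root down to the element.
    (Hypothetical helper, not part of ElementTree.)
    """
    stack = []  # tag names of the currently open elements
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
        else:
            # Text is only guaranteed to be complete on the "end" event.
            yield tuple(stack), (elem.text or "").strip()
            stack.pop()
            elem.clear()  # drop the finished element's text and children


def describe(source):
    """Collect the same lines the flag-based loop prints (hypothetical helper)."""
    lines = []
    for path, text in iter_paths(source):
        # Match on the last two path components instead of a boolean flag.
        if path[-2:] == ("family", "name"):
            lines.append("The family is %s" % text)
        elif path[-2:] == ("members", "name"):
            lines.append("The member of the family is %s" % text)
    return lines


for line in describe(io.BytesIO(SAMPLE)):
    print(line)
```

With this shape, a deeper document would need extra path patterns rather than extra flags, but I am not sure whether that is really any cleaner.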
So the question is: am I doing something very silly here? I feel there must be a more elegant solution.