python - Querying XML repeatedly with Python

Question

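I have some XML documents that I need to run queries against. I've written a few Python scripts (using ElementTree) to do this, since I'm somewhat familiar with it.

The way it works is that I run the scripts several times with different arguments, depending on what I want to find out.

These files can be fairly large (10MB+), so parsing them takes quite a while. On my system, just running: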

tree = ElementTree.parse(document)

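takes around 30 seconds, and the subsequent findall query adds only about a second on top of that.

Since my approach requires re-parsing the file on every run, I was wondering whether there is some kind of caching mechanism I could use so that the ElementTree.parse cost is reduced on subsequent queries.

I realize the smart thing to do here may be to batch as many queries as possible into a single run of the Python script, but I was hoping there might be another way.

Thanks.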

score 3 · Accepted Answer

While I second the suggestion to use lxml, you can get a huge performance boost just by using the built-in cElementTree.

from xml.etree import cElementTree as ElementTree
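A note for newer interpreters: cElementTree was deprecated in Python 3.3 and removed in Python 3.9, and since 3.3 the plain ElementTree module picks up the C accelerator on its own when it is available. A minimal sketch of an import that works on both old and new versions might look like this (the tiny inline document is made up purely for illustration):

```python
# cElementTree only exists on Python 2 and on Python 3 up to 3.8.
# From 3.3 on, xml.etree.ElementTree uses the C implementation
# automatically when it is available, so the fallback loses nothing.
try:
    from xml.etree import cElementTree as ElementTree
except ImportError:
    from xml.etree import ElementTree

# Hypothetical sample document, only here to show the import works.
root = ElementTree.fromstring("<note><to>Tove</to><from>Jani</from></note>")
tags = [child.tag for child in root]
print(tags)
```

The rest of the script can keep referring to ElementTree regardless of which import succeeded.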

score 1 · Accepted Answer

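First, consider using the lxml implementation of ElementTree: http://lxml.de/
It is a wrapper around libxml2, which I've found performs well.

Run Python interactively and make multiple queries against the same etree object. ipython is an enhanced interactive Python interpreter with easy access to introspection and convenience syntax.

For example, using ipython to interactively inspect note.xml with lxml.etree: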

$ ipython
Python 2.5.1 (r251:54863, Jul 10 2008, 17:24:48)
Type "copyright", "credits" or "license" for more information.

IPython 0.8.2 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object'. ?object also works, ?? prints more.

In [1]: from lxml import etree

In [2]: doc = etree.parse(open("note.xml"))

In [3]: etree.dump(doc.getroot())
<note>
        <to>Tove</to>
        <from>Jani</from>
        <heading>Reminder</heading>
        <body>Don't forget me this weekend!</body>
</note>
In [4]: doc.xpath('/note/*')
Out[4]:
[<Element to at 89cf02c>,
 <Element from at 89cf054>,
 <Element heading at 89cf07c>,
 <Element body at 89cf0a4>]
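
The same "parse once, query many times" idea also works outside an interactive session. Here is a rough sketch using the standard-library ElementTree so it runs without lxml installed. The run_queries helper and the in-memory SAMPLE are illustrative assumptions, not part of any library; in practice you would pass the path of your real file instead of the BytesIO object.

```python
import io
from xml.etree import ElementTree

# Stand-in for a large file on disk; the content mirrors note.xml above.
SAMPLE = b"""<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>"""


def run_queries(source, paths):
    """Parse `source` once and evaluate every path against the same tree."""
    tree = ElementTree.parse(source)  # the expensive step, done a single time
    # findall() on the tree searches relative to the root element.
    return {path: [el.text for el in tree.findall(path)] for path in paths}


results = run_queries(io.BytesIO(SAMPLE), ["to", "from", "./heading"])
for path, values in results.items():
    print(path, values)
```

With this shape, the 30-second parse is paid once per run no matter how many paths you pass in.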

score 1 · Accepted Answer

Seconding the lxml recommendation, look at this article for how to improve performance by using an iterative (SAX-like) parsing method. It can be a pain at first since it can turn really procedural and messy, but it makes things faster. As you can see from these benchmarks, lxml is most likely your best bet for performance.
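
To give an idea of what iterative parsing looks like, here is a minimal sketch using iterparse from the standard-library ElementTree (lxml.etree offers an iterparse function as well). This is not the code from the article mentioned above; the log/entry document and the error_messages helper are made up for illustration.

```python
import io
from xml.etree import ElementTree

# Hypothetical sample data: four <entry> elements with alternating levels.
entries = "".join(
    '<entry level="{}">message {}</entry>'.format(level, i)
    for i, level in enumerate(["info", "error", "info", "error"])
)
SAMPLE = ("<log>" + entries + "</log>").encode("utf-8")


def error_messages(source):
    """Stream through `source`, collecting the text of level="error" entries."""
    found = []
    # "end" events fire once an element has been read completely.
    for event, elem in ElementTree.iterparse(source, events=("end",)):
        if elem.tag == "entry":
            if elem.get("level") == "error":
                found.append(elem.text)
            elem.clear()  # discard the element's contents once handled
    return found


messages = error_messages(io.BytesIO(SAMPLE))
print(messages)
```

The procedural feel is visible even in this small case: the query logic lives inside the parsing loop instead of in a findall or xpath expression.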
