I have an XML string:

<tags>
   <person1>dave jones</person1>
   <person2>ron matthews</person2>
   <person3>sally van heerden</person3>
   <place>tygervalley</place>
   <ocassion>shopping</ocassion>
</tags>

I want to search this XML string using search terms such as "Sally Van Heerden" or "Tygervalley".

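Is it faster to look for the term with a regular expression, or is Python's find() method fast enough? I could also use Python's ElementTree XML parser, building the XML tree and then searching it, but I am worried that will be too slow.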

Which of the three is fastest? Any other suggestions?

2 Answers

The answer will really depend on what you are going to do with the search results. The only case when you should even consider not using an XML parser is when you don't remotely care about the XML document structure.

If that is the case, you can time all three, but building a tree is then unnecessary and is unlikely to compete with a plain substring search.
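To make the trade-off concrete, here is a minimal sketch of the three approaches applied to the XML string from the question. The function names are hypothetical, the standard-library `xml.etree.ElementTree` stands in for whichever parser you prefer, and the case-insensitive comparison is an assumption based on the capitalised search terms in the question.

```python
import re
import xml.etree.ElementTree as ET

XML = """<tags>
   <person1>dave jones</person1>
   <person2>ron matthews</person2>
   <person3>sally van heerden</person3>
   <place>tygervalley</place>
   <ocassion>shopping</ocassion>
</tags>"""


def found_by_substring(xml_text, term):
    # Plain substring test; lowercasing both sides makes it case-insensitive.
    return term.lower() in xml_text.lower()


def found_by_regex(xml_text, term):
    # re.escape keeps characters in the term from being read as regex syntax.
    return re.search(re.escape(term), xml_text, re.IGNORECASE) is not None


def tags_containing(xml_text, term):
    # Parse the document and report which elements hold the term as their text.
    root = ET.fromstring(xml_text)
    wanted = term.lower()
    return [el.tag for el in root.iter()
            if (el.text or "").strip().lower() == wanted]


print(found_by_substring(XML, "Tygervalley"))     # True
print(found_by_regex(XML, "Sally Van Heerden"))   # True
print(tags_containing(XML, "Sally Van Heerden"))  # ['person3']
print(found_by_substring(XML, "place"))           # True: it matched a tag name
print(tags_containing(XML, "place"))              # []
```

The last two lines show what ignoring the structure costs: the substring test cannot tell a tag name from element text, whereas the parsed tree can.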

Time all three to see the difference on a typical file for your problem. For instance, on your small example file:

$ python -m timeit "any('tygervalley' in line for line in open('t.xml'))"
100000 loops, best of 3: 14.6 usec per loop

$ python -m timeit "import re" "for line in open('t.xml'):" "    re.findall('tygervalley', line)"
10000 loops, best of 3: 27.4 usec per loop


$ python -m timeit "from lxml.etree import parse" "tree = parse('t.xml')" "tree.xpath('//*[text()=\'tygervalley\']')"
10000 loops, best of 3: 133 usec per loop

You can experiment with the exact methods you call; there is always more than one way to do each search.

Edit: note how things change on a file 100 times longer:

$ python -m timeit "any('tygervalley' in line for line in open('t.xml'))"
100000 loops, best of 3: 20.8 usec per loop

$ python -m timeit "import re" "for line in open('t.xml'):" "    re.findall('tygervalley', line)"
1000 loops, best of 3: 252 usec per loop

$ python -m timeit "from lxml.etree import parse" "tree = parse('t.xml')" "tree.xpath('//*[text()=\'tygervalley\']')"
1000 loops, best of 3: 1.34 msec per loop

Be careful interpreting the results :) The three commands are not doing the same amount of work.
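A sketch of why the first command barely slows down on the longer file: `any()` stops at the first matching line, while a loop calling `re.findall` visits every line. The in-memory line list, the helper names, and the position of the match are assumptions for illustration, not the test file used above.

```python
import re
import timeit

# Hypothetical data: one matching line followed by 1000 non-matching lines.
LINES = ["<place>tygervalley</place>\n"] + ["<person1>dave jones</person1>\n"] * 1000


def counted(lines, counter):
    # Yield lines while recording how many were actually consumed.
    for line in lines:
        counter[0] += 1
        yield line


def substring_scan(lines, term):
    # any() returns as soon as one line contains the term.
    return any(term in line for line in lines)


def regex_scan(lines, term):
    # This loop calls re.findall on every line, match or not.
    pattern = re.escape(term)
    return [hit for line in lines for hit in re.findall(pattern, line)]


def seconds_per_call(func, *args, number=100):
    # Best of three runs, divided by the number of calls per run.
    return min(timeit.repeat(lambda: func(*args), number=number, repeat=3)) / number


seen_by_any, seen_by_regex = [0], [0]
substring_scan(counted(LINES, seen_by_any), "tygervalley")
regex_scan(counted(LINES, seen_by_regex), "tygervalley")
print(seen_by_any[0], seen_by_regex[0])  # 1 1001
```

Calling `seconds_per_call(substring_scan, LINES, "tygervalley")` and `seconds_per_call(regex_scan, LINES, "tygervalley")` then compares the two on equal data; moving the matching line to the end of `LINES` makes the comparison fairer.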

Answered 2012-05-18T09:04:18.690

I tried comparing regexp and lxml on an XML file that was not large, and there was no big difference between the two.

Answered 2012-05-18T08:47:22.313