Why not use a regular expression?
1)
Using lxml is slower than using a regular expression, as the following timing shows.
# Python 2: time lxml's iterparse
from time import clock
import StringIO
from lxml import etree

times1 = []
for i in xrange(1000):
    data = StringIO.StringIO('<root ><a>One</a><a>Two</a><a>Three\nlittle pigs</a><b>Four</b><a>another</a></root>')
    te = clock()
    # Only the iterparse() call is timed; the elements are produced lazily when docs is iterated below
    docs = etree.iterparse(data, tag='a')
    tf = clock()
    times1.append(tf - te)
print min(times1)  # best of the 1000 runs
print [etree.tostring(y) for x, y in docs]
# Python 2: time the regular expression on the same input
import re

regx = re.compile(r'<a>[\s\S]*?</a>')  # non-greedy, so each match stops at the first </a>
times2 = []
for i in xrange(1000):
    data = StringIO.StringIO('<root ><a>One</a><a>Two</a><a>Three\nlittle pigs</a><b>Four</b><a>another</a></root>')
    te = clock()
    li = regx.findall(data.read())
    tf = clock()
    times2.append(tf - te)
print min(times2)  # best of the 1000 runs
print li
Result
0.000150298431784
['<a>One</a>', '<a>Two</a>', '<a>Three\nlittle pigs</a>', '<a>another</a>']
2.40253998762e-05
['<a>One</a>', '<a>Two</a>', '<a>Three\nlittle pigs</a>', '<a>another</a>']
0.000150298431784 / 2.40253998762e-05 is about 6.26,
so in this test lxml is roughly 6.26 times slower than the regular expression.
2)
Namespaces are not a problem:
import StringIO
import re
regx = re.compile(r'<a>[\s\S]*?</a>')
data = StringIO.StringIO('<root xmlns="http://some.random.schema"><a>One</a><a>Two</a><a>Three\nlittle pigs</a><b>Four</b><a>another</a></root>')
print regx.findall(data.read())
Result
['<a>One</a>', '<a>Two</a>', '<a>Three\nlittle pigs</a>', '<a>another</a>']
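To illustrate why the namespace matters for a parser but not for the regular expression, here is a small Python 3 sketch. It uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml (an assumption made so the example runs without third-party packages); the variable names are illustrative only. With a default namespace on the root, the parser reports each tag as `{http://some.random.schema}a`, so a lookup by the bare name `a` finds nothing, while the regular expression matches the literal text and is unaffected.

```python
import re
import xml.etree.ElementTree as ET  # stand-in for lxml in this sketch

NS = 'http://some.random.schema'
XML = ('<root xmlns="%s"><a>One</a><a>Two</a><a>Three\nlittle pigs</a>'
       '<b>Four</b><a>another</a></root>' % NS)

root = ET.fromstring(XML)

# The bare tag name does not match, because every element is namespace-qualified
unqualified = root.findall('a')

# The parser needs the qualified name in {namespace}tag form
qualified = root.findall('{%s}a' % NS)

# The regular expression only looks at the literal text
regex_hits = re.findall(r'<a>[\s\S]*?</a>', XML)

print(len(unqualified), len(qualified), len(regex_hits))  # prints: 0 4 4
```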