python - re.match vs. re.search performance difference

Question

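I tried to compare re.match and re.search using the timeit module, and I found that match was faster than search when the string I wanted to find was at the beginning of the string.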

>>> s1 = '''
... import re
... re.search(r'hello','helloab'*100000)
... '''
>>> timeit.timeit(stmt=s1,number=10000)
32.12064480781555


>>> s = '''
... import re
... re.match(r'hello','helloab'*100000)
... '''
>>> timeit.timeit(stmt=s,number=10000)
30.9136700630188

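Now, I know that match looks for the pattern at the beginning of the string and returns an object if it is found, but what I want to know is how search operates.

Does search perform any additional matching after the string is found at the beginning, which would slow it down?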

Update

After using @David Robinson's code, I got results similar to his.

>>> print timeit.timeit(stmt="r.match('hello')",
...              setup="import re; s = 'helloab'*100000; r = re.compile('hello')",
...              number = 10000000)
49.9567620754
>>> print timeit.timeit(stmt="r.search('hello')",
...              setup="import re; s = 'helloab'*100000; r = re.compile('hello')",
...             number = 10000000)
35.6694438457

So the updated question is: why does search outperform match?

score 15 · Accepted Answer

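"So, the updated question now is why is search out-performing match?"

In this particular instance, where a literal string is used rather than a regular-expression pattern, re.search is indeed slightly faster than re.match in the default CPython implementation (I have not tested this in other versions of Python).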

>>> print timeit.timeit(stmt="r.match(s)",
...              setup="import re; s = 'helloab'*100000; r = re.compile('hello')",
...              number = 10000000)
3.29107403755
>>> print timeit.timeit(stmt="r.search(s)",
...              setup="import re; s = 'helloab'*100000; r = re.compile('hello')",
...             number = 10000000)
2.39184308052

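Looking at the C code behind these modules, the search code appears to have a built-in optimization for quickly matching patterns that are prefixed with a string literal. In the example above, the whole pattern is a literal string with no regex metacharacters, so this optimized routine is used to match the entire pattern.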
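The idea behind that optimization can be illustrated in pure Python. This is only a rough sketch of the concept, not the actual CPython C code: the helper name `search_with_literal_prefix` and its explicit `prefix` argument are made up for illustration. It uses `str.find` to jump straight to positions where the literal prefix occurs, and only then runs the anchored matcher at that position.

```python
import re


def search_with_literal_prefix(pattern, prefix, s):
    """Hypothetical helper: emulate a search by scanning for a literal prefix.

    `prefix` is assumed to be literal text that every match of `pattern`
    must start with (for example 'hel' for the pattern 'hel.o').
    """
    compiled = re.compile(pattern)
    pos = s.find(prefix)
    while pos != -1:
        # Try an anchored match at the candidate position only.
        m = compiled.match(s, pos)
        if m is not None:
            return m
        # The candidate failed; look for the next occurrence of the prefix.
        pos = s.find(prefix, pos + 1)
    return None


text = "help! helxo world"
m = search_with_literal_prefix(r"hel.o", "hel", text)
print(m.span() if m else None)
print(re.search(r"hel.o", text).span())  # same span as the line above
```

The longer the literal prefix, the fewer candidate positions need the full matcher, which fits the timings reported here for progressively shorter prefixes.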

Notice how the performance degrades once we introduce regex symbols, and degrades further as the literal string prefix gets shorter:

>>> print timeit.timeit(stmt="r.search(s)",
...              setup="import re; s = 'helloab'*100000; r = re.compile('hell.')",
...             number = 10000000)
3.20765399933
>>> print timeit.timeit(stmt="r.search(s)",
...              setup="import re; s = 'helloab'*100000; r = re.compile('hel.o')",
...             number = 10000000)
3.31512498856
>>> print timeit.timeit(stmt="r.search(s)",
...              setup="import re; s = 'helloab'*100000; r = re.compile('he.lo')",
...             number = 10000000)
3.31983995438
>>> print timeit.timeit(stmt="r.search(s)",
...              setup="import re; s = 'helloab'*100000; r = re.compile('h.llo')",
...             number = 10000000)
3.39261603355

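For the portion of a pattern that contains regex metacharacters, SRE_MATCH is used to determine matches. That is essentially the same code that sits behind re.match.

Note how close the results are (with re.match marginally faster) when the pattern starts with a regex metacharacter instead of a literal string.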

>>> print timeit.timeit(stmt="r.match(s)",
...              setup="import re; s = 'helloab'*100000; r = re.compile('.ello')",
...              number = 10000000)
3.22782492638
>>> print timeit.timeit(stmt="r.search(s)",
...              setup="import re; s = 'helloab'*100000; r = re.compile('.ello')",
...             number = 10000000)
3.31773591042

In other words, ignoring the fact that search and match have different purposes, re.search is faster than re.match only when the pattern is a literal string.
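As a reminder of that difference in purpose, here is a minimal example (the sample string `text` is just an illustration): match only attempts the pattern at the start of the string, while search scans forward for the first position where it matches.

```python
import re

text = "say hello"

# match only tries the pattern at index 0, so it fails here.
print(re.match(r"hello", text))           # None

# search scans forward and reports the first matching position.
print(re.search(r"hello", text).span())   # (4, 9)

# When the pattern occurs at index 0, both return the same span.
same = re.match(r"hello", "helloab").span() == re.search(r"hello", "helloab").span()
print(same)                               # True
```

So the two calls are only interchangeable when the match is known to sit at the very start of the string, which is the case in the benchmarks here.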

Of course, if you are working with literal strings, you are likely better off using string operations instead.

>>> # Detecting exact matches
>>> print timeit.timeit(stmt="s == r", 
...              setup="s = 'helloab'*100000; r = 'hello'", 
...              number = 10000000)
0.339027881622

>>> # Determine if string contains another string (note: containment of r in s is spelled "r in s"; the statement timed below tests the reverse)
>>> print timeit.timeit(stmt="s in r", 
...              setup="s = 'helloab'*100000; r = 'hello'", 
...              number = 10000000)
0.479326963425


>>> # detecting prefix
>>> print timeit.timeit(stmt="s.startswith(r)",
...              setup="s = 'helloab'*100000; r = 'hello'",
...             number = 10000000)
1.49393510818
>>> print timeit.timeit(stmt="s[:len(r)] == r",
...              setup="s = 'helloab'*100000; r = 'hello'",
...             number = 10000000)
1.21005606651
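For reference, a small sketch of the same string operations with their results, using the same `s` and `r` as the timings above. Note that a containment test is written needle-first, so checking whether `s` contains `r` is `r in s`.

```python
s = "helloab" * 100000
r = "hello"

is_equal = (s == r)                 # exact match
contains = (r in s)                 # does s contain r?
has_prefix = s.startswith(r)        # prefix test
has_prefix_slice = (s[:len(r)] == r)  # prefix test via slicing

print(is_equal)          # False, s is much longer than r
print(contains)          # True
print(has_prefix)        # True
print(has_prefix_slice)  # True
```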

score 6 · Accepted Answer

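On my machine (Python 2.7.3 on Mac OS 10.7.3, 1.7 GHz Intel Core i5), when the string construction, the import of re, and the regex compilation are all done in the setup, and 10000000 iterations are performed instead of 10, I find the opposite: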

import timeit

print timeit.timeit(stmt="r.match(s)",
             setup="import re; s = 'helloab'*100000; r = re.compile('hello')",
             number = 10000000)
# 6.43165612221
print timeit.timeit(stmt="r.search(s)",
             setup="import re; s = 'helloab'*100000; r = re.compile('hello')",
            number = 10000000)
# 3.85176897049
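The snippet above is Python 2. A possible Python 3 version of the same measurement is sketched below; the `bench` helper and the `SETUP` template are names invented for this example, and the smaller default iteration count is an arbitrary choice to keep the run short. It keeps the import, string construction, and compilation in the setup so that only the call itself is timed, and takes the best of a few repeats to reduce noise.

```python
import timeit

# Setup runs once per repeat and is excluded from the measured time.
SETUP = "import re; s = 'helloab' * 100000; r = re.compile({pattern!r})"


def bench(method, pattern, number=100000, repeat=3):
    """Return the best-of-`repeat` time in seconds for r.<method>(s)."""
    times = timeit.repeat(
        stmt="r.{}(s)".format(method),
        setup=SETUP.format(pattern=pattern),
        number=number,
        repeat=repeat,
    )
    return min(times)


for pattern in ("hello", "hell.", ".ello"):
    for method in ("match", "search"):
        print(pattern, method, bench(method, pattern))
```

Absolute numbers will vary with the machine and the Python version, so only the relative ordering within one run is meaningful.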
