python - Improving the performance of Python regular expressions

Question

I'm trying to improve the following regular expressions:

urlpath=columns[4].strip()
urlpath=re.sub("(\?.*|\/[0-9a-f]{24})","",urlpath)
urlpath=re.sub("\/[0-9\/]*","/",urlpath)
urlpath=re.sub("\;.*","",urlpath)
urlpath=re.sub("\/",".",urlpath)
urlpath=re.sub("\.api","api",urlpath)
if urlpath in dlatency:

This converts a URL like this:

/api/v4/path/apiCallTwo?host=wApp&trackId=1347158

to

api.v4.path.apiCallTwo

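I'd like to improve the performance of these regular expressions, because this script runs every 5 minutes over roughly 50,000 files and takes about 40 seconds in total.

Thank you.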

score 2 · Accepted Answer

A one-liner with urlparse:

urlpath = urlparse.urlsplit(url).path.strip('/').replace('/', '.')
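
The `urlparse` module exists only in Python 2; in Python 3 the same function lives in `urllib.parse`. Below is a minimal sketch of the same one-liner under that assumption. The helper name `url_to_key` is made up for illustration.

```python
from urllib.parse import urlsplit


def url_to_key(url):
    # urlsplit separates the path from the query string and fragment,
    # so no regular expression is needed to drop "?host=...".
    return urlsplit(url).path.strip("/").replace("/", ".")


print(url_to_key("/api/v4/path/apiCallTwo?host=wApp&trackId=1347158"))
# prints: api.v4.path.apiCallTwo
```

Unlike the regex chain in the question, this does not remove numeric segments, 24-character hex segments, or anything after a `;` in the path.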

score 2 · Accepted Answer

Here is my one-liner solution (edited).

urlpath.partition("?")[0].strip("/").replace("/", ".")

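As some others have mentioned, the speed improvement here is negligible. Apart from using re.compile() to precompile your expressions, I would start looking elsewhere.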

import re

# Precompile the five original patterns once, outside the hot loop.
re1 = re.compile(r"(\?.*|\/[0-9a-f]{24})")
re2 = re.compile(r"\/[0-9\/]*")
re3 = re.compile(r"\;.*")
re4 = re.compile(r"\/")
re5 = re.compile(r"\.api")

def orig_regex(urlpath):
    urlpath = re1.sub("", urlpath)
    urlpath = re2.sub("/", urlpath)
    urlpath = re3.sub("", urlpath)
    urlpath = re4.sub(".", urlpath)
    urlpath = re5.sub("api", urlpath)
    return urlpath

# Single regex: join every run of non-slash characters found before the "?".
myregex = re.compile(r"([^/]+)")

def my_regex(urlpath):
    return ".".join(x.group() for x in myregex.finditer(urlpath.partition('?')[0]))

# No regex at all, only string methods.
def non_regex(urlpath):
    return urlpath.partition("?")[0].strip("/").replace("/", ".")

def test_func(func, iterations, *args, **kwargs):
    for i in range(iterations):
        func(*args, **kwargs)

if __name__ == "__main__":
    import cProfile as profile

    urlpath = u'/api/v4/path/apiCallTwo?host=wApp&trackId=1347158'
    profile.run("test_func(orig_regex, 10000, urlpath)")
    profile.run("test_func(my_regex, 10000, urlpath)")
    profile.run("test_func(non_regex, 10000, urlpath)")

Results

Iterating orig_regex 10000 times
     60003 function calls in 0.108 CPU seconds

....

Iterating my_regex 10000 times
     130003 function calls in 0.087 CPU seconds

....

Iterating non_regex 10000 times
     40003 function calls in 0.019 CPU seconds

Result for your five regular expressions without re.compile:

running <function orig_regex at 0x100532050> 10000 times
     210817 function calls (210794 primitive calls) in 0.208 CPU seconds
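
The comparison above can also be reproduced with `timeit`, which measures wall-clock time without the per-call overhead of a profiler. The following is only a sketch: the function names are hypothetical, only the query-string step is compared, and the timings will vary by machine and Python version.

```python
import re
import timeit

URL = "/api/v4/path/apiCallTwo?host=wApp&trackId=1347158"
QUERY = re.compile(r"\?.*")


def with_module_sub(urlpath):
    # Pattern passed as a string on every call; re looks it up in its cache.
    return re.sub(r"\?.*", "", urlpath).strip("/").replace("/", ".")


def with_compiled(urlpath):
    # Pattern object created once at import time and reused.
    return QUERY.sub("", urlpath).strip("/").replace("/", ".")


def with_str_methods(urlpath):
    # No regular expression at all.
    return urlpath.partition("?")[0].strip("/").replace("/", ".")


if __name__ == "__main__":
    for func in (with_module_sub, with_compiled, with_str_methods):
        seconds = timeit.timeit(lambda: func(URL), number=10000)
        print("%-18s %.4f s" % (func.__name__, seconds))
```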

score 2 · Accepted Answer

Try this:

s = '/api/v4/path/apiCallTwo?host=wApp&trackId=1347158'
re.sub(r'\?.+', '', s).replace('/', '.')[1:]
> 'api.v4.path.apiCallTwo'

For better performance, compile the regular expression once and reuse it, like this:

regexp = re.compile(r'\?.+')
s = '/api/v4/path/apiCallTwo?host=wApp&trackId=1347158'

# `s` changes, but you can reuse `regexp` as many times as needed
regexp.sub('', s).replace('/', '.')[1:]

A simpler approach, without regular expressions:

s[1:s.index('?')].replace('/', '.')
> 'api.v4.path.apiCallTwo'
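
Note that `s.index('?')` raises `ValueError` when the URL has no query string. Here is a small sketch of a variant that tolerates both cases by using `str.partition`; the function name `strip_query` is hypothetical.

```python
def strip_query(s):
    # partition returns the whole string as the first element when "?"
    # is absent, so URLs without a query string are handled too.
    return s.partition("?")[0].strip("/").replace("/", ".")


print(strip_query("/api/v4/path/apiCallTwo?host=wApp&trackId=1347158"))
print(strip_query("/api/v4/path/apiCallTwo"))
# both print: api.v4.path.apiCallTwo
```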

score 1 · Accepted Answer

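Going through it line by line:

You aren't capturing or grouping, so the ( and ) are not needed, and / is not a special character in Python regular expressions, so it doesn't need escaping: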

urlpath = re.sub(r"\?.*|/[0-9a-f]{24}", "", urlpath)

Replacing a / followed by zero repetitions with a / is pointless, so use + instead of *:

urlpath = re.sub("/[0-9/]+", "/", urlpath)

Using a string method is faster for removing a fixed character and everything after it:

urlpath = urlpath.partition(";")[0]

Using a string method is also faster for replacing one fixed string with another:

urlpath = urlpath.replace("/", ".")

urlpath = urlpath.replace(".api", "api")
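
Put together, the five steps above might look like the sketch below. The function and constant names are illustrative, the sample URL is invented, and precompiling the two remaining patterns is an added assumption borrowed from the other answers.

```python
import re

# Compiled once at import time; the names are illustrative only.
QUERY_OR_ID = re.compile(r"\?.*|/[0-9a-f]{24}")
NUMERIC_SEGMENTS = re.compile(r"/[0-9/]+")


def normalize(urlpath):
    urlpath = QUERY_OR_ID.sub("", urlpath)        # drop query string and 24-hex-digit IDs
    urlpath = NUMERIC_SEGMENTS.sub("/", urlpath)  # collapse numeric segments to "/"
    urlpath = urlpath.partition(";")[0]           # drop ";" and everything after it
    urlpath = urlpath.replace("/", ".")
    return urlpath.replace(".api", "api")


print(normalize("/api/v4/users/0123456789abcdef01234567/orders/42;jsessionid=abc?x=1"))
# prints: api.v4.users.orders.
```

As in the original chain, the last step removes the dot before every segment that begins with `api`, so `/api/v4/path/apiCallTwo` comes out as `api.v4.pathapiCallTwo`. If only the leading dot should go, something like `lstrip(".")` may be closer to the intent.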

score 0 · Accepted Answer

Are you sure you need regular expressions?
I.e.,

urlpath = columns[4].strip()
urlpath = urlpath.split("?")[0]
urlpath = urlpath.replace("/", ".")

score 0 · Accepted Answer

You can also compile the regular expressions to get a performance boost,

for example:

compiled_re_for_words = re.compile(r"\w+")
compiled_re_for_words.match("test")
