python - Python：并行编译正则表达式

Question

我有一个程序需要编译数千个大型正则表达式，所有这些都将被多次使用。问题是，它们花费的时间太长（根据cProfiler113 秒）re.compile()。（顺便说一句，编译后实际上使用所有这些正则表达式进行搜索 < 1.3 秒。）

如果我不预编译，它只是将问题推迟到我实际搜索时，因为re.search(expr, text)隐式编译expr. 实际上，情况更糟，因为re每次我使用它们时都会重新编译整个正则表达式列表。

我尝试使用multiprocessing，但这实际上减慢了速度。这是一个小测试来演示：

## rgxparallel.py ##
import re
import multiprocessing as mp

def serial_compile(strings):
    return [re.compile(s) for s in strings]

def parallel_compile(strings):
    print("Using {} processors.".format(mp.cpu_count()))
    pool = mp.Pool()
    result = pool.map(re.compile, strings)
    pool.close()
    return result

l = map(str, xrange(100000))

还有我的测试脚本：

#!/bin/sh
python -m timeit -n 1 -s "import rgxparallel as r" "r.serial_compile(r.l)"
python -m timeit -n 1 -s "import rgxparallel as r" "r.parallel_compile(r.l)"
# Output:
#   1 loops, best of 3: 6.49 sec per loop
#   Using 4 processors.
#   Using 4 processors.
#   Using 4 processors.
#   1 loops, best of 3: 9.81 sec per loop

我猜并行版本是：

同时，编译和酸洗正则表达式，约 2 秒
在串行中，取消酸洗，因此重新编译它们，~6.5 秒

加上启动和停止进程的开销，在 4 个处理器上比串行慢multiprocessing25%以上。

我还尝试将正则表达式列表划分为 4 个子列表，并对子列表进行pool.map-ing，而不是单个表达式。这给性能带来了小幅提升，但我仍然无法比串行慢 25% 以上。

有没有比串行编译更快的方法？

编辑： 更正了正则表达式编译的运行时间。

我也尝试过使用threading，但由于 GIL，只使用了一个处理器。它略好于multiprocessing（130 秒对 136 秒），但仍比串行（113 秒）慢。

编辑 2： 我意识到一些正则表达式可能会被复制，所以我添加了一个 dict 来缓存它们。这缩短了约 30 秒。不过，我仍然对并行化感兴趣。目标机器有 8 个处理器，这会将编译时间减少到约 15 秒。

score 0 · Accepted Answer

有比多处理更轻的解决方案来获得任务执行的异步性，比如线程和协程。尽管 python2 的同时运行能力有限，但 python3 在其基本类型中主要使用这种异步实现。只需使用 python3 运行您的代码，您就会看到不同之处：

$ python2 --version
Python 2.7.17
$ python2 -m timeit -n 1 -s "import rgxparallel as r" "r.serial_compile(r.l)"
1 loops, best of 3: 3.65 sec per loop
$ python -m timeit -n 1 -s "import multire as r" "r.parallel_compile(r.l)"
1 loops, best of 3: 3.95 sec per loop

$ python3 --version
Python 3.6.9
$ python3 -m timeit -n 1 -s "import multire as r" "r.serial_compile(r.l)"
1 loops, best of 3: 0.72 usec per loop
$ python3 -m timeit -n 1 -s "import multire as r" "r.parallel_compile(r.l)"
...
1 loops, best of 3: 4.51 msec per loop

不要忘记更改 python3 版本的xrangeby range。

score -1 · Accepted Answer

尽管我很喜欢 python，但我认为解决方案是在 perl 中执行（例如，请参阅此速度比较）或 C 等。

如果您想将主程序保留在 python 中，您可以使用subprocess调用 perl 脚本（只需确保在尽可能少的调用中传递尽可能多的值subprocess以避免开销。

python - Python：并行编译正则表达式

2 回答 2

Related

Reference