python - What is a more efficient way of looping with regex?

Question

I have a list of names that I use to pull matching entries out of a list of target strings. For example:

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim','Christmas is here', 'CHRIS']

output = ['Chris Smith', 'Kim', 'CHRIS']

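So the rules so far are:

Case-insensitive
No partial-word matches (i.e. Christmas/hijacked should not match Chris/Jack)
Other words in the string are fine, as long as a name is found in the string under the conditions above.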

To accomplish this, another SO user suggested this code in this thread:

[targ for targ in target_list if any(re.search(r'\b{}\b'.format(name), targ, re.I) for name in first_names)]

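So far this is very accurate, but it runs very slowly: the names list is about 5,000 entries long and the target list ranges from 20 to 100 lines, with some strings up to 30 characters long.

Any suggestions on how to improve performance here?

Solution: Both regex-based solutions ran into overflow errors, so unfortunately I couldn't test them. The solution that worked (from @mgilson's answer) was: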

new_names = set(name.lower() for name in names)
[ t for t in target if any(map(new_names.__contains__,t.lower().split())) ]

This improved performance dramatically, from 15 seconds to under 1 second.
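If the overflow errors came from the size of a single pattern built from all 5,000 names (an assumption, since no traceback is shown here), one possible workaround is to compile the names in smaller batches and test each target against every batch. This is only a sketch: the helper names compile_in_chunks and filter_targets and the chunk_size value are made up for illustration.

```python
import re

def compile_in_chunks(names, chunk_size=500):
    """Compile several smaller alternation patterns instead of one huge one."""
    patterns = []
    for start in range(0, len(names), chunk_size):
        chunk = names[start:start + chunk_size]
        # re.escape protects against names containing regex metacharacters.
        alternation = "|".join(re.escape(name) for name in chunk)
        patterns.append(re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE))
    return patterns

def filter_targets(targets, patterns):
    """Keep every target that matches at least one of the compiled patterns."""
    return [t for t in targets if any(p.search(t) for p in patterns)]

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim', 'Christmas is here', 'CHRIS']

# chunk_size=2 is artificially small so that this tiny example uses two patterns.
patterns = compile_in_chunks(names, chunk_size=2)
result = filter_targets(target, patterns)
print(result)
```

With a real 5,000-name list the chunk size would need tuning; fewer, larger chunks mean fewer searches per target string.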

score 5 · Accepted Answer

It seems like you could combine them all into one super regex:

import re

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim','Christmas is here', 'CHRIS']

regex_string = '|'.join(r"(?:\b"+re.escape(x)+r"\b)" for x in names)
print(regex_string)
regex = re.compile(regex_string,re.I)
print([t for t in target if regex.search(t)])

A non-regex solution, which only works if each name is a single word (no whitespace):

new_names = set(name.lower() for name in names)
[ t for t in target if any(map(new_names.__contains__,t.lower().split())) ]

The any expression could also be written as:

any(x in new_names for x in t.lower().split())

or

any(x.lower() in new_names for x in t.split())

Or, another variant that relies on set.intersection (suggested by @DSM below):

[ t for t in target if new_names.intersection(t.lower().split()) ]
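One thing to watch with the split-based variants: str.split() only splits on whitespace, so a name followed by punctuation ("Kim," or "CHRIS!") would not be found in the set. A possible tweak, sketched below, is to extract words with re.findall(r"\w+", ...) instead. The helper name words and the extra punctuation in the sample targets are my own additions for illustration.

```python
import re

names = ['Chris', 'Jack', 'Kim']
# Sample targets modified to include punctuation next to the names.
target = ['Chris Smith', 'I hijacked this thread', 'Kim, hello', 'Christmas is here', 'CHRIS!']

new_names = {name.lower() for name in names}

def words(text):
    """Return the lowercase runs of word characters in text, dropping punctuation."""
    return re.findall(r"\w+", text.lower())

# Whitespace splitting leaves 'kim,' and 'chris!' as tokens, which are not in the set.
split_based = [t for t in target if new_names.intersection(t.lower().split())]

# Word extraction strips the punctuation before the set lookup.
word_based = [t for t in target if new_names.intersection(words(t))]

print(split_based)
print(word_based)
```

This still uses a regex, but only one fixed pattern per target string, regardless of how many names there are.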

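If performance is really critical, you could profile to see which one performs best; otherwise pick the one you find easiest to read/understand.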
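A rough way to compare the variants is the standard timeit module. The sketch below wraps each approach in a function (the function names are mine) and times them on the small sample data; real numbers would have to come from your own 5,000-name list, so no timings are claimed here.

```python
import re
import timeit

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim', 'Christmas is here', 'CHRIS']

regex = re.compile('|'.join(r"(?:\b" + re.escape(x) + r"\b)" for x in names), re.I)
new_names = set(name.lower() for name in names)

def with_regex():
    return [t for t in target if regex.search(t)]

def with_any():
    return [t for t in target if any(x in new_names for x in t.lower().split())]

def with_intersection():
    return [t for t in target if new_names.intersection(t.lower().split())]

# Time each variant over the same number of repetitions.
for func in (with_regex, with_any, with_intersection):
    seconds = timeit.timeit(func, number=1000)
    print("{:<20} {:.4f}s".format(func.__name__, seconds))
```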

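*If you're using python2.x, you'll probably want to use itertools.imap instead of map if you go that route above, so that it evaluates lazily. It also makes me wonder whether python provides a lazy str.split that would perform on par with the non-lazy version...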
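As far as I know there is no built-in lazy str.split, but a generator over re.finditer gets close. In Python 3, map is already lazy, so only the splitting step builds a full list. The sketch below uses a made-up helper name, iter_split; whether it is actually faster than str.split on 30-character strings is something to measure, not assume.

```python
import re

def iter_split(text):
    """Yield whitespace-separated tokens one at a time instead of building a list."""
    for match in re.finditer(r"\S+", text):
        yield match.group()

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim', 'Christmas is here', 'CHRIS']
new_names = {name.lower() for name in names}

# any() stops pulling tokens from the generator as soon as one name is found.
result = [t for t in target
          if any(word in new_names for word in iter_split(t.lower()))]
print(result)
```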

score 4 · Accepted Answer

This is the simplest one I could come up with:

[item for item in target if re.search(r'\b(%s)\b' % '|'.join(names), item)]

All together:

import re

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim','Christmas is here', 'CHRIS']

results = [item for item in target if re.search(r'\b(%s)\b' % '|'.join(names), item)]

print(results)
>>> 
['Chris Smith', 'Kim']

For efficiency, you can compile the regex first.

regex = re.compile( r'\b(%s)\b' % '|'.join(names) )
[item for item in target if regex.search(item)]

Edit

After thinking about the problem and looking at some of the comments, I revised the "solution" to the following:

import re
names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim','Christmas is here', 'CHRIS']
regex = re.compile( r'\b((%s))\b' % ')|('.join([re.escape(name) for name in names]), re.I )
results = [item for item in target if regex.search(item)]

Result:

>>> 
['Chris Smith', 'Kim', 'CHRIS']
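The revised version adds re.escape and re.I. To illustrate why the escaping matters, here is a small sketch using a hypothetical name containing a dot, 'A.J', which is not in the original data. Without escaping, the dot acts as a wildcard and matches any character.

```python
import re

names = ['A.J']                       # hypothetical name with a regex metacharacter
target = ['A.J Smith', 'ABJ Smith']   # hypothetical targets

# Unescaped: '.' matches any character, so 'ABJ' is accepted as well.
unescaped = re.compile(r'\b(%s)\b' % '|'.join(names), re.I)

# Escaped: '.' only matches a literal dot.
escaped = re.compile(r'\b(%s)\b' % '|'.join(re.escape(n) for n in names), re.I)

loose = [t for t in target if unescaped.search(t)]
strict = [t for t in target if escaped.search(t)]

print(loose)
print(strict)
```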

score -1 · Accepted Answer

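You are currently running one loop inside another, iterating over two lists. That will always give you quadratic performance.

One local optimization is to compile each name regex (which will make applying each regex faster). However, the big win is to combine all of the regexes into a single regex, which you then apply to each item in the input. See @mgilson's answer for how to do that. After that, your code's performance should scale linearly as O(M+N), rather than O(M*N).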
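The two optimizations can be put side by side. This is only a sketch with variable names of my own choosing: the first version keeps the nested loop but precompiles one pattern per name, while the second applies a single combined pattern once per target. Both give the same result on the sample data; how much faster the combined pattern is in practice is worth measuring on real input.

```python
import re

names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim', 'Christmas is here', 'CHRIS']

# Local optimization: compile each name pattern once, but still loop over
# every (target, name) pair in the worst case.
per_name = [re.compile(r"\b" + re.escape(n) + r"\b", re.I) for n in names]
nested = [t for t in target if any(p.search(t) for p in per_name)]

# Combined pattern: one search call per target string.
combined = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b", re.I)
single_pass = [t for t in target if combined.search(t)]

print(nested)
print(single_pass)
```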
