python - Fastest way to search for multiple items in a body of text in Python

Question

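I have a long list of short strings, and I want to search a (usually) long string for all of them. My list is about 500 short strings long, and I want to use Python to find every item that occurs in a source text roughly 10,000 characters long.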

Here is a short example of my problem:

cleanText = "four score and seven years ago our fathers brought forth on this continent a new nation conceived in Liberty and dedicated to the proposition that all men are created equal"
searchList = ["years ago","dedicated to","civil war","brought forth"]

My current approach to finding the items of searchList in cleanText is:

found = [phrase for phrase in searchList if phrase in cleanText]

Is this the fastest way to do it in Python? It is not terribly slow, but at scale (500 items in searchList and a cleanText of 10,000 characters) it seems a bit slower than I would like.

score 7 · Accepted Answer

You could try a regular expression. This may speed things up for a large list:

import re
found = re.findall('|'.join(searchList),cleanText)

(Of course, this assumes that nothing in searchList needs to be escaped for re.)

As pointed out in the comments (thanks, anijhaw), you can escape the items like this:

found = re.findall('|'.join(re.escape(x) for x in searchList), cleanText)
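To illustrate why the escaping matters, here is a minimal sketch. The phrases and text are hypothetical and chosen only because they contain regex metacharacters; they are not from the question.

```python
import re

# Hypothetical inputs (not from the question): phrases with regex metacharacters.
search_list = ["a.b", "c++"]
text = "we compared axb with c++ and a.b"

# Unescaped, "." matches any character, so "axb" is reported as well.
unescaped = re.findall("a.b", text)

# Escaped, every phrase is matched literally.
pattern = "|".join(re.escape(p) for p in search_list)
escaped = re.findall(pattern, text)

print(unescaped)  # ['axb', 'a.b']
print(escaped)    # ['c++', 'a.b']
```

Note that the results come back in the order the matches occur in the text, not in the order of the search list.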

If you are going to use it many times, you can also precompile the regular expression with re.compile, for example:

regex = re.compile('|'.join(re.escape(x) for x in searchList))
found = regex.findall(cleanText)
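Two details may differ from the original list comprehension: findall returns one entry per occurrence (so a phrase that appears twice is listed twice), and an alternation tries its alternatives left to right, so a phrase that is a prefix of another can shadow the longer one. The following is a rough sketch of one way to handle both; the helper names build_matcher and find_phrases are made up for illustration, and it assumes the list contains at least one non-empty phrase.

```python
import re

def build_matcher(phrases):
    """Compile one alternation; longer phrases first so prefixes do not shadow them."""
    # Drop empty strings and duplicates, then sort by length, longest first.
    ordered = sorted({p for p in phrases if p}, key=len, reverse=True)
    return re.compile("|".join(re.escape(p) for p in ordered))

def find_phrases(regex, text):
    """Return each matched phrase once, in order of first appearance in text."""
    return list(dict.fromkeys(m.group(0) for m in regex.finditer(text)))

cleanText = ("four score and seven years ago our fathers brought forth on this "
             "continent a new nation conceived in Liberty and dedicated to the "
             "proposition that all men are created equal")
searchList = ["years ago", "dedicated to", "civil war", "brought forth"]

regex = build_matcher(searchList)      # compile once, reuse for many texts
found = find_phrases(regex, cleanText)
print(found)  # ['years ago', 'brought forth', 'dedicated to']
```

Whether this is actually faster than the list comprehension depends on the data, so it is worth timing both on a realistic input (for example with the timeit module) before switching.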

Disclaimer: these solutions only find non-overlapping matches.
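A small sketch of what that limitation looks like in practice, using a hypothetical phrase list in which two phrases overlap in the text. The lookahead variant at the end is one possible workaround, not part of the original answer, and it still reports at most one phrase per starting position.

```python
import re

text = "four score and seven years ago our fathers"
# Hypothetical list: "years ago" and "ago our" overlap on the word "ago".
phrases = ["years ago", "ago our", "civil war"]

alternation = "|".join(re.escape(p) for p in phrases)

# Plain alternation consumes "years ago", so "ago our" is never seen.
non_overlapping = re.findall(alternation, text)

# The substring test from the question checks each phrase independently.
by_membership = [p for p in phrases if p in text]

# A zero-width lookahead with a capture group consumes no characters,
# so matches starting inside an earlier match are still reported.
overlapping = re.findall("(?=(" + alternation + "))", text)

print(non_overlapping)  # ['years ago']
print(by_membership)    # ['years ago', 'ago our']
print(overlapping)      # ['years ago', 'ago our']
```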
