python - Fastest way to search for a word in an unindexed text file - Python

Question

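Consider a text file with 1.5 million lines and roughly 50-100 words per line.

To find the lines that contain a given word, using os.popen('grep -w word infile') seems to be faster than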

for line in infile:
    if word in line:
        # line already ends with a newline, so suppress print's own
        print(line, end="")

How else could one search for a word in a text file in Python? What is the fastest way to search through such a large unindexed text file?
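One detail worth noting when comparing the two approaches: grep -w matches whole words only, whereas word in line is a plain substring test, so the loop above also reports lines containing, say, "swordfish". A minimal sketch of a closer Python equivalent follows, assuming a precompiled regular expression with \b word boundaries is an acceptable approximation of -w. The function name find_word_lines and the sample data are made up for illustration.

```python
import re


def find_word_lines(lines, word):
    """Yield (line_number, line) for lines containing word as a whole word."""
    # re.escape guards against regex metacharacters in the word;
    # \b only approximates grep -w's notion of a word boundary.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    for number, line in enumerate(lines, start=1):
        if pattern.search(line):
            yield number, line


# Hypothetical sample data standing in for the real file.
sample = ["a word here\n", "swordfish only\n", "last word\n"]
for number, line in find_word_lines(sample, "word"):
    print(number, line, end="")
```

Since an open file object iterates over its lines, it can be passed in directly in place of the sample list. Compiling the pattern once outside the loop avoids repeating that work for each of the 1.5 million lines.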

score 2 · Accepted Answer

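There are several fast string-search algorithms (see Wikipedia). They require you to compile the word into some kind of structure first. Grep uses the Aho-Corasick algorithm for sets of fixed strings (GNU grep uses a Boyer-Moore variant when there is a single fixed string).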

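I haven't looked at the source code for Python's in operator, but either

word is compiled for every line, which takes time (I doubt that in compiles anything, although it obviously could compile the word, cache the result, and so on), or

the search is inefficient. Consider searching for "word" in "worword": you first check "worw" and fail, then you check "o" and fail, then "r" and fail, and so on. But there is no reason to recheck "o" or "r" if you are smart about it. The Knuth-Morris-Pratt algorithm, for example, builds a table from the searched word that tells it how many characters can be skipped when a mismatch occurs.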
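To make the table idea concrete, here is a minimal sketch of Knuth-Morris-Pratt in Python. The function names build_failure_table and kmp_find are chosen for illustration only. This is not how Python's in operator or grep is implemented.

```python
def build_failure_table(word):
    """table[i] = length of the longest proper prefix of word[:i+1]
    that is also a suffix of it."""
    table = [0] * len(word)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(word)):
        while k > 0 and word[i] != word[k]:
            k = table[k - 1]  # fall back to a shorter prefix
        if word[i] == word[k]:
            k += 1
        table[i] = k
    return table


def kmp_find(text, word):
    """Return the index of the first occurrence of word in text, or -1."""
    if not word:
        return 0
    table = build_failure_table(word)
    k = 0  # number of characters of word matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != word[k]:
            k = table[k - 1]  # reuse what is already known; text is never re-read
        if ch == word[k]:
            k += 1
        if k == len(word):
            return i - k + 1
    return -1


print(build_failure_table("abab"))   # [0, 0, 1, 2]
print(kmp_find("worword", "word"))   # 3
```

Each character of the text is examined once and the scan never moves backwards, which is the saving described above. This pure-Python version is meant to illustrate the technique rather than to beat the built-in in operator, which runs in C.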

score 1 · Accepted Answer

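I would suggest installing and using the_silver_searcher.

In my test, it searched a ~1 GB text file with ~29 million lines and found hundreds of entries for the search word in just 00h 00m 00.73s, i.e. in under a second!

Here is Python 3 code that uses it to search for a word and count how many times it was found: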

import subprocess

word = "some"
file = "/path/to/some/file.txt"

# -w matches whole words only; -c prints the number of matches instead of the lines
command = ["/usr/local/bin/ag", "-wc", word, file]
result = subprocess.run(command, stdout=subprocess.PIPE)
print("Found entries:", result.stdout.rstrip().decode('ascii'))

This version searches for a word and prints the line number plus the actual text of each line where the word was found:

import subprocess

word = "some"
file = "/path/to/some/file.txt"

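# -w matches whole words only
command = ["/usr/local/bin/ag", "-w", word, file]
output = subprocess.Popen(command, stdout=subprocess.PIPE)

# Stream the matches line by line instead of reading them all into memory;
# errors="replace" keeps non-ASCII text in the file from raising an exception
for line in output.stdout:
    print(line.rstrip().decode('utf-8', errors='replace'))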
