python - What is the best way to search for a large number of words in a large number of files?

Question

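I have about 5000 files, and I need to find which words from a list of 10000 words occur in each of them. My current code uses a (very) long regular expression to do this, but it is very slow.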

import re

wordlist = [...list of around 10000 english words...]
filelist = [...list of around 5000 filenames...]
wordlistre = re.compile('|'.join(wordlist), re.IGNORECASE)
discovered = []

for x in filelist:
    with open(x, 'r') as f:
        found = wordlistre.findall(f.read())
    if found:
        # Append, so earlier results are not overwritten on each match.
        discovered.append([x, found])

This checks files at a rate of about 5 files per second, which is much faster than doing it by hand, but still very slow. Is there a better way?

score 0 · Accepted Answer

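Without more information about your data, a couple of ideas are to use a dictionary instead of a list, and to reduce the amount of data that has to be searched/sorted. Also consider using re.split if your delimiters are not as clean as in the following: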

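wordlist = 'this|is|it|what|is|it'.split('|')
d_wordlist = {}

# Bucket the words by first letter so each lookup only checks a small set.
for word in wordlist:
    first_letter = word[0]
    d_wordlist.setdefault(first_letter, set()).add(word)

filelist = [...list of around 5000 filenames...]
discovered = {}

for x in filelist:
    with open(x, 'r') as f:
        # Split on whitespace; iterating over f.read() directly yields characters.
        for word in f.read().split():
            first_letter = word[0]
            # .get() avoids a KeyError for letters that start no word in the list.
            if word in d_wordlist.get(first_letter, ()):
                # setdefault() stores the new set in discovered; get() would discard it.
                discovered.setdefault(x, set()).add(word)

# discovered now maps each filename to the set of listed words found in it.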
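A variation on the same idea, as a rough sketch: since a Python set already gives constant-time membership tests, the first-letter buckets can be replaced by one set and a set intersection per file. The helper names find_words and search_files, the \w+ tokenization, and the UTF-8 encoding are assumptions for illustration. Note that this matches whole words only, whereas the regex in the question also matches substrings.

```python
import re

# Assumption: words are runs of "word" characters; adjust for your data.
TOKEN_RE = re.compile(r"\w+")

def find_words(text, wordset):
    """Return the words from wordset that occur as whole tokens in text."""
    tokens = {t.lower() for t in TOKEN_RE.findall(text)}
    return tokens & wordset

def search_files(filenames, words):
    """Map each filename to the set of listed words found in that file."""
    wordset = {w.lower() for w in words}  # lowercase for case-insensitive matching
    discovered = {}
    for name in filenames:
        # Assumption: files are text; undecodable bytes are replaced, not fatal.
        with open(name, "r", encoding="utf-8", errors="replace") as f:
            found = find_words(f.read(), wordset)
        if found:
            discovered[name] = found
    return discovered
```

Each file is tokenized once, so the cost per file depends on the file size rather than on the length of the word list.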

score 0 · Accepted Answer

If you have access to grep on the command line, you can try the following:

grep -i -f wordlist.txt -r DIRECTORY_OF_FILES

You will need to create a file, wordlist.txt, that contains all of your words (one word per line).

Any line in any of the files that matches any of your words will be printed to STDOUT in the following format:

<path/to/file>:<matching line>
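If the results are needed back in Python, the output format above can be grouped by file. This is a minimal sketch: parse_grep_output is a hypothetical helper, and it assumes that no file path contains a colon, since it splits each line at the first one.

```python
from collections import defaultdict

def parse_grep_output(output):
    """Group grep's '<path>:<matching line>' output by file path."""
    matches = defaultdict(list)
    for line in output.splitlines():
        # Split at the first colon only; the matching line may contain more.
        path, sep, text = line.partition(":")
        if sep:  # ignore lines that do not have the path:line shape
            matches[path].append(text)
    return dict(matches)
```

The output string could come from something like subprocess.run([...], capture_output=True, text=True).stdout, with the grep command above as the argument list.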

score 0 · Accepted Answer

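The Aho-Corasick algorithm was designed for exactly this kind of use, and it is implemented in fgrep on Unix. Under POSIX, the command grep -F is defined to perform this function.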

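It differs from regular grep in that it uses only fixed strings (not regular expressions) and is optimized for searching for a large number of strings.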
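For staying inside Python, the idea behind Aho-Corasick can be sketched in a few dozen lines. This is an illustrative pure-Python version, not the code that grep uses; the function names build_automaton and find_all are made up for this example, and it assumes the word list contains no empty strings. Lowercase both the words and the text first if you need case-insensitive matching.

```python
from collections import deque

def build_automaton(words):
    """Build trie transitions, failure links, and output sets for the words."""
    goto = [{}]     # goto[state][char] -> next state
    fail = [0]      # failure link for each state
    out = [set()]   # words recognized on reaching each state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(word)
    # Breadth-first pass: depth-1 states keep fail = 0, deeper ones are derived.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure state (proper suffixes).
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out

def find_all(text, automaton):
    """Return the set of words that occur anywhere in text (substring match)."""
    goto, fail, out = automaton
    found = set()
    state = 0
    for ch in text:
        # Follow failure links until a transition on ch exists or we hit the root.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found |= out[state]
    return found
```

The automaton is built once from the word list and then reused for every file, so each file is scanned in a single pass regardless of how many words are in the list.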

To run it on a large number of files, specify the exact files on the command line, or pass them in through xargs:

xargs -a filelist.txt grep -F -f wordlist.txt

What xargs does is fill the command line with as many files as it can, and run grep as many times as necessary:

grep -F -f wordlist.txt (files 1 through 2,500 maybe)
grep -F -f wordlist.txt (files 2,501 through 5,000)

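The exact number of files per invocation depends on the length of the individual file names and on the size of the ARG_MAX constant on your system.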
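The batching behaviour can be approximated in Python as follows. This is a simplified sketch, not how xargs is actually implemented: batch_by_length and max_bytes are hypothetical names, and the real limit also has to account for the environment and for the command itself.

```python
def batch_by_length(filenames, max_bytes):
    """Yield lists of filenames whose combined length stays within max_bytes."""
    batch, used = [], 0
    for name in filenames:
        cost = len(name.encode()) + 1  # +1 for the byte that terminates each argument
        # Start a new batch when this name would push the current one over budget.
        if batch and used + cost > max_bytes:
            yield batch
            batch, used = [], 0
        batch.append(name)
        used += cost
    if batch:
        yield batch
```

Each yielded batch would correspond to one grep invocation. On many Unix systems os.sysconf("SC_ARG_MAX") reports the limit, though a budget well below that value is the safer choice.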
