python - Fast way to search for a string in a large text file using Python

Question

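This is my current situation:

I have a 2.5MB text file containing about 250k strings, sorted alphabetically
Each string is unique
I don't need to modify the entries in the text file: once the text file is loaded, it is never edited
The text file is loaded at startup, and after that I only need to search for strings in it

The last point is the problem. I actually need to search for both exact matches and partial matches of a string. The algorithm I wrote simply used regular expressions, combined with some attempts to speed things up: for example, I hard-coded into my script the indexes of the dictionary entries where each letter of the alphabet begins, and then split the large text file dictionary into 26 smaller dictionaries. That was totally useless; the script is still incredibly slow. Skimming some posts here, I was persuaded to try mmap, but it looks useless for finding all the partial matches given a regular expression. Eventually I came to the conclusion that a trie may solve my problem, though I hardly know what that is. Should I go with a trie? If so, how should I go about creating a trie in Python? Is the marisa-trie module good? Thanks, everybody

EDIT: By "partial match" I mean that I have the prefix of a string. I don't need matches at the end or in the middle, only at the beginning.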

score 5 · Accepted Answer

The simplest and fastest solution:

#!/usr/bin/env python3

# Maps every prefix of every line to True. A full line is its own longest
# prefix, so exact matches are covered as well.
d = {}

# open your file here, I'm using /etc/hosts as an example...
with open("/etc/hosts", "r") as f:
    for line in f:
        line = line.rstrip()
        for i in range(1, len(line) + 1):
            d[line[:i]] = True

while True:
    w = input('> ')
    if not w:
        break

    if w in d:
        print("match found", w)

Here is a slightly more complex one that is more memory-efficient:

#!/usr/bin/env python3

d = []

def binary_search(a, x, lo=0, hi=None):
    """Return an index of x in the sorted list a, or -1 if x is absent."""
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        midval = a[mid]
        if midval < x:
            lo = mid + 1
        elif midval > x:
            hi = mid
        else:
            return mid
    return -1


# Store only the hash of every prefix instead of the prefix itself.
# Note: a hash collision could therefore report a false match.
with open("/etc/hosts", "r") as f:
    for line in f:
        line = line.rstrip()
        for i in range(1, len(line) + 1):
            d.append(hash(line[:i]))

d.sort()

while True:
    w = input('> ')
    if not w:
        break

    if binary_search(d, hash(w)) != -1:
        print("match found", w)

score 2 · Accepted Answer

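Since the file is already sorted and read in, you can use binary search on it without resorting to any fancy data structures. Python's standard library has a binary search function, bisect.bisect_left.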
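A minimal sketch of how a prefix lookup could be built on bisect.bisect_left, assuming the strings are already in a Python list sorted in Python's default (code point) order. The helper name prefix_matches and the sample words are illustrative and do not come from the question.

```python
import bisect

def prefix_matches(sorted_words, prefix):
    """Return every entry of sorted_words that starts with prefix."""
    if not prefix:
        return list(sorted_words)
    # First position whose entry is >= prefix.
    lo = bisect.bisect_left(sorted_words, prefix)
    # Smallest string that sorts after every string starting with prefix:
    # bump the last character by one code point (assumes it is not the
    # highest possible code point).
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    hi = bisect.bisect_left(sorted_words, upper)
    return sorted_words[lo:hi]

words = sorted(["hai", "hello", "help", "test", "testing", "world"])
print(prefix_matches(words, "he"))
print(prefix_matches(words, "x"))
```

Both boundaries are found with a binary search, so the cost is two O(log n) lookups plus the size of the returned slice.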

score 1 · Accepted Answer

Use a trie.

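# dictionary is a list of words
def parse_dictionary(dictionary):
    dictionary_trie = {}
    for word in dictionary:
        tmp_trie = dictionary_trie
        for letter in word:
            # Create the child node on first use. Each node keeps, under the
            # 'words' key, every word that passes through it.
            if letter not in tmp_trie:
                tmp_trie[letter] = {}
            if 'words' not in tmp_trie[letter]:
                tmp_trie[letter]['words'] = []

            tmp_trie[letter]['words'].append(word)
            tmp_trie = tmp_trie[letter]
    return dictionary_trie

def matches(substring, trie):
    # Walk down one node per letter; a missing child means no word
    # starts with this prefix.
    d = trie
    for letter in substring:
        try:
            d = d[letter]
        except KeyError:
            return []
    # The root node has no 'words' list, so an empty prefix returns [].
    return d.get('words', [])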

Usage example:

>>> import pprint
>>> dictionary = ['test', 'testing', 'hello', 'world', 'hai']
>>> trie = parse_dictionary(dictionary)
>>> pprint.pprint(trie)
{'h': {'a': {'i': {'words': ['hai']}, 'words': ['hai']},
       'e': {'l': {'l': {'o': {'words': ['hello']}, 'words': ['hello']},
                   'words': ['hello']},
             'words': ['hello']},
       'words': ['hello', 'hai']},
 't': {'e': {'s': {'t': {'i': {'n': {'g': {'words': ['testing']},
                                     'words': ['testing']},
                               'words': ['testing']},
                         'words': ['test', 'testing']},
                   'words': ['test', 'testing']},
             'words': ['test', 'testing']},
       'words': ['test', 'testing']},
 'w': {'o': {'r': {'l': {'d': {'words': ['world']}, 'words': ['world']},
                   'words': ['world']},
             'words': ['world']},
       'words': ['world']}}
>>> matches('h', trie)
['hello', 'hai']
>>> matches('he', trie)
['hello']
>>> matches('asd', trie)
[]
>>> matches('test', trie)
['test', 'testing']
>>>

score 0 · Accepted Answer

So to explain arainchi's very nice answer, make a dictionary with an entry for every line in your file. Then you can match your search string against the names of those entries. Dictionaries are really handy for this kind of searching.

score 0 · Accepted Answer

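Using a trie still requires you to build the trie, which is O(n) because it iterates over the whole file; taking advantage of the sorting makes each lookup O(log_2 n). So this faster solution would use a binary search (see below).

This solution still requires you to read in the entire file. In an even faster solution, you could pre-process the file and pad all the lines so that they have the same length (or build some kind of index structure in the file that makes seeking to the middle of the list feasible); seeking to the middle of the file would then take you to the middle of the list. The "even faster" solution is probably only needed for really, really large files (gigabytes or hundreds of megabytes). You would combine it with the binary search.

Possibly, if the file system supports sparse files, the padding scheme above would not increase the number of blocks the file actually uses on disk.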
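The padding idea could be sketched roughly as follows, assuming ASCII strings, space padding, and a hypothetical record width of 16 bytes (RECORD_SIZE, write_padded and contains are illustrative names, not part of the answer). Because every record has the same length, record number i starts at byte i * RECORD_SIZE, so the binary search can seek directly to it instead of loading the file.

```python
import os
import tempfile

RECORD_SIZE = 16  # assumed fixed width in bytes, including the trailing newline

def write_padded(path, sorted_words):
    """Write each word padded with spaces to exactly RECORD_SIZE bytes."""
    with open(path, "wb") as f:
        for word in sorted_words:
            data = word.encode("ascii")
            if len(data) >= RECORD_SIZE:
                raise ValueError("word too long for record: %r" % word)
            f.write(data.ljust(RECORD_SIZE - 1) + b"\n")

def contains(path, word):
    """Binary search the padded file by seeking, without reading it all."""
    target = word.encode("ascii")
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path) // RECORD_SIZE
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD_SIZE)
            # Strip the padding and newline before comparing.
            record = f.read(RECORD_SIZE).rstrip()
            if record < target:
                lo = mid + 1
            elif record > target:
                hi = mid
            else:
                return True
    return False

path = os.path.join(tempfile.mkdtemp(), "words.dat")
write_padded(path, sorted(["hai", "hello", "test", "testing", "world"]))
print(contains(path, "testing"))
print(contains(path, "tes"))
```

Each lookup reads only about log_2 n records from disk, which is the property that matters for very large files; for a 2.5MB file, loading everything into memory is likely simpler.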

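Then, at that point, you are probably approaching a b-tree or b+tree implementation to make the indexing efficient. So you could use a b-tree library.

Something like this: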

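import bisect

entries = ["a", "b", "c", "cc", "cd", "ce", "d", "e", "f"]

def find_matches(ls, m):
    # Binary search for the first entry >= m; prefix matches, if any,
    # start at that position.
    index = bisect.bisect_left(ls, m)
    matches = []

    # The bounds check avoids an IndexError when the matches run to the
    # end of the list (for example when searching for "f").
    while index < len(ls) and ls[index].startswith(m):
        matches.append(ls[index])
        index += 1

    return matches

print(find_matches(entries, "c"))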

Output:

>>> ['c', 'cc', 'cd', 'ce']

score 0 · Accepted Answer

You could make a list, with each line as one element of the list, and do a binary search on it.
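One possible shape for this, as a sketch for exact matches only: it assumes a UTF-8 file with one string per line, already sorted in Python's default string order. The names load_lines and contains and the temporary words.txt file are made up for the demonstration.

```python
import bisect
import os
import tempfile

def load_lines(path):
    """Read the file into a list, one line per element, newline removed."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

def contains(sorted_lines, target):
    """Exact-match lookup by binary search; the list must be sorted."""
    i = bisect.bisect_left(sorted_lines, target)
    return i < len(sorted_lines) and sorted_lines[i] == target

# Demo with a small temporary file standing in for the real word list.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "words.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("hai\nhello\ntest\ntesting\nworld\n")
    lines = load_lines(path)

print(contains(lines, "hello"))
print(contains(lines, "hell"))
```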

answered 2013-02-22T23:14:43.963
