python - Fastest dictionary-like matching

Question

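I need to perform a spell-check-like operation in Python, as follows:

I have a huge list of words (let's call it the lexicon). I am now given some text (let's call it the sample). I have to look up each sample word in the lexicon. If I can't find it, that sample word is an error.

In short, a brute-force spell checker. However, searching the lexicon linearly for every sample word is bound to be slow. What is a better way to do this?

The complicating factor is that neither the sample nor the lexicon is in English. The language does not have 26 characters; it can have more than 300, stored in Unicode.

Any suggestions of algorithms, data structures, or parallelization methods would be helpful. A fast algorithm that trades away some accuracy would be perfect, since I don't need 100% accuracy. I know about Norvig's algorithm, but it seems to be English-specific.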

score 6 · Accepted Answer

You can use a set of Unicode strings:

s = set([u"rabbit", u"lamb", u"calf"])

and use the in operator to check whether a word occurs in it:

>>> u"rabbit" in s
True
>>> u"wolf" in s
False

This lookup is O(1) on average, so the size of the lexicon does not matter.
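To make the O(1) claim a bit more concrete, here is a rough sketch (Python 3 syntax) that times the same membership test against a list and against a set. The lexicon here is synthetic and the names `lexicon_list`, `lexicon_set` and `probe` are made up for illustration; the absolute numbers will vary by machine, so treat it as a way to measure on your own data rather than as a benchmark.

```python
import timeit

# Hypothetical lexicon: 200,000 synthetic "words" stand in for a real word list.
lexicon_list = ["word%d" % i for i in range(200000)]
lexicon_set = set(lexicon_list)

# A word that is not in the lexicon: the worst case for the list,
# because every entry has to be compared before giving up.
probe = "not-in-lexicon"

# Time 20 lookups of each kind.
list_time = timeit.timeit(lambda: probe in lexicon_list, number=20)
set_time = timeit.timeit(lambda: probe in lexicon_set, number=20)

print("list: %.6f s, set: %.6f s" % (list_time, set_time))
```

The list lookup grows with the number of words, while the set lookup should stay roughly constant as the lexicon grows.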

EDIT: Here is the complete code for a (case-sensitive) spell checker (Python 2.6 or later):

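from io import open  # gives Python 2 the encoding-aware open() of Python 3
import re

# Load the lexicon, one word per line, into a set of Unicode strings.
with open("dictionary", encoding="utf-8") as f:
    words = set(line.strip() for line in f)

# Report every word of the document that is not in the lexicon.
with open("document", encoding="utf-8") as f:
    # re.UNICODE makes \w match non-ASCII letters under Python 2.
    for w in re.findall(r"\w+", f.read(), re.UNICODE):
        if w not in words:
            print "Misspelled:", w.encode("utf-8")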

(The print assumes your terminal uses UTF-8.)

score 1 · Accepted Answer

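As everyone has told you, try a set. Set lookup is optimized in Python's C code by experienced programmers, so you are unlikely to do better in your own small application.

Unicode is not a problem: set and dict keys can be Unicode or English text, it doesn't matter. The only thing you may need to consider is Unicode normalization, since the same character written with its diacritics in a different order will not compare equal. If that is a problem for your language, I would first make sure the lexicon is stored in a normalized form, then normalize each word before checking it. For example, unicodedata.normalize('NFC', word)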
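A minimal sketch of that normalization step, in Python 3 syntax. The helper name `normalize_word` and the sample strings are made up for illustration; only `unicodedata.normalize` comes from the standard library. It shows both a composed versus decomposed spelling and two combining marks typed in a different order.

```python
import unicodedata

def normalize_word(word):
    """Return the NFC-normalized form of a word (hypothetical helper)."""
    return unicodedata.normalize("NFC", word)

composed = "\u00e9"      # 'e with acute' as a single code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent

# The raw strings render the same but do not compare equal.
print(composed == decomposed)                                  # False
print(normalize_word(composed) == normalize_word(decomposed))  # True

# Same marks, different typing order: dot below + acute vs. acute + dot below.
order_a = "a\u0323\u0301"
order_b = "a\u0301\u0323"
print(order_a == order_b)                                      # False
print(normalize_word(order_a) == normalize_word(order_b))      # True

# Normalize the lexicon once, then normalize each word before the lookup.
lexicon = {normalize_word(w) for w in ["caf\u00e9"]}
print(normalize_word("cafe\u0301") in lexicon)                 # True
```

Normalizing the lexicon once when it is loaded keeps the per-word cost down to a single normalize call plus the set lookup.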

score 1 · Accepted Answer

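This is where sets come in. Create a set of all the words in your dictionary, then use the membership operator to check whether a word is present in it.

Here is a simplified example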

>>> dictionary = {'Python','check-like', 'will', 'perform','follows:', 'spelling', 'operation'}
>>> for word in "I will have to perform a spelling check-like operation in Python as follows:".split():
    if word in dictionary:
        print "Found {0} in the dictionary".format(word)
    else:
        print "{0} not present in the dictionary".format(word)


I not present in the dictionary
Found will in the dictionary
have not present in the dictionary
to not present in the dictionary
Found perform in the dictionary
a not present in the dictionary
Found spelling in the dictionary
Found check-like in the dictionary
Found operation in the dictionary
in not present in the dictionary
Found Python in the dictionary
as not present in the dictionary
Found follows: in the dictionary
>>>

score 1 · Accepted Answer

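Use a tree structure to store the words, such that each path from the root to a leaf represents a word. If your traversal cannot reach a leaf, or reaches a leaf before the end of the word, the word is not in your lexicon.

Besides the benefits Emil mentioned in the comments, note also that this lets you do things like backtracking to find alternative spellings.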
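A tree like this is usually called a trie. Here is a small sketch using nested dicts, in Python 3 syntax; the names `END`, `build_trie` and `contains` are invented for this example. One assumption that goes slightly beyond the description above: instead of requiring every word to end at a leaf, it stores an explicit end-of-word marker, so that a word which is a prefix of another word (lamb / lambda) is still recognized.

```python
END = object()  # sentinel key meaning "a word ends at this node"

def build_trie(words):
    """Build a trie as nested dicts, one level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def contains(trie, word):
    """Return True only if the exact word was inserted into the trie."""
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:      # ran out of tree before the word ended
            return False
    return END in node        # reject bare prefixes such as "lam"

trie = build_trie(["rabbit", "lamb", "calf", "lambda"])
print(contains(trie, "lamb"))    # True
print(contains(trie, "lam"))     # False: only a prefix, no word ends here
print(contains(trie, "lambda"))  # True
print(contains(trie, "wolf"))    # False
```

For plain membership tests a set will normally be faster in Python; the trie earns its keep when you want to walk shared prefixes, for example to backtrack and suggest alternative spellings.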

score 0 · Accepted Answer

Here is a post I wrote about checking these things. It is similar to how the Google suggest/spell checker works.

http://blog.mattalcock.com/2012/12/5/python-spell-checker/

Hope it helps.

score 0 · Accepted Answer

The average time complexity of a hash lookup in a Python dictionary is O(1). You can therefore use a "dictionary without values" (a.k.a. a set).

score 0 · Accepted Answer

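This is what Python dictionaries and sets are for! :) Store your lexicon in a dictionary if each word has some value (say, a frequency), or in a set if you only need to check for existence. Lookups in either are O(1), so it will be very fast.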

lex = set(('word1', 'word2', .....))

for w in words:
    if w not in lex:
        print "Error: %s" % w
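For the "each word has some value" case, a sketch along these lines might do (Python 3 syntax). The tiny corpus and the names `corpus`, `freq` and `report` are made up for illustration; `Counter` is just one convenient way to build a word-to-frequency dictionary.

```python
from collections import Counter

# Hypothetical corpus used only to derive some word frequencies.
corpus = "the rabbit saw the lamb and the calf".split()
freq = Counter(corpus)  # a dict subclass mapping word -> count

def report(word):
    """Describe a word using the frequency dictionary as the lexicon."""
    if word in freq:  # membership test on a dict, O(1) on average
        return "%s: ok, seen %d time(s)" % (word, freq[word])
    return "Error: %s" % word

for w in ["rabbit", "wolf", "the"]:
    print(report(w))
```

The membership test costs the same as with a set; the dictionary simply lets you keep a value next to each word.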

score 0 · Accepted Answer

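First, you need to create an index for your lexicon. You could build your own indexing system, but a better approach is to use a full-text search engine. I would recommend Apache Lucene or Sphinx; both are fast and open source. You can then send search queries from Python to the search engine and capture the replies.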
