algorithm - English misspelling correction sequences

Question

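I'm building a search engine. One of its features is to try to correct the spelling when nothing is found. I substitute the following phonetic sequences: ph<->f, ee<->i, oo<->u, ou<->o (colour<->color). Where can I find a complete list of such substitutions for English? Thank you.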

score 2 · Accepted Answer

You may want to start with the Wikipedia article on Soundex and then follow its "See also" links. (For example, Metaphone has a list of substitutions.)
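To give a feel for what such a phonetic code looks like, here is a minimal sketch of the commonly described American Soundex variant. The function name `soundex` and the `CODES` table layout are my own choices for illustration, not part of any library, and other Soundex variants differ in the details.

```python
# Letter groups that share a Soundex digit; vowels, y, h and w have no digit.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Return a 4-character Soundex code (illustrative sketch)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()          # the first letter is kept as-is
    prev = CODES.get(letters[0], "")     # digit of the previous coded letter
    for ch in letters[1:]:
        if ch in "hw":
            continue                     # h and w do not separate equal digits
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code               # emit only when the digit changes
        prev = code                      # a vowel resets prev to ""
    return (result + "000")[:4]          # pad or truncate to 4 characters


print(soundex("Robert"), soundex("Rupert"))   # both map to the same code
print(soundex("phantom"), soundex("fantom"))  # differ only in the first letter
```

Because Soundex keeps the first letter unchanged, it does not by itself treat ph and f as equivalent at the start of a word, so a substitution list like the one in the question is still useful there.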

score 2 · Accepted Answer

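If you are building a search engine, keep in mind that many web pages contain misspelled words, and you still need some strategy to make those pages searchable too. So there is no universal rule for implementing a spelling corrector, because correctness becomes a relative concept on the web. In practice, though, there are some tricks for doing it :-)

I suggest using an n-gram index plus Levenshtein distance (or any similar distance) to correct spelling.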

Strings with a small Levenshtein distance between them are probably variants of the same word.
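For reference, a minimal sketch of the distance itself, using the standard dynamic-programming formulation with two rows. The function name `levenshtein` is just a label chosen here; in a real system you would probably use an existing library implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a                      # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))   # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("fantoma", "phantom"))
```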

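Suppose you want to correct the word "fantoma". If you have a large number of words, iterating over the whole dictionary and computing the distance to every word is very expensive. So you need a very fast way to find the words that are likely to be at a small distance from "fantoma".

The main idea is to index n-grams (for example, bigrams) into a separate index while crawling and indexing web pages. Split each word into n-grams and add it to the n-gram index: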

1) Split each word from the dictionary into n-grams,
   for example: "phantom" -> ["ph", "ha", "an", "nt", "to", "om"]

2) Create index:
   ...
   "ph" -> [ "phantom", "pharmacy", "phenol", ... ]
   "ha" -> [ "phantom", "happy" ... ]
   "an" -> [ "phantom", "anatomy", ... ]
   ...
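The two steps above can be sketched in a few lines. This is only an illustration: the helper names `bigrams` and `build_index` and the tiny word list are made up for the example, and a real index would live in whatever storage your search engine already uses.

```python
from collections import defaultdict


def bigrams(word: str) -> list[str]:
    """Split a word into overlapping two-character n-grams."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def build_index(words):
    """Map every bigram to the set of dictionary words containing it."""
    index = defaultdict(set)
    for word in words:
        for gram in bigrams(word):
            index[gram].add(word)
    return index


# Hypothetical mini-dictionary, just for the demonstration.
words = ["phantom", "pharmacy", "phenol", "happy", "anatomy"]
index = build_index(words)

print(bigrams("phantom"))
print(sorted(index["ph"]))
```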

Now that you have the index, you can find candidates for your word very quickly.

For example:

1) "fantoma" -> ["fa", "an", "nt", "to", "om", "ma"]
2) get the lists of words for each n-gram from the index,
   and extract the most frequent words from these lists - these words are the candidates
3) calculate the Levenshtein distance to each candidate;
   the word with the smallest distance is probably the spell-corrected variant of the searched word.
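Putting the three steps together, a rough end-to-end sketch might look like this. All names (`suggest`, `max_candidates`, the sample dictionary) are assumptions made for the example, and breaking distance ties by the number of shared bigrams is my own choice rather than something the steps above prescribe.

```python
from collections import Counter, defaultdict


def bigrams(word):
    return [word[i:i + 2] for i in range(len(word) - 1)]


def levenshtein(a, b):
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def build_index(words):
    index = defaultdict(set)
    for word in words:
        for gram in bigrams(word):
            index[gram].add(word)
    return index


def suggest(query, index, max_candidates=10):
    """Return the most plausible correction for query, or None."""
    # Steps 1 and 2: count how many of the query's bigrams each word shares.
    shared = Counter()
    for gram in set(bigrams(query)):
        for word in index.get(gram, ()):
            shared[word] += 1
    candidates = [word for word, _ in shared.most_common(max_candidates)]
    if not candidates:
        return None
    # Step 3: smallest edit distance wins; ties go to the word that
    # shares more bigrams with the query, then to alphabetical order.
    return min(candidates,
               key=lambda w: (levenshtein(query, w), -shared[w], w))


# Hypothetical mini-dictionary, just for the demonstration.
index = build_index(["phantom", "pharmacy", "phenol",
                     "happy", "anatomy", "tomato"])
print(suggest("fantoma", index))
```

Only the words that share at least one bigram with the query are ever passed to `levenshtein`, which is the point of the index: the expensive distance computation runs on a short candidate list instead of the whole dictionary.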

I suggest taking a look at the book "Introduction to Information Retrieval".
