
I have read a lot of threads here discussing edit-distance-based fuzzy searches, which tools like Elasticsearch/Lucene provide out of the box, but my problem is a bit different. Suppose I have a dictionary of words, {'cat', 'cot', 'catalyst'}, and a character similarity relation f(x, y):

f(x, y) = 1, if characters x and y are similar
        = 0, otherwise

(These "similarities" can be specified by the programmer)

such that, say,

f('t', 'l') = 1
f('a', 'o') = 1
f('f', 't') = 1

but,

f('a', 'z') = 0
etc.

Now if we have a query 'cofatyst', the algorithm should report the following matches:

('cot', 0)
('cat', 0)
('catalyst', 0)

where the number is the 0-based starting index of the match found. I have tried the Aho-Corasick algorithm, and while it works great for exact matching and when each character has relatively few "similar" characters, its performance degrades exponentially as the number of similar characters per character grows. Can anyone point me to a better way of doing this? Fuzziness is an absolute necessity, and it must take into account character similarities (i.e., not blindly depend on edit distances alone).

One thing to note is that in the wild, the dictionary is going to be really large.
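
For concreteness, here is a minimal brute-force sketch in Python of the semantics I am after (the similar table holds just the example pairs above; this checks every word at every offset, which is exactly what does not scale, but it pins down the expected output):

    # Brute-force reference for the intended semantics; far too slow for a
    # large dictionary, shown only to make the problem precise.
    similar = {('t', 'l'), ('a', 'o'), ('f', 't')}  # programmer-specified pairs

    def f(x, y):
        # Characters match if identical or listed as similar (either order).
        return x == y or (x, y) in similar or (y, x) in similar

    def fuzzy_find(dictionary, query):
        matches = []
        for word in dictionary:
            for start in range(len(query) - len(word) + 1):
                if all(f(word[i], query[start + i]) for i in range(len(word))):
                    matches.append((word, start))
        return matches

    print(fuzzy_find({'cat', 'cot', 'catalyst'}, 'cofatyst'))
    # [('cot', 0), ('cat', 0), ('catalyst', 0)] (order depends on set iteration)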


2 Answers


I would probably try cosine similarity, using the position of each character as a feature and mapping the product between features through a matching function based on your character relations.

I know this is not a very specific suggestion, but I hope it helps.

Edit: expanded answer.

With cosine similarity you would be computing the similarity of two vectors. In your case, normalizing probably doesn't make sense, so I would do something very simple (I may be oversimplifying the problem): first, treat a CxC matrix as a dependency matrix holding the probability that two characters are related (e.g., P('t' | 'l') = 1). This would also allow partial dependencies, so you can distinguish perfect matches from partial ones. After that, for each position I would compute the probability that the letters of the two words are not the same (using the complement of P(t_i, t_j)), and then aggregate the results with a sum.

This counts the number of differing terms for a particular pair of words while letting you define partial dependencies. On top of that, the implementation is very simple and should scale well, which is why I'm not sure whether I have misunderstood your question.
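
A minimal sketch of the idea in Python (illustrative only; P holds just the question's example pairs, and all the names are mine):

    # P[(x, y)] is the probability that characters x and y are related;
    # values strictly between 0 and 1 would express partial dependencies.
    P = {('t', 'l'): 1.0, ('a', 'o'): 1.0, ('f', 't'): 1.0}

    def p_related(x, y):
        if x == y:
            return 1.0
        return max(P.get((x, y), 0.0), P.get((y, x), 0.0))

    def mismatch(word, window):
        # Sum of the complements 1 - P(t_i, t_j) over positions:
        # 0 means a perfect match, larger values mean more disagreement.
        return sum(1.0 - p_related(a, b) for a, b in zip(word, window))

    def search(dictionary, query, threshold=0.0):
        hits = []
        for word in dictionary:
            for start in range(len(query) - len(word) + 1):
                if mismatch(word, query[start:start + len(word)]) <= threshold:
                    hits.append((word, start))
        return hits

    print(search({'cat', 'cot', 'catalyst'}, 'cofatyst'))
    # all three words match at index 0, as in the question

With probabilities below 1 in the matrix, raising threshold above 0 is what admits the partial matches mentioned above.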

answered 2013-05-03T10:00:22.440

I am using the Fuse JavaScript library for one of my projects. It is a JavaScript file that works on JSON datasets, and it is quite fast. Take a look at it.
It implements a full Bitap algorithm, leveraging a modified version of Google's Diff, Match & Patch tool (from its site).

The code is simple and easy to understand, and the algorithm implementation is complete.
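
To give a feel for it, here is a rough Python sketch of Bitap's exact-matching core (illustrative only; this is not Fuse's actual code). A nice property for this question is that the per-pattern bitmasks can also be widened with the "similar" character pairs:

    # Shift-and (Bitap) sketch. Bit i of `state` is set when the last i+1
    # characters of the text read so far match pattern[0..i].
    def bitap_search(text, pattern, similar=()):
        m = len(pattern)
        masks = {}                       # char -> bitmask of pattern positions
        for i, c in enumerate(pattern):
            masks[c] = masks.get(c, 0) | (1 << i)
            for x, y in similar:         # let similar characters stand in too
                if x == c:
                    masks[y] = masks.get(y, 0) | (1 << i)
                if y == c:
                    masks[x] = masks.get(x, 0) | (1 << i)
        state = 0
        for j, c in enumerate(text):
            state = ((state << 1) | 1) & masks.get(c, 0)
            if state & (1 << (m - 1)):
                return j - m + 1         # 0-based start of the first match
        return -1

    print(bitap_search('cofatyst', 'cat', {('t', 'l'), ('a', 'o'), ('f', 't')}))
    # 0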

answered 2013-05-02T09:02:39.780