
I have read a lot of threads here discussing edit-distance-based fuzzy searches, which tools like Elasticsearch/Lucene provide out of the box, but my problem is a bit different. Suppose I have a dictionary of words, {'cat', 'cot', 'catalyst'}, and a character similarity relation f(x, y):

f(x, y) = 1, if characters x and y are similar
        = 0, otherwise

(These "similarities" can be specified by the programmer)

such that, say,

f('t', 'l') = 1
f('a', 'o') = 1
f('f', 't') = 1

but,

f('a', 'z') = 0
etc.

Now if we have a query 'cofatyst', the algorithm should report the following matches:

('cot', 0)
('cat', 0)
('catalyst', 0)

where the number is the 0-based starting index of the match found. I have tried the Aho-Corasick algorithm, and while it works great for exact matching and when each character has relatively few "similar" characters, its performance degrades exponentially as the number of similar characters per character grows. Can anyone point me to a better way of doing this? Fuzziness is an absolute necessity, and it must take into account character similarities (i.e., not blindly depend on edit distances alone).

One thing to note is that in the wild, the dictionary is going to be really large.
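
For concreteness, here is a minimal brute-force sketch in Python of the semantics I am after (the similar table holds just the example pairs above; this checks every word at every offset, which is exactly what does not scale, but it pins down the expected output):

    # Brute-force reference for the intended semantics; far too slow for a
    # large dictionary, shown only to make the problem precise.
    similar = {('t', 'l'), ('a', 'o'), ('f', 't')}  # programmer-specified pairs

    def f(x, y):
        # Characters match if identical or listed as similar (either order).
        return x == y or (x, y) in similar or (y, x) in similar

    def fuzzy_find(dictionary, query):
        matches = []
        for word in dictionary:
            for start in range(len(query) - len(word) + 1):
                if all(f(word[i], query[start + i]) for i in range(len(word))):
                    matches.append((word, start))
        return matches

    print(fuzzy_find({'cat', 'cot', 'catalyst'}, 'cofatyst'))
    # [('cot', 0), ('cat', 0), ('catalyst', 0)] (order depends on set iteration)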


2 Answers


I would probably try cosine similarity, using the position of each character as a feature and mapping the product between features through a matching function based on your character relations.

I know this is not a very specific suggestion, but I hope it helps.

Edit: expanded answer.

With cosine similarity you would be computing the similarity of two vectors. In your case, normalizing probably doesn't make sense, so I would do something very simple (I may be oversimplifying the problem): first, treat a CxC matrix as a dependency matrix holding the probability that two characters are related (e.g., P('t' | 'l') = 1). This would also allow partial dependencies, so you can distinguish perfect matches from partial ones. After that, for each position I would compute the probability that the letters of the two words are not the same (using the complement of P(t_i, t_j)), and then aggregate the results with a sum.

This counts the number of differing terms for a particular pair of words while letting you define partial dependencies. On top of that, the implementation is very simple and should scale well, which is why I'm not sure whether I have misunderstood your question.
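
A minimal sketch of the idea in Python (illustrative only; P holds just the question's example pairs, and all the names are mine):

    # P[(x, y)] is the probability that characters x and y are related;
    # values strictly between 0 and 1 would express partial dependencies.
    P = {('t', 'l'): 1.0, ('a', 'o'): 1.0, ('f', 't'): 1.0}

    def p_related(x, y):
        if x == y:
            return 1.0
        return max(P.get((x, y), 0.0), P.get((y, x), 0.0))

    def mismatch(word, window):
        # Sum of the complements 1 - P(t_i, t_j) over positions:
        # 0 means a perfect match, larger values mean more disagreement.
        return sum(1.0 - p_related(a, b) for a, b in zip(word, window))

    def search(dictionary, query, threshold=0.0):
        hits = []
        for word in dictionary:
            for start in range(len(query) - len(word) + 1):
                if mismatch(word, query[start:start + len(word)]) <= threshold:
                    hits.append((word, start))
        return hits

    print(search({'cat', 'cot', 'catalyst'}, 'cofatyst'))
    # all three words match at index 0, as in the question

With probabilities below 1 in the matrix, raising threshold above 0 is what admits the partial matches mentioned above.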

answered 2013-05-03T10:00:22.440

I am using the Fuse JavaScript library for one of my projects. It is a JavaScript file that works on JSON datasets, and it is quite fast. Take a look at it.
It implements a full Bitap algorithm, leveraging a modified version of Google's Diff, Match & Patch tool (from its site).

The code is simple and easy to understand, and the algorithm implementation is complete.
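
To give a feel for it, here is a rough Python sketch of Bitap's exact-matching core (illustrative only; this is not Fuse's actual code). A nice property for this question is that the per-pattern bitmasks can also be widened with the "similar" character pairs:

    # Shift-and (Bitap) sketch. Bit i of `state` is set when the last i+1
    # characters of the text read so far match pattern[0..i].
    def bitap_search(text, pattern, similar=()):
        m = len(pattern)
        masks = {}                       # char -> bitmask of pattern positions
        for i, c in enumerate(pattern):
            masks[c] = masks.get(c, 0) | (1 << i)
            for x, y in similar:         # let similar characters stand in too
                if x == c:
                    masks[y] = masks.get(y, 0) | (1 << i)
                if y == c:
                    masks[x] = masks.get(x, 0) | (1 << i)
        state = 0
        for j, c in enumerate(text):
            state = ((state << 1) | 1) & masks.get(c, 0)
            if state & (1 << (m - 1)):
                return j - m + 1         # 0-based start of the first match
        return -1

    print(bitap_search('cofatyst', 'cat', {('t', 'l'), ('a', 'o'), ('f', 't')}))
    # 0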

answered 2013-05-02T09:02:39.780