3

I am trying to find a way to build a fuzzy search where both the text database and the queries may have spelling variants. In particular, the text database is material collected from the web and would likely not benefit from a full-text engine's prep phase (word stemming). I could imagine using pg_trgm as a starting point and then validating hits by Levenshtein distance.

However, people tend to do prefix queries. E.g., in the realm of music, I would expect "beetho symphony" to be a reasonable search term. So, if someone were typing "betho symphony", is there a reasonable way (using PostgreSQL, perhaps with Tcl or Perl scripting) to discover that the "betho" part should be compared with "beetho" (returning an edit distance of 1)?

4

2 Answers

1

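What I ended up with is a simple modification of the general algorithm. Normally one would take the last value from the matrix or from the pair of vectors. Referring to the "iterative" algorithm at http://en.wikipedia.org/wiki/Levenshtein_distance, I pass the string to probe (the database word) as the first argument and the query string as the second. When the algorithm finishes, the minimum value in the result column gives the correct result, because each entry in that column is the distance between the whole query and one prefix of the database word.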

Example results:
query "fantas", database words "fantasy", "fantastic" => 0
query "fantas", database word "fan" => 3

The inputs to the edit distance are words selected from a "most words" list by trigram similarity.
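A minimal Python sketch of the modification described above, for illustration only. The function name `prefix_edit_distance` is made up here, and the two-row layout follows the Wikipedia "iterative" variant; instead of returning the final cell, it tracks the minimum over the last column.

```python
def prefix_edit_distance(word: str, query: str) -> int:
    """Smallest Levenshtein distance between `query` and any prefix of `word`.

    Hypothetical helper name; `word` is the database word (first argument),
    `query` is the user's possibly truncated search term (second argument).
    """
    n = len(query)
    # prev[j] = distance between word[:i-1] and query[:j]; starts with i-1 = 0
    prev = list(range(n + 1))
    # Last column entry for the empty prefix of `word`
    best = prev[n]
    for i, wc in enumerate(word, start=1):
        cur = [i] + [0] * n
        for j, qc in enumerate(query, start=1):
            cost = 0 if wc == qc else 1
            cur[j] = min(
                prev[j] + 1,         # delete a character of word
                cur[j - 1] + 1,      # insert a character of query
                prev[j - 1] + cost,  # match or substitute
            )
        # cur[n] is the distance between word[:i] and the whole query
        best = min(best, cur[n])
        prev = cur
    return best


if __name__ == "__main__":
    for w in ("fantasy", "fantastic", "fan", "beethoven"):
        print(w, prefix_edit_distance(w, "fantas"))
```

With the question's example, `prefix_edit_distance("beethoven", "betho")` compares "betho" against the prefix "beetho" and reports a distance of 1.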

answered 2013-04-22T16:23:49.353
0

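You can modify the edit distance algorithm to give lower weight to the later part of the string.

For example: for each i and j, use Match(i,j) = 1/max(i,j)^2 instead of Match(i,j) = 1. (i and j are the positions of the symbols you are comparing.)

The effect is that a mismatch near the end of the string costs less than one near the start: dist('ABCD', 'ABCE') < dist('ABCD', 'EBCD').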
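A rough Python sketch of this weighting, under assumptions the answer does not spell out: Match(i,j) is read as the cost of substituting mismatched symbols, positions are 1-based, and insertions and deletions keep a cost of 1. The function name `weighted_edit_distance` is made up for illustration.

```python
def weighted_edit_distance(a: str, b: str) -> float:
    """Edit distance where a substitution at 1-based positions (i, j)
    costs 1 / max(i, j)**2 instead of 1.

    Assumption: insertions and deletions still cost 1 each.
    """
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                sub = 0.0
            else:
                # Later positions get a smaller mismatch penalty
                sub = 1.0 / max(i, j) ** 2
            d[i][j] = min(
                d[i - 1][j] + 1.0,      # deletion
                d[i][j - 1] + 1.0,      # insertion
                d[i - 1][j - 1] + sub,  # match or weighted substitution
            )
    return d[m][n]


if __name__ == "__main__":
    print(weighted_edit_distance("ABCD", "ABCE"))
    print(weighted_edit_distance("ABCD", "EBCD"))
```

Under these assumptions the mismatch in the fourth position costs 1/16, while the mismatch in the first position costs 1, which reproduces the inequality above.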

answered 2013-04-16T17:45:46.477