您可以尝试使用带有字符串距离度量的bk-tree,例如levenshtein ,另请参阅http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees。
编辑:
发明距离度量很困难,但我碰巧知道一个可以使用的称为结构熵距离的度量,它基于信息差异。它的工作原理如下:
取两个字符串 x = "Elvis Pr" 和 y = "Elvis Aaron Presley"
为每个计算出一元和二元的多重集:
x = {e, l, v, i, s, _, p, r, el, lv, vi, is, s_, _p, pr}
y = {ex3, lx2, v, i, sx2, _x2, ax2, rx2, o, n, p, y, el, lv, vi, is, s_, _a, aa, ar, ro, on, n_, _p, pr, re, es, sl, le, ey}
现在对于两者中的那些条款
{e, l, v, i, s, _, p, r, el, lv, vi, is, s_, _p, pr}
(f_x(t) / (f_x(t) + f_y(t)))^{f_x(t)/2} * (f_y(t) / (f_x(t) + f_y(t)))^{f_y(t)/2}
这样计算产品
e = ((1/15) / (1/15 + 3/37))^(1/30) * ((3/37) / (1/15 + 3/37))^(3/74)
l = ((1/15) / (1/15 + 2/37))^(1/30) * ((2/37) / (1/15 + 2/37))^(2/74)
v = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)
i = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)
s = ((1/15) / (1/15 + 2/37))^(1/30) * ((2/37) / (1/15 + 2/37))^(2/74)
_ = ((1/15) / (1/15 + 2/37))^(1/30) * ((2/37) / (1/15 + 2/37))^(2/74)
p = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)
r = ((1/15) / (1/15 + 2/37))^(1/30) * ((2/37) / (1/15 + 2/37))^(2/74)
el = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)
lv = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)
vi = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)
is = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)
s_ = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)
_p = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)
pr = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)
将所有这些相乘,您应该得到一个范围为 [0.5, 1] 的数字,因此您可以通过乘以 2 并减去 1 更有效地将其缩放到范围 [0,1]。
但是,这不是离散距离度量,因此您将不得不使用另一个度量索引,例如vp-tree