simhash captures the similarity between (small) strings. But it does not really solve the problem of querying the most similar strings out of a dataset bigger than RAM. I think the original paper suggests indexing a few permutations of the hash, which requires a lot of memory and does not take advantage of the ordered nature of an OKVS (ordered key-value store).
I think I found a hash function that captures the similarity between strings in the prefix of the hash:
In [1]: import fuzz
In [2]: hello = fuzz.bbkh("hello")
In [3]: helo = fuzz.bbkh("helo")
In [4]: hellooo = fuzz.bbkh("hellooo")
In [5]: salut = fuzz.bbkh("salut")
In [6]: len(fuzz.lcp(hello.hex(), helo.hex())) # Longest Common Prefix
Out[6]: 213
In [7]: len(fuzz.lcp(hello.hex(), hellooo.hex()))
Out[7]: 12
In [8]: len(fuzz.lcp(hello.hex(), salut.hex()))
Out[8]: 0
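The intuition: the hash is derived from the bigrams of the (padded) string, so strings sharing bigrams set the same bits in a one-hot encoding, and the tree of ORs built on top of those bits (see the implementation below) tends to turn shared bigrams into a shared hash prefix. A throwaway snippet, independent of the fuzz module, to see the overlap:

def bigrams(string):
    string = "$" + string + "$"
    return {string[i:i+2] for i in range(len(string) - 1)}

print(bigrams("hello") & bigrams("helo"))   # {'$h', 'he', 'el', 'lo', 'o$'}
print(bigrams("hello") & bigrams("salut"))  # set()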
After a quick test against wikidata labels, it seems to give good results:
$ time python fuzz.py query 10 france
* most similar according to bbk fuzzbuzz
** france 0
** farrance -2
** freande -2
** defrance -2
real 0m0.054s
$ time python fuzz.py query 10 frnace
* most similar according to bbk fuzzbuzz
** farnace -1
** france -2
** fernacre -2
real 0m0.060s
$ time python fuzz.py query 10 beglium
* most similar according to bbk fuzzbuzz
** belgium -2
real 0m0.047s
$ time python fuzz.py query 10 belgium
* most similar according to bbk fuzzbuzz
** belgium 0
** ajbelgium -2
real 0m0.059s
$ time python fuzz.py query 10 begium
* most similar according to bbk fuzzbuzz
** belgium -1
** beijum -2
real 0m0.047s
Here is an implementation:
from itertools import product
from string import ascii_lowercase

HASH_SIZE = 2**10
# 2 * HASH_SIZE bits in bytes: enough for the 2 * HASH_SIZE - 1 tree nodes.
BBKH_LENGTH = int(HASH_SIZE * 2 / 8)

# Alphabet of the one-hot encoding: lowercase letters plus "$", which
# also marks the start and the end of the string.
chars = ascii_lowercase + "$"
ONE_HOT_ENCODER = sorted([''.join(x) for x in product(chars, chars)])


def ngram(string, n):
    """Slide a window of size n over string."""
    return [string[i:i+n] for i in range(len(string) - n + 1)]


def integer2booleans(integer):
    """Convert integer into a fixed-width list of HASH_SIZE booleans."""
    return [x == '1' for x in bin(integer)[2:].zfill(HASH_SIZE)]


def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]


def merkletree(booleans):
    """Serialize a binary tree of ORs built on top of the boolean leaves.

    The tree is laid out top-down, so the first bits of the output are
    coarse unions of the input bits: two inputs sharing many leaves also
    share a long prefix of the serialization.
    """
    assert len(booleans) == HASH_SIZE
    length = (2 * len(booleans) - 1)
    out = [False] * length
    index = length - 1
    booleans = list(reversed(booleans))
    while len(booleans) > 1:
        # Store the current level, filling the output from the end.
        for boolean in booleans:
            out[index] = boolean
            index -= 1
        # Build the next level: OR of each pair of siblings.
        new = []
        for (right, left) in chunks(booleans, 2):
            value = right or left
            new.append(value)
        booleans = new
    return out


def bbkh(string):
    # One-hot encode the bigrams of the padded string into an integer.
    integer = 0
    string = "$" + string + "$"
    for gram in ngram(string, 2):
        hotbit = ONE_HOT_ENCODER.index(gram)
        hotinteger = 1 << hotbit
        integer = integer | hotinteger
    # Expand to booleans, build the OR-tree, and pack it into bytes.
    booleans = integer2booleans(integer)
    tree = merkletree(booleans)
    fuzz = ''.join('1' if x else '0' for x in tree)
    buzz = int(fuzz, 2)
    hash = buzz.to_bytes(BBKH_LENGTH, 'big')
    return hash


def lcp(a, b):
    """Longest Common Prefix between a and b."""
    out = []
    for x, y in zip(a, b):
        if x == y:
            out.append(x)
        else:
            break
    return ''.join(out)
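The query command used above is not reproduced here. A minimal sketch of the idea, assuming an in-memory sorted list stands in for the OKVS and that ranking candidates by shared hex prefix is good enough (the actual scoring in fuzz.py may differ):

import bisect

def make_index(labels):
    # An OKVS would store the same keys, sorted by construction.
    return sorted((bbkh(label), label) for label in labels)

def query(index, string, count=10):
    # Seek to the query hash, then look at neighbors on both sides.
    key = bbkh(string)
    position = bisect.bisect_left(index, (key,))
    start = max(0, position - count)
    candidates = index[start:position + count]
    scored = ((len(lcp(key.hex(), other.hex())), label)
              for other, label in candidates)
    return sorted(scored, reverse=True)[:count]

index = make_index(["france", "belgium", "germany", "farnace"])
print(query(index, "frnace"))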
Evaluation: using Wikipedia's list of common misspellings, around 8k words. Looking at the top 10 nearest neighbors, a query takes around 20 milliseconds and succeeds 77% of the time. Looking at the top 100, a query takes less than 200 milliseconds and succeeds 94% of the time. Most failures involve words that were joined together (e.g. "abouta" instead of "about a") or words separated by a dash.
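For reference, the measurement can be reproduced with a loop along the lines of the sketch below, reusing the query sketch above; the list of (typo, expected) pairs is assumed to come from the Wikipedia dataset:

def evaluate(index, pairs, count=10):
    # Success means the expected word shows up in the top `count` results.
    hits = 0
    for typo, expected in pairs:
        results = query(index, typo, count)
        if any(label == expected for _, label in results):
            hits += 1
    return hits / len(pairs)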
See the code at https://github.com/amirouche/fuzzbuzz/blob/master/typofix.py
Note: a simhash computed not over the input string but over its bag of lemmas or stems will, in effect, find similar documents.
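Adapting that idea to bbkh, a document-level hash might look like the sketch below; the crude suffix-stripping stemmer and the "$" joining scheme are made up for illustration:

import re

def crude_stem(word):
    # Placeholder for a real stemmer or lemmatizer.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def document_hash(text):
    # Hash the bag of stems instead of the raw string: documents that
    # share vocabulary end up with hashes sharing long prefixes.
    words = re.findall("[a-z]+", text.lower())
    stems = sorted({crude_stem(word) for word in words})
    return bbkh("$".join(stems))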