我在该主题上找到的最好的学术资源是《海量数据集挖掘》的第 3 章,它对局部敏感散列和 minhashing 进行了很棒的概述。
非常简单地说,这个想法是获取几个字符串,将这些字符串向量化,然后在结果向量上传递一个滑动窗口。如果两个向量在相同的窗口位置具有相同的值,则将它们标记为更细粒度的相似性分析的候选。
Python 数据草图库 ( pip install datasketch
) 中有一个很棒的实现。这是一个示例,显示您可以捕获模糊字符串相似性:
from datasketch import MinHash, MinHashLSH
from nltk import ngrams
data = ['minhash is a probabilistic data structure for estimating the similarity between datasets',
'finhash dis fa frobabilistic fata ftructure for festimating the fimilarity fetween fatasets',
'weights controls the relative importance between minizing false positive',
'wfights cfntrols the rflative ifportance befween minizing fflse posftive',
]
# Create an MinHashLSH index optimized for Jaccard threshold 0.5,
# that accepts MinHash objects with 128 permutations functions
lsh = MinHashLSH(threshold=0.4, num_perm=128)
# Create MinHash objects
minhashes = {}
for c, i in enumerate(data):
minhash = MinHash(num_perm=128)
for d in ngrams(i, 3):
minhash.update("".join(d).encode('utf-8'))
lsh.insert(c, minhash)
minhashes[c] = minhash
for i in xrange(len(minhashes.keys())):
result = lsh.query(minhashes[i])
print "Candidates with Jaccard similarity > 0.4 for input", i, ":", result
这将返回:
Candidates with Jaccard similarity > 0.4 for input 0 : [0, 1]
Candidates with Jaccard similarity > 0.4 for input 1 : [0, 1]
Candidates with Jaccard similarity > 0.4 for input 2 : [2, 3]
Candidates with Jaccard similarity > 0.4 for input 3 : [2, 3]