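Use Jaccard similarity. In the Python demo below, keep in mind that the functions cosine and jaccard return a distance, which is the "inverse" of similarity (distance = 1 - similarity), and do read the comments: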
# Input all the data
In [19]: from scipy.spatial.distance import cosine, jaccard
In [24]: a
Out[24]: array([ 1, 1, 15, 2, 0])
In [25]: b
Out[25]: array([ 0, 0, 15, 0, 0])
In [26]: c
Out[26]: array([ 1, 1, 11, 0, 1])
# Calculate cosine distance. I've scaled it by a factor of 100 for legibility
In [20]: 100*cosine(a,b)
Out[20]: 1.3072457560346473
In [21]: 100*cosine(c,a)
Out[21]: 1.3267032349480568
# Note c is slightly "further away" from a than b.
# Now let's see what Mr Jaccard has to say
In [28]: jaccard(a,b)
Out[28]: 0.75
In [29]: jaccard(a,c)
Out[29]: 0.59999999999999998
# Behold the desired effect: c is now considerably closer to a than b is
# Sanity check: the distance between a and a is 0
In [30]: jaccard(a,a)
Out[30]: 0.0
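For anyone who wants to reproduce the numbers above without an IPython session, here is a minimal plain-Python sketch. The helper names cosine_distance and jaccard_distance are mine, not SciPy's, and the Jaccard variant is an assumption chosen because it matches the outputs shown: among the positions where at least one vector is nonzero, it takes the fraction where the two values differ. It deliberately avoids calling the library so that the arithmetic is visible.

```python
import math

# The three vectors from the session above.
a = [1, 1, 15, 2, 0]
b = [0, 0, 15, 0, 0]
c = [1, 1, 11, 0, 1]


def cosine_distance(u, v):
    """Return 1 minus the cosine of the angle between u and v."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


def jaccard_distance(u, v):
    """Hypothetical helper: fraction of differing positions among those
    where at least one of the two vectors is nonzero."""
    active = [(x, y) for x, y in zip(u, v) if x != 0 or y != 0]
    if not active:
        return 0.0  # two all-zero vectors are treated as identical
    differing = sum(1 for x, y in active if x != y)
    return differing / len(active)


# Cosine distance barely separates b and c as neighbours of a.
print(100 * cosine_distance(a, b))  # roughly 1.307
print(100 * cosine_distance(c, a))  # roughly 1.327

# The Jaccard-style distance puts c clearly closer to a than b is.
print(jaccard_distance(a, b))  # 0.75
print(jaccard_distance(a, c))  # 0.6
print(jaccard_distance(a, a))  # 0.0
```

Subtracting either distance from 1 gives the corresponding similarity, so the 0.6 distance between a and c corresponds to a similarity of 0.4.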
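PS: Many more similarity measures exist, each suited to a different situation. Do you have a good reason to believe that c should be more similar to a than b is? What is your task? If you want to learn more about the topic, I highly recommend this PhD thesis. Warning: it is 200 pages long.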