python - Finding the cosine similarity between 2 number datasets using Python

Question

I have number datasets of length 22, where each number lies between 0 and 1 and represents the percentage of that attribute.

[0.03, 0.15, 0.58, 0.1, 0, 0, 0.05, 0, 0, 0.07, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.01, 0]


[0.9, 0, 0.06, 0.02, 0, 0, 0, 0, 0.02, 0, 0, 0.01, 0, 0, 0, 0, 0.01, 0, 0, 0, 0, 0]


[0.01, 0.07, 0.59, 0.2, 0, 0, 0, 0, 0, 0.05, 0, 0, 0, 0, 0, 0, 0.07, 0, 0, 0, 0, 0]


[0.55, 0.12, 0.26, 0.01, 0, 0, 0, 0.01, 0.02, 0, 0, 0.01, 0, 0, 0.01, 0, 0.01, 0, 0, 0, 0, 0]


[0, 0.46, 0.43, 0.05, 0, 0, 0, 0, 0, 0, 0, 0.02, 0, 0, 0, 0, 0.02, 0.02, 0, 0, 0, 0]

How can I calculate the cosine similarity between two of these datasets using Python?

score 4 · Accepted Answer

By the definition of cosine similarity, you only need to compute the normalized dot product of the two vectors a and b:

import numpy as np

a = [0.03, 0.15, 0.58, 0.1, 0, 0, 0.05, 0, 0, 0.07, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.01, 0]
b = [0.9, 0, 0.06, 0.02, 0, 0, 0, 0, 0.02, 0, 0, 0.01, 0, 0, 0, 0, 0.01, 0, 0, 0, 0, 0]

# dot product divided by the product of the two Euclidean norms
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

Output (rounded):

0.115081383219
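Since the question lists five datasets, the same normalized dot product can be computed for every pair at once. The following is a minimal sketch rather than a definitive implementation: `pairwise_cosine` is a hypothetical helper name, and it assumes that no row is all zeros (otherwise the division would be undefined).

```python
import numpy as np

# The five datasets from the question, one per row.
data = np.array([
    [0.03, 0.15, 0.58, 0.1, 0, 0, 0.05, 0, 0, 0.07, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.01, 0],
    [0.9, 0, 0.06, 0.02, 0, 0, 0, 0, 0.02, 0, 0, 0.01, 0, 0, 0, 0, 0.01, 0, 0, 0, 0, 0],
    [0.01, 0.07, 0.59, 0.2, 0, 0, 0, 0, 0, 0.05, 0, 0, 0, 0, 0, 0, 0.07, 0, 0, 0, 0, 0],
    [0.55, 0.12, 0.26, 0.01, 0, 0, 0, 0.01, 0.02, 0, 0, 0.01, 0, 0, 0.01, 0, 0.01, 0, 0, 0, 0, 0],
    [0, 0.46, 0.43, 0.05, 0, 0, 0, 0, 0, 0, 0, 0.02, 0, 0, 0, 0, 0.02, 0.02, 0, 0, 0, 0],
])


def pairwise_cosine(rows):
    """Return the matrix of cosine similarities between all pairs of rows.

    Hypothetical helper; assumes every row has a non-zero norm.
    """
    norms = np.linalg.norm(rows, axis=1, keepdims=True)  # shape (n, 1)
    unit = rows / norms                                   # scale each row to length 1
    return unit @ unit.T                                  # dot products of unit vectors


sim = pairwise_cosine(data)
print(np.round(sim, 3))
```

Entry `sim[i, j]` is the similarity between dataset `i` and dataset `j`, so `sim[0, 1]` corresponds to the pair `a`, `b` used above and the diagonal is 1.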

score 1 · Accepted Answer

Without relying on numpy, you can do:

result = (sum(ax * bx for ax, bx in zip(a, b)) /
          (sum(ax**2 for ax in a) *
           sum(bx**2 for bx in b))**0.5)
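The pure-Python approach can also be wrapped in a small reusable function. This is only a sketch: `cosine_sim` is a hypothetical name, and the length check and the zero-vector check are assumptions added here for safety, not part of the answer.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity of two equal-length sequences of numbers.

    Hypothetical helper; raises ValueError for mismatched lengths
    or for an all-zero vector, where the similarity is undefined.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


a = [0.03, 0.15, 0.58, 0.1, 0, 0, 0.05, 0, 0, 0.07, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.01, 0]
b = [0.9, 0, 0.06, 0.02, 0, 0, 0, 0, 0.02, 0, 0, 0.01, 0, 0, 0, 0, 0.01, 0, 0, 0, 0, 0]

print(cosine_sim(a, b))
```

Note that the denominator is the product of the two norms, which matches the numpy version in the accepted answer.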

score 0 · Accepted Answer

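You can use the cosine_similarity function from sklearn directly:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# cosine_similarity expects 2D input: one row per sample
cosine_similarity(np.array([[1, 2, 3]]), np.array([[4, 5, 6]]))[0][0]

Output:

0.97463184619707621

Note that sklearn's pairwise functions operate on 2D arrays (matrices), so each vector is passed as a single row. If you pass 1D arrays instead, older versions emit the following warning, and, as the warning itself says, later versions raise an error:

DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample

To get the final value as a scalar, index the returned 1x1 matrix with [0][0].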
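The reshape that the warning suggests, and the reason for the `[0][0]` indexing, can be illustrated with numpy alone. This sketch does not call sklearn; it reproduces the pairwise computation by hand under the assumption that each input holds a single sample, so the result is a 1x1 matrix.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# reshape(1, -1) turns a 1D array into a single-row 2D array (one sample).
A = a.reshape(1, -1)
B = b.reshape(1, -1)

# Pairwise cosine similarity between the rows of A and the rows of B.
norms_a = np.linalg.norm(A, axis=1)[:, None]  # shape (1, 1)
norms_b = np.linalg.norm(B, axis=1)[None, :]  # shape (1, 1)
sim = (A @ B.T) / (norms_a * norms_b)

print(A.shape, sim.shape)  # (1, 3) (1, 1)
print(sim[0][0])           # the scalar similarity
```

With more than one row per input, `sim` would have one entry per pair of rows, which is why the pairwise result is a matrix rather than a scalar.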
