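I am trying to use scikit's nearest neighbor implementation to find the column vectors closest to a given column vector in a matrix of random values.
This code is supposed to find the nearest neighbors of column 21, then check the actual cosine similarity between those neighbors and column 21.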
from sklearn.neighbors import NearestNeighbors
import sklearn.metrics.pairwise as smp
import numpy as np
test=np.random.randint(0,5,(50,50))
nbrs = NearestNeighbors(n_neighbors=5, algorithm='auto', metric=smp.cosine_similarity).fit(test)
distances, indices = nbrs.kneighbors(test)
x=21
for idx,d in enumerate(indices[x]):
    sim2 = smp.cosine_similarity(test[:,x],test[:,d])
    print "sklearns cosine similarity would be ", sim2
    print 'sklearns reported distance is', distances[x][idx]
    print 'sklearns if that distance was cosine, the similarity would be: ' ,1- distances[x][idx]
The output looks like:
sklearns cosine similarity would be [[ 0.66190748]]
sklearns reported distance is 0.616586738214
sklearns if that distance was cosine, the similarity would be: 0.383413261786
So the output of kneighbors is neither the cosine distance nor the cosine similarity. What gives?
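Also, as an aside, I thought sklearn's nearest neighbors implementation was not an approximate nearest neighbors method. Yet it does not seem to find the actual best neighbors in my dataset, compared with the results I get when I iterate over the matrix and check the similarity of column 211 to all other columns. Am I misunderstanding something basic here?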
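A brute-force check of the kind described above could look roughly like the following. This is a minimal sketch using only NumPy, not the original code: the function name `brute_force_cosine_neighbors`, the parameter `k`, and the fixed random seed are assumptions introduced for illustration.

```python
import numpy as np


def brute_force_cosine_neighbors(matrix, col, k=5):
    """Rank the columns of `matrix` by cosine similarity to column `col`.

    Hypothetical helper for illustration; returns (indices, similarities)
    of the k most similar columns, including `col` itself.
    """
    cols = matrix.T.astype(float)           # one row per original column
    norms = np.linalg.norm(cols, axis=1)
    norms[norms == 0] = 1.0                 # avoid dividing by zero for all-zero columns
    unit = cols / norms[:, None]            # scale every column to unit length
    sims = unit @ unit[col]                 # dot products of unit vectors = cosine similarity
    order = np.argsort(-sims, kind="stable")[:k]  # highest similarity first
    return order, sims[order]


rng = np.random.RandomState(0)              # fixed seed (assumption) for repeatability
test = rng.randint(0, 5, (50, 50))
idx, sims = brute_force_cosine_neighbors(test, 21)
for i, s in zip(idx, sims):
    print("column", i, "cosine similarity", s)
```

The first result should be column 21 itself, with a similarity of about 1. Note that this sketch compares columns, whereas `NearestNeighbors.fit` treats each row of its input as one sample, so the two are only comparable if the matrix is laid out the same way in both.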