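I am comparing the Jaccard distance matrix produced by pdist with the one produced by a DIY Jaccard distance matrix function on the same dataset. The two output matrices differ, and I am not sure why.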
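I think the cause is one of the following:
- My implementation of the Jaccard distance calculation is wrong
- scipy.spatial.distance.pdist (metric = 'jaccard') and scipy.spatial.distance.jaccard compute the Jaccard distance in different ways (this seems unlikely, since both live in scipy.spatial.distance)
- squareform is doing something to my data, possibly normalization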
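The documentation for squareform is a bit over my head, so some form of normalization may be what is happening. However, the two square distance matrices do not even have the same relative magnitudes between cells, which is confusing (for example, row 0 of my DIY distance matrix is 0, 0.571429, 1, while row 0 from pdist is 0, 1, 1; the middle value is nearly twice as large in pdist).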
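To test the squareform hypothesis in isolation, here is a minimal sketch, assuming squareform only rearranges the condensed vector returned by pdist into a symmetric matrix with a zero diagonal. The values in condensed are made up for illustration and are not from my data.

```python
import numpy as np
from scipy.spatial.distance import squareform

# Hypothetical condensed distances for 3 points, in the order (0,1), (0,2), (1,2).
condensed = np.array([0.25, 0.5, 0.75])

# Expand the condensed vector into a 3x3 symmetric matrix.
square = squareform(condensed)
print(square)

# Converting back should reproduce the input exactly if nothing is rescaled.
round_trip = squareform(square)
print(np.array_equal(round_trip, condensed))
```

If the round trip reproduces the input, squareform is not normalizing anything, and the difference has to come from pdist or from my own function.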
Can anyone explain why I get different distance matrices when analysing the same data with the same metric?
My code:
import numpy as np
from scipy.spatial.distance import jaccard, squareform, pdist

def jaccard_dissimilarity(feature_list1, feature_list2, filler_val): #binary
    #I don't care about every value in the array for my use case, so dont want to include them in my comparison
    all_features = set([i for i in feature_list1 if i != filler_val])
    all_features.update(set([i for i in feature_list2 if i != filler_val]))
    counts_1 = [1 if feature in feature_list1 else 0 for feature in all_features]
    counts_2 = [1 if feature in feature_list2 else 0 for feature in all_features]
    return jaccard(counts_1, counts_2)

data_array = np.array([[1, 2, 3, 4, 5],
                       [3, 4, 5, 6, 7],
                       [8, 9, 10, 11, 12]])

# =============================================================================
# DIY distance matrix
# =============================================================================
#set filler val to None, so the arrays being compared are equivalent to pdist
dist_diy = np.array([[jaccard_dissimilarity(a,b, None) for a in data_array] for b in data_array])

# =============================================================================
# pdist distance matrix
# =============================================================================
dist_pdist = squareform(pdist(data_array, metric = 'jaccard'))
Input array:
1 2 3 4 5
3 4 5 6 7
8 9 10 11 12
dist_diy:
0 0.571429 1
0.571429 0 1
1 1 0
dist_pdist:
0 1 1
1 0 1
1 1 0
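As a sanity check on the 0.571429 entry, the set-based Jaccard distance for rows 0 and 1 can be worked out by hand without SciPy. The second calculation is only a guess at an alternative, position-by-position reading of the same rows. That reading is my assumption and not something I have confirmed from the SciPy documentation, but it happens to reproduce the 1 that pdist reports.

```python
row0 = [1, 2, 3, 4, 5]
row1 = [3, 4, 5, 6, 7]

# Set-based Jaccard distance: 1 - |A intersect B| / |A union B|.
a, b = set(row0), set(row1)
set_distance = 1 - len(a & b) / len(a | b)
print(round(set_distance, 6))  # 0.571429

# Hypothetical position-wise reading (an assumption, not confirmed library behaviour):
# among positions where at least one value is nonzero, the fraction whose values differ.
considered = [(x, y) for x, y in zip(row0, row1) if x != 0 or y != 0]
mismatches = sum(1 for x, y in considered if x != y)
positional_distance = mismatches / len(considered)
print(positional_distance)  # 1.0
```

The set-based value matches my DIY matrix, so the two results may simply come from treating each row as a set of features versus as a vector compared position by position.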