python - How to build a distance or dissimilarity matrix?

Question

I have a df like this:

0    111155555511111116666611111111
1    555555111111111116666611222222
2    221111114444411111111777777777
3    111111116666666661111111111111
.......
1000  114444111111111111555555111111

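I am computing the distance between each pair of strings. For example, to get the distance between the first two strings: textdistance.hamming(df[0], df[1]). This returns an integer.

Now I want to create a df that stores the distances between every pair of strings. Since I have 1000 strings, this will be a 1000 x 1000 df. The first value is the distance between string 1 and itself, then string 1 and string 2, and so on. The next row holds string 2 and string 1, string 2 and itself, and so on.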

score 2 · Accepted Answer

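Create all combinations of the Series values, compute the Hamming distance for each pair in a list, then convert the list to an array and reshape it into a DataFrame: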

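import numpy as np
import pandas as pd
import textdistance
from itertools import product

# df is assumed to be the Series of strings; product(df, repeat=2) yields every ordered pair
L = [textdistance.hamming(x, y) for x, y in product(df, repeat=2)]
# reshape the flat list of n * n distances into an n x n table
df = pd.DataFrame(np.array(L).reshape(len(df), len(df)))
print(df)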
    0   1   2   3   4
0   0  14  24  18  15
1  14   0  24  26  19
2  24  24   0  20  23
3  18  26  20   0  19
4  15  19  23  19   0
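
Hamming distance is symmetric and is zero between a string and itself, so only the pairs above the diagonal actually need to be computed. The following is a minimal pure-Python sketch of that idea, assuming all strings have the same length. The `hamming` and `distance_matrix` helpers are illustrative names, not part of textdistance, and the input is the first four strings from the question.

```python
from itertools import combinations


def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


def distance_matrix(strings, dist=hamming):
    """Fill a full symmetric matrix while computing each unordered pair once."""
    n = len(strings)
    matrix = [[0] * n for _ in range(n)]  # diagonal stays 0
    for i, j in combinations(range(n), 2):
        d = dist(strings[i], strings[j])
        matrix[i][j] = d
        matrix[j][i] = d  # mirror into the lower triangle
    return matrix


# First four strings from the question
strings = [
    "111155555511111116666611111111",
    "555555111111111116666611222222",
    "221111114444411111111777777777",
    "111111116666666661111111111111",
]

matrix = distance_matrix(strings)
for row in matrix:
    print(row)
# [0, 14, 24, 18]
# [14, 0, 24, 26]
# [24, 24, 0, 20]
# [18, 26, 20, 0]
```

This roughly halves the number of distance calls compared with taking the full product. The nested list can be passed to pd.DataFrame(matrix) to get the same table as above.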

Edit:

For better performance, use this solution with a modified lambda function:

import numpy as np
import pandas as pd
import textdistance
from scipy.spatial.distance import pdist, squareform

# df is the original Series of strings
# prepare a 2-dimensional array M x N (M entries, one per string, with N = 1 dimension)
transformed_strings = np.array(df).reshape(-1, 1)

# calculate the condensed distance matrix by wrapping the hamming distance function
distance_matrix = pdist(transformed_strings, lambda x, y: textdistance.hamming(x[0], y[0]))

# get the square matrix
df1 = pd.DataFrame(squareform(distance_matrix), dtype=int)
print(df1)
    0   1   2   3   4
0   0  14  24  18  15
1  14   0  24  26  19
2  24  24   0  20  23
3  18  26  20   0  19
4  15  19  23  19   0
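
When every string has the same length, as in the sample data, the per-pair Python call can be avoided altogether by comparing characters with NumPy broadcasting. This is a hedged alternative sketch and not part of the original answer. It assumes equal-length strings and builds an intermediate boolean array of size n x n x string length, which is about 30 MB for 1000 strings of 30 characters.

```python
import numpy as np

# First four strings from the question
strings = [
    "111155555511111116666611111111",
    "555555111111111116666611222222",
    "221111114444411111111777777777",
    "111111116666666661111111111111",
]

# Split each string into single characters: shape (n_strings, string_length)
chars = np.array([list(s) for s in strings])

# Compare every row with every other row, then count mismatching positions
distances = (chars[:, None, :] != chars[None, :, :]).sum(axis=2)

print(distances)
# [[ 0 14 24 18]
#  [14  0 24 26]
#  [24 24  0 20]
#  [18 26 20  0]]
```

The result is a plain integer array, so pd.DataFrame(distances) gives the same square table as df1.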
