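Create all combinations of the values of the Series, compute the Hamming distance for each pair in a list, then convert the list to an array and reshape it for the DataFrame: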
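import numpy as np
import pandas as pd
import textdistance
from itertools import product

# df is the Series of strings; product(..., repeat=2) yields every ordered pair
L = [textdistance.hamming(x, y) for x, y in product(df, repeat=2)]
# reshape the flat list into a square len(df) x len(df) matrix (input df is kept intact)
df_dist = pd.DataFrame(np.array(L).reshape(len(df), len(df)))
print(df_dist)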
    0   1   2   3   4
0   0  14  24  18  15
1  14   0  24  26  19
2  24  24   0  20  23
3  18  26  20   0  19
4  15  19  23  19   0
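To make the combine-then-reshape step easier to follow, here is a minimal dependency-free sketch of the same idea. The `hamming` helper and the sample strings are hypothetical stand-ins, not the `textdistance` implementation or the original data; counting extra trailing characters as mismatches is an assumption of this sketch.

```python
from itertools import product, zip_longest


def hamming(a, b):
    """Count positions where a and b differ (hypothetical helper).

    Assumption: when the lengths differ, each extra character counts as a mismatch.
    """
    return sum(x != y for x, y in zip_longest(a, b))


# Hypothetical sample values standing in for the Series of strings.
values = ["karolin", "kathrin", "kerstin"]
n = len(values)

# Flat list of n * n distances, in the row-major order that product() yields.
flat = [hamming(x, y) for x, y in product(values, repeat=2)]

# Cut the flat list into n rows of n entries: a pure-Python analogue of reshape(n, n).
matrix = [flat[i * n:(i + 1) * n] for i in range(n)]

for row in matrix:
    print(row)
```

Because `product` visits the pairs row by row, entry `matrix[i][j]` is the distance between `values[i]` and `values[j]`, which is why a plain reshape is enough.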
EDIT:
For better performance, use this solution with a changed lambda function:
import numpy as np
import pandas as pd
import textdistance
from scipy.spatial.distance import pdist, squareform

# prepare a 2-dimensional M x N array (M entries, each with N = 1 dimension)
transformed_strings = np.array(df).reshape(-1, 1)
# calculate the condensed distance matrix by wrapping the hamming distance function
distance_matrix = pdist(transformed_strings, lambda x, y: textdistance.hamming(x[0], y[0]))
# get the square matrix
df1 = pd.DataFrame(squareform(distance_matrix), dtype=int)
print(df1)
    0   1   2   3   4
0   0  14  24  18  15
1  14   0  24  26  19
2  24  24   0  20  23
3  18  26  20   0  19
4  15  19  23  19   0
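One likely reason the condensed approach does less work: the matrix is symmetric with a zero diagonal, so only the n*(n-1)/2 pairs with i < j need to be computed, and the square form can be filled in afterwards. The sketch below illustrates that idea in plain Python; it is not how SciPy implements `pdist` or `squareform`, and the `hamming` helper and sample strings are hypothetical.

```python
from itertools import combinations, zip_longest


def hamming(a, b):
    """Hypothetical helper: count differing positions, extra characters count as mismatches."""
    return sum(x != y for x, y in zip_longest(a, b))


# Hypothetical sample values standing in for the Series of strings.
values = ["karolin", "kathrin", "kerstin"]
n = len(values)

# Condensed form: one distance per unordered pair (i < j).
pairs = list(combinations(range(n), 2))
condensed = [hamming(values[i], values[j]) for i, j in pairs]

# Expand to the square form: mirror each value across the diagonal,
# and leave the diagonal at 0 (distance of a string to itself).
square = [[0] * n for _ in range(n)]
for (i, j), d in zip(pairs, condensed):
    square[i][j] = d
    square[j][i] = d

for row in square:
    print(row)
```

For 5 strings this means 10 distance computations instead of the 25 made by the `product`-based version.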