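I wrote a function that computes pairwise correlations between the columns of a matrix (like the built-in pdist in scipy.spatial.distance), but that can handle missing values, which are specified by the na_values parameter. That is: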
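from numpy import array

def my_pdist(X, dist_func, na_values=["NA"]):
    # Object dtype lets numeric entries and NA markers share one array.
    X = array(X, dtype=object)
    num_rows, num_cols = X.shape
    dist_matrix = []
    for col1 in range(num_cols):
        pdist_row = []
        for col2 in range(num_cols):
            # Keep only the rows where neither column holds a missing value.
            pairs = array([[x, y] for x, y in zip(X[:, col1], X[:, col2])
                           if (x not in na_values) and (y not in na_values)])
            if len(pairs) == 0:
                # No shared observations: record NaN so every row keeps num_cols entries.
                pdist_row.append(float("nan"))
                continue
            dist = dist_func(pairs[:, 0], pairs[:, 1])
            pdist_row.append(dist)
        dist_matrix.append(pdist_row)
    dist_matrix = array(dist_matrix)
    return dist_matrix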
where dist_func is a function that specifies the distance metric. Is there a way to speed this function up? An example of its use is:
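import scipy.stats
from numpy import array

def spearman_dist(u, v, na_vals=["NA"]):
    # Drop pairs in which either element is a missing-value marker.
    matrix = [[x, y] for x, y in zip(u, v)
              if (x not in na_vals) and (y not in na_vals)]
    matrix = array(matrix)
    # spearmanr returns (correlation, p-value); keep the correlation.
    spearman = scipy.stats.spearmanr(matrix[:, 0], matrix[:, 1])[0]
    return 1 - spearman

my_pdist(X, spearman_dist, na_values=["NA"])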
How can this be vectorized or otherwise made faster?
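For reference, here is a minimal sketch of one direction that might help, not a definitive answer. It assumes every non-missing entry can be converted to float and that the metric is fixed to 1 minus the Spearman correlation. The names to_float_with_nan and pairwise_spearman_dist are hypothetical helpers introduced only for this illustration. The idea is to convert the NA markers to NaN once, build a boolean validity mask once, and visit only the upper triangle of column pairs, since the distance is symmetric. The Spearman correlation is computed as the Pearson correlation of the ranks of the jointly observed rows.

```python
import numpy as np
from scipy.stats import rankdata


def to_float_with_nan(X, na_values=("NA",)):
    """Hypothetical helper: copy X into a float array, mapping NA markers to NaN."""
    X = np.asarray(X, dtype=object)
    out = np.empty(X.shape, dtype=float)
    for idx, value in np.ndenumerate(X):
        out[idx] = np.nan if value in na_values else float(value)
    return out


def pairwise_spearman_dist(X, na_values=("NA",)):
    """Hypothetical helper: 1 - Spearman correlation for every pair of columns,
    using only the rows where both columns are observed."""
    data = to_float_with_nan(X, na_values)
    valid = ~np.isnan(data)              # missing-value mask, computed once
    n_cols = data.shape[1]
    dist = np.full((n_cols, n_cols), np.nan)
    for i in range(n_cols):
        for j in range(i, n_cols):       # upper triangle only
            both = valid[:, i] & valid[:, j]
            if both.sum() < 2:
                continue                 # too few shared rows: leave NaN
            # Spearman correlation is the Pearson correlation of the ranks.
            ri = rankdata(data[both, i])
            rj = rankdata(data[both, j])
            rho = np.corrcoef(ri, rj)[0, 1]
            dist[i, j] = dist[j, i] = 1.0 - rho
    return dist
```

Called as pairwise_spearman_dist(X), this returns a square array with NaN wherever two columns share fewer than two observed rows. The pair loop is still in Python, so the saving comes from roughly halving the number of pairs and from replacing the per-element membership tests with boolean masks. Whether that is fast enough would depend on the number of columns.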