python - Speeding up a vectorized correlation function in numpy/scipy in Python?

Question

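I wrote a function that computes pairwise correlations between the columns of a matrix (like SciPy's built-in pdist, which lives in scipy.spatial.distance), but which can handle missing values, specified by the parameter na_values. That is: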

from numpy import array

def my_pdist(X, dist_func, na_values=["NA"]):
    X = array(X, dtype=object)
    num_rows, num_cols = X.shape
    dist_matrix = []
    for col1 in range(num_cols):
        pdist_row = []
        for col2 in range(num_cols):
            pairs = array([[x, y] for x, y in zip(X[:, col1], X[:, col2]) \
                           if (x not in na_values) and (y not in na_values)])
            if len(pairs) == 0:
                continue
            dist = dist_func(pairs[:, 0],
                             pairs[:, 1])
            pdist_row.append(dist)
        dist_matrix.append(pdist_row)
    dist_matrix = array(dist_matrix)
    return dist_matrix

where dist_func is a function that specifies the distance metric. Is there a way to speed this function up? An example of its use is:

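import scipy.stats

def spearman_dist(u, v, na_vals=["NA"]):
    # Keep only the pairs in which neither element is a missing-value marker.
    matrix = [[x, y] for x, y in zip(u, v) \
              if (x not in na_vals) and (y not in na_vals)]
    matrix = array(matrix)
    # spearmanr returns (correlation, p-value); only the correlation is needed.
    spearman = scipy.stats.spearmanr(matrix[:, 0], matrix[:, 1])[0]
    return 1 - spearman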

my_pdist(X, spearman_dist, na_values=["NA"])

How can I vectorize this or otherwise make it faster?

score 3 · Accepted Answer

I have a few suggestions:

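Don't use arrays of type "object". This prevents numpy from using any of its built-in optimizations, because it is forced to operate on Python objects rather than on primitive values. If you use float arrays, you can use np.nan in place of 'NA'. For integer arrays, it is probably best to store a mask of good/bad values in a separate array (you could also use masked arrays for this, but I find them a bit unwieldy).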
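As a minimal sketch of that conversion, the helper below turns an object array containing "NA" markers into a float array with np.nan in their place. The name to_float_with_nan is hypothetical (it is not part of numpy or of the question's code), and it assumes every non-missing entry can be passed to float().

```python
import numpy as np

def to_float_with_nan(X, na_values=("NA",)):
    """Hypothetical helper: replace missing-value markers with np.nan."""
    X = np.asarray(X, dtype=object)
    na = set(na_values)
    # One pass in pure Python; this cost is paid once, before any distances
    # are computed, instead of once per pair of columns.
    flat = [np.nan if v in na else float(v) for v in X.ravel()]
    return np.array(flat, dtype=float).reshape(X.shape)
```

Once the data is a float array, missing entries can be located with np.isnan instead of a Python-level membership test.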

I'd bet that this line takes up most of the time:

pairs = array([[x, y] for x, y in zip(X[:, col1], X[:, col2]) \
                   if (x not in na_values) and (y not in na_values)])

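So, assuming X is now a float array with np.nan marking the missing values, you can speed up the inner loop like this: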

x1 = X[:, col1]
x2 = X[:, col2]
# True only where both columns have a non-missing value
mask = ~np.isnan(x1) & ~np.isnan(x2)
if not mask.any():
    continue
dist = dist_func(x1[mask], x2[mask])

Rather than building dist_matrix with list.append, start with an empty array and fill in the elements as you go:

dist_matrix = np.empty((num_cols, num_cols))
for col1 in range(num_cols):
    for col2 in range(num_cols):
        ...
        dist_matrix[col1, col2] = dist

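Because you iterate over range(num_cols) twice, you are actually computing most of the distance values twice. Provided dist_func is symmetric, this can be optimized: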

dist_matrix = np.empty((num_cols, num_cols))
for col1 in range(num_cols):
    for col2 in range(col1, num_cols):
        ...
        dist_matrix[col1, col2] = dist
        dist_matrix[col2, col1] = dist
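Putting the suggestions so far together gives something like the sketch below. The name my_pdist_nan is hypothetical. It assumes X is already a float array with np.nan for missing values and that dist_func is symmetric. One deliberate difference from the fragments above: the result is pre-filled with np.nan rather than created with np.empty, so that column pairs with no shared valid rows end up as NaN instead of uninitialized memory.

```python
import numpy as np

def my_pdist_nan(X, dist_func):
    """Pairwise column distances, ignoring rows where either value is NaN."""
    X = np.asarray(X, dtype=float)
    num_cols = X.shape[1]
    valid = ~np.isnan(X)  # computed once for the whole matrix
    # Assumption: pairs with no overlapping valid rows are reported as NaN.
    dist_matrix = np.full((num_cols, num_cols), np.nan)
    for col1 in range(num_cols):
        # Start at col1: the lower triangle is filled in by symmetry.
        for col2 in range(col1, num_cols):
            mask = valid[:, col1] & valid[:, col2]
            if not mask.any():
                continue
            dist = dist_func(X[mask, col1], X[mask, col2])
            dist_matrix[col1, col2] = dist
            dist_matrix[col2, col1] = dist
    return dist_matrix
```

With this version, a Spearman-based dist_func no longer needs its own missing-value filtering; something like lambda u, v: 1 - scipy.stats.spearmanr(u, v)[0] should be enough.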

It may be possible to do the whole computation without any for loops at all, but that depends on the details of dist_func.
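To illustrate what a loop-free version can look like, here is a sketch for one specific metric: 1 minus the Pearson correlation, computed over the rows where both columns are present. This is a swapped-in example, not the Spearman distance from the question (Spearman would need the ranks recomputed for each pair of columns, which does not reduce to matrix products this way). The name pairwise_pearson_dist is hypothetical.

```python
import numpy as np

def pairwise_pearson_dist(X):
    """1 - Pearson r for every column pair, using pairwise-complete rows."""
    X = np.asarray(X, dtype=float)
    valid = ~np.isnan(X)
    V = valid.astype(float)          # 1.0 where a value is present
    X0 = np.where(valid, X, 0.0)     # missing values contribute nothing
    n = V.T @ V                      # n[i, j]: rows valid in both i and j
    sx = X0.T @ V                    # sum of column i over the shared rows
    sy = sx.T                        # sum of column j over the shared rows
    sxx = (X0 ** 2).T @ V            # sum of squares of column i, shared rows
    syy = sxx.T
    sxy = X0.T @ X0                  # sum of products over the shared rows
    # Pairs with no shared rows or zero variance come out as NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        cov = sxy - sx * sy / n
        var_x = sxx - sx ** 2 / n
        var_y = syy - sy ** 2 / n
        r = cov / np.sqrt(var_x * var_y)
    return 1.0 - r
```

The whole matrix is built from a handful of matrix products, so the per-pair Python overhead disappears. The sums-of-squares formula can lose precision when the values are large relative to their spread, so centering the columns first may be worthwhile in that case.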

score 0 · Accepted Answer


You could try replacing your na_vals with numpy's masked arrays.
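A minimal sketch of how that might look for a single pair of columns, assuming the missing values have already been converted to np.nan. The function name shared_valid is hypothetical; np.ma.masked_invalid and np.ma.getmaskarray are the numpy calls doing the work.

```python
import numpy as np

def shared_valid(x, y):
    """Return the entries of x and y at positions where neither is masked."""
    # masked_invalid masks NaN and infinite entries.
    x = np.ma.masked_invalid(np.asarray(x, dtype=float))
    y = np.ma.masked_invalid(np.asarray(y, dtype=float))
    # getmaskarray always returns a full boolean array, even if nothing is masked.
    keep = ~(np.ma.getmaskarray(x) | np.ma.getmaskarray(y))
    return x.data[keep], y.data[keep]
```

The two returned arrays are plain ndarrays of equal length and can be handed straight to a distance function.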

answered 2012-08-13T08:25:39.777
