python - 如何有效地计算 scipy.csr 稀疏矩阵中非零索引的成对交集？

Question

我有一个 scipy.sparse.csr 矩阵X，它是 nx p。对于X中的每一行，我想计算非零元素索引与X中的每一行的交集，并将它们存储在一个新的张量甚至字典中。例如，X是：

X = [
[0., 1.5, 4.7],
[4., 0., 0.],
[0., 0., 2.6]
]

我希望输出是

intersect = 
[
[[1,2], [], [2]],
[[], [0], []],
[[2], [], [2]]
]

intersect[i,j] 是一个 ndarray，表示 X 的第 i 行和第 j 行的非零元素索引的交集，即 X[i], X[j]。

目前我这样做的方式是循环，我想对其进行矢量化，因为它会更快并且计算是并行完成的。

# current code
n = X.shape[0]
intersection_dict = {}
for i in range(n):
    for j in range(n):
        indices = np.intersect1d(X[i].indices, X[j].indices)
        intersection_dict[(i,j)] = indices

我的 n 非常大，所以循环 n^2 非常差。我只是在想办法矢量化这个操作时遇到了麻烦。有人对如何解决这个问题有任何想法吗？

编辑： 很明显，我应该解释我要解决的问题，所以就在这里。

我正在解决一个优化问题并有一个方程 W = X diag(theta) X'。当我更新 theta 的条目直到收敛时，我想快速找到 W。此外，我正在使用 pytorch 更新参数，其中稀疏操作不像 scipy 那样广泛。

在哪里：

X : n x p sparse data matrix (n documents, p features)
theta : p x 1 parameter vector I want to learn and will be updating
X' : p x n transpose of sparse data matrix

note p >> n

我想到了两种快速解决这个问题的方法

缓存的稀疏外积（请参阅使用对角矩阵进行更有效的矩阵乘法）
W_ij = X_i * theta * X_j（X 的第 i 行，theta 和 X 的第 j 行的元素明智乘积。由于X_i, X_j是稀疏的，我在想如果我取非零索引的交集，那么我可以做一个简单的密集元素乘积（不支持稀疏元素明智乘积在 pytorch 中）的X_i[intersection indices] * theta[intersection indices] X_j[intersection indices]

我想尽可能多地对这种计算进行矢量化，而不是循环，因为我的 n 通常是数千，而 p 是 1100 万。

我正在尝试方法 2 而不是方法 1，因为 Pytorch 缺乏稀疏支持。主要是在更新 theta 的条目时，我不想进行稀疏密集或稀疏稀疏操作。我想做密集的操作。

score 1 · Accepted Answer

您正在查看的优化需要存储p不同的n x n矩阵。如果您确实想尝试一下，我可能会使用 scipy 的 C 扩展中稀疏矩阵中内置的所有功能。

import numpy as np
from scipy import sparse

arr = sparse.random(100,10000, format="csr", density=0.01)

xxt = arr @ arr.T
p_comps = [arr[:, i] @ arr.T[i, :] for i in range(arr.shape[1])]

def calc_weights(xxt, thetas, p_comps):
    xxt = xxt.copy()
    xxt.data = np.zeros(xxt.data.shape, dtype=xxt.dtype)
    for i, t in enumerate(thetas):
        xxt += (p_comps[i] * t)
    return xxt

W = calc_weights(xxt, np.ones(10000), p_comps)

>>>(xxt.A == W.A).all()
True

这真的不太可能在 python 中很好地实现。在 C 中执行此操作可能会更好，或者使用嵌套循环编写一些对元素进行操作并且可以使用 numba 编译 JIT 的东西。

score 1 · Accepted Answer

第一个简单的解决方案是注意输出矩阵是对称的：

n = X.shape[0]
intersection_dict = {}
for i in range(n):
    for j in range(i,n): #note the edit here
        indices = np.intersect1d(X[i].indices, X[j].indices)
        intersection_dict[(i,j)] = indices

这将使您的计算量减少不到 2 倍

python - 如何有效地计算 scipy.csr 稀疏矩阵中非零索引的成对交集？

2 回答 2

Related

Reference