python - 切片稀疏（scipy）矩阵

Question

我将不胜感激任何帮助，以了解从 scipy.sparse 包中切片 lil_matrix (A) 时的以下行为。

实际上，我想根据行和列的任意索引列表提取子矩阵。

当我使用这两行代码时：

x1 = A[list 1,:]
x2 = x1[:,list 2]

一切都很好，我可以提取正确的子矩阵。

当我尝试在一行中执行此操作时，它失败了（返回矩阵为空）

x=A[list 1,list 2]

为什么会这样？总的来说，我在 matlab 中使用了一个类似的命令，并且它在那里工作。那么，既然它有效，为什么不使用第一个呢？这似乎非常耗时。由于我必须浏览大量条目，因此我想使用单个命令加快速度。也许我使用了错误的稀疏矩阵类型......知道吗？

score 15 · Accepted Answer

您已经使用的方法，

A[list1, :][:, list2]

似乎是从备用矩阵中选择所需值的最快方法。请参阅下面的基准。

但是，要回答有关如何A 使用单个索引从任意行和列中选择值的问题，您需要使用所谓的“高级索引”：

A[np.array(list1)[:,np.newaxis], np.array(list2)]

使用高级索引，if arr1and arr2are NDarrays，equals的(i,j)组成部分A[arr1, arr2]

A[arr1[i,j], arr2[i,j]]

因此，您会希望对所有人都arr1[i,j]平等，并且对所有人都平等。list1[i]jarr2[i,j]list2[j]i

这可以通过设置 , 和借助广播（见下文）来安排。arr1 = np.array(list1)[:,np.newaxis]arr2 = np.array(list2)

的形状是 arr1广播的(len(list1), 1)形状，因为新的轴会在需要时自动添加到左侧。arr2(len(list2), )(1, len(list2))

每个数组都可以进一步广播成形(len(list1),len(list2))。这正是我们想要 A[arr1[i,j],arr2[i,j]]理解的，因为我们想要(i,j)遍历所有可能的索引以获取 shape 的结果数组(len(list1),len(list2))。

这是一个测试用例的微基准，表明这A[list1, :][:, list2]是最快的选择：

In [32]: %timeit orig(A, list1, list2)
10 loops, best of 3: 110 ms per loop

In [34]: %timeit using_listener(A, list1, list2)
1 loop, best of 3: 1.29 s per loop

In [33]: %timeit using_advanced_indexing(A, list1, list2)
1 loop, best of 3: 1.8 s per loop

这是我用于基准测试的设置：

import numpy as np
import scipy.sparse as sparse
import random
random.seed(1)

def setup(N):
    A = sparse.rand(N, N, .1, format='lil')
    list1 = np.random.choice(N, size=N//10, replace=False).tolist()
    list2 = np.random.choice(N, size=N//20, replace=False).tolist()
    return A, list1, list2

def orig(A, list1, list2):
    return A[list1, :][:, list2]

def using_advanced_indexing(A, list1, list2):
    B = A.tocsc()  # or `.tocsr()`
    B = B[np.array(list1)[:, np.newaxis], np.array(list2)]
    return B

def using_listener(A, list1, list2):
    """https://stackoverflow.com/a/26592783/190597 (listener)"""
    B = A.tocsr()[list1, :].tocsc()[:, list2]
    return B

N = 10000
A, list1, list2 = setup(N)
B = orig(A, list1, list2)
C = using_advanced_indexing(A, list1, list2)
D = using_listener(A, list1, list2)
assert np.allclose(B.toarray(), C.toarray())
assert np.allclose(B.toarray(), D.toarray())

score 4 · Accepted Answer

对我来说，来自 unutbu 的解决方案效果很好，但速度很慢。

我发现作为一种快速的替代方案，

A = B.tocsr()[np.array(list1),:].tocsc()[:,np.array(list2)]

你可以看到 row'S 和 col's 被分别切割，但每个都转换为最快的稀疏格式，以便这次获取索引。

在我的测试环境中，这段代码比其他代码快 1000 倍。

我希望，我不会说错话或犯错误。

score 2 · Accepted Answer

像 in 一样的同时索引B[arr1, arr2]确实有效，它比我机器上的侦听器解决方案更快。请参阅In [5]下面的 Jupyter 示例。要将其与上述答案进行比较，请参阅In [6]。此外，我的解决方案不需要.tocsc()转换，使其更具可读性 IMO。

请注意，B[arr1, arr2]为了工作，arr1必须arr2是可广播的numpy 数组。

然而，一个更快的解决方案B[list1][:, list2]是使用unutbu指出的。见In [7]下文。

In [1]: from scipy import sparse
      : import numpy as np
      : 
      : 

In [2]: B = sparse.rand(1000, 1000, .1, format='lil')
      : list1=[1,4,6,8]
      : list2=[2,4]
      : 
      : 

In [3]: arr1 = np.array(list1)[:, None]  # make arr1 a (n x 1)-array
      : arr1
      : 
      : 
Out[3]: 
array([[1],
       [4],
       [6],
       [8]])

In [4]: arr2 = np.array(list2)[None, :]  # make arr2 a (1 x m)-array
      : arr2
      : 
      : 
Out[4]: array([[2, 4]])

In [5]: %timeit A = B.tocsr()[arr1, arr2]
100 loops, best of 3: 13.1 ms per loop

In [6]: %timeit A = B.tocsr()[np.array(list1),:].tocsc()[:,np.array(list2)]
100 loops, best of 3: 14.6 ms per loop

In [7]: %timeit B[list1][:, list2]
1000 loops, best of 3: 205 µs per loop

score -1 · Accepted Answer

slicing happens with this syntax :

a[1:4]

for a = array([1,2,3,4,5,6,7,8,9]), the result is

array([2, 3, 4])

The first parameter of the tuple indicates the first value to be retained, and the second parameter indicates the first value not to be retained.

If you use lists on both sides, it means that your array has as many dimensions as the lists length.

So, with your syntax, you will probably need something like this :

x = A[list1,:,list2]

depending on the shape of A.

Hope it did help you.

python - 切片稀疏（scipy）矩阵

4 回答 4

Related

Reference