python - Load sparse array from npy file

Question

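I'm trying to load a sparse array that I saved earlier. Saving the sparse array was easy enough. Reading it back, though, is a pain: scipy.load returns a 0d array wrapped around my sparse array.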

import scipy as sp
A = sp.load("my_array"); A
array(<325729x325729 sparse matrix of type '<type 'numpy.int8'>'
with 1497134 stored elements in Compressed Sparse Row format>, dtype=object)

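To get at the sparse matrix I have to flatten the 0d array, or use sp.asarray(A). This seems like a really hard way to do things. Is Scipy smart enough to understand that it has loaded a sparse array? Is there a better way to load a sparse array?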

score 15 · Accepted Answer

The mmwrite / mmread functions in scipy.io can save/load sparse matrices in the Matrix Market format.

scipy.io.mmwrite('/tmp/my_array',x)
scipy.io.mmread('/tmp/my_array').tolil()

mmwrite/mmread may be all you need. It is well tested and uses a well-known format.
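For readers who want to try the Matrix Market round trip without writing to /tmp, here is a minimal sketch. The temporary directory, the 4x4 test matrix, and the variable names are illustrative assumptions, not part of the original answer.

```python
import os
import tempfile

import numpy as np
import scipy.io
import scipy.sparse as sparse

# Small hypothetical test matrix in LIL format.
x = sparse.lil_matrix((4, 4))
x[0, 1] = 3.0
x[2, 3] = 1.5

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "my_array")
    # Write the matrix in Matrix Market format.
    scipy.io.mmwrite(target, x)
    # Read it back; the result is sparse but not necessarily in LIL format,
    # which is why the answer calls .tolil() afterwards.
    y = scipy.io.mmread(target)

y_lil = y.tolil()
print(np.array_equal(x.toarray(), y_lil.toarray()))
```

The explicit .tolil() matters only if the calling code depends on the LIL format.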

However, the following may be a bit faster:

We can save the row and column coordinates and the data as 1-d arrays in npz format.

import random
import scipy.sparse as sparse
import scipy.io
import numpy as np

def save_sparse_matrix(filename,x):
    x_coo=x.tocoo()
    row=x_coo.row
    col=x_coo.col
    data=x_coo.data
    shape=x_coo.shape
    np.savez(filename,row=row,col=col,data=data,shape=shape)

def load_sparse_matrix(filename):
    y=np.load(filename)
    z=sparse.coo_matrix((y['data'],(y['row'],y['col'])),shape=y['shape'])
    return z

N=20000
x = sparse.lil_matrix( (N,N) )
for i in range(N):
    x[random.randint(0,N-1),random.randint(0,N-1)]=random.randint(1,100)

save_sparse_matrix('/tmp/my_array',x)
load_sparse_matrix('/tmp/my_array.npz').tolil()

Here is some code which suggests that saving the sparse matrix in an npz file may be quicker than using mmwrite/mmread:

def using_np_savez():    
    save_sparse_matrix('/tmp/my_array',x)
    return load_sparse_matrix('/tmp/my_array.npz').tolil()

def using_mm():
    scipy.io.mmwrite('/tmp/my_array',x)
    return scipy.io.mmread('/tmp/my_array').tolil()    

if __name__=='__main__':
    for func in (using_np_savez,using_mm):
        y=func()
        print(repr(y))
        assert(x.shape==y.shape)
        assert(x.dtype==y.dtype)
        assert(x.__class__==y.__class__)    
        assert(np.allclose(x.todense(),y.todense()))

yields

% python -mtimeit -s'import test' 'test.using_mm()'
10 loops, best of 3: 380 msec per loop

% python -mtimeit -s'import test' 'test.using_np_savez()'
10 loops, best of 3: 116 msec per loop
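Later SciPy releases also provide scipy.sparse.save_npz and scipy.sparse.load_npz, which store a sparse matrix in an .npz file in much the same spirit as the hand-written helpers above. The sketch below assumes SciPy 0.19 or later. It is not part of the original answer, and it has not been timed against the two functions benchmarked above. The random test matrix and the temporary path are illustrative.

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sparse

# Hypothetical test matrix: 200x200, about 1% non-zero, CSR format.
x = sparse.random(200, 200, density=0.01, format="csr", random_state=0)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my_array.npz")
    # save_npz stores the underlying index/data arrays plus the format name.
    sparse.save_npz(path, x)
    # load_npz rebuilds a sparse matrix in the format that was saved.
    y = sparse.load_npz(path)

print(y.format, y.shape)
```

Unlike the COO-based helpers, this route should hand back the matrix in its saved format, so no conversion call is needed after loading.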

score 6 · Answer

The object hidden in the 0d array can be extracted by using () as the index:

A = sp.load("my_array")[()]

It looks odd, but it seems to work reliably, and it is a very short workaround.
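A self-contained version of this workaround might look like the sketch below. It uses np.save and np.load directly, and it assumes a NumPy version in which np.load needs allow_pickle=True to read object arrays. The helper name roundtrip_with_np_save and the small test matrix are hypothetical.

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sparse


def roundtrip_with_np_save(matrix):
    """Save a sparse matrix with np.save, then recover it with the [()] index."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "my_array.npy")
        # The sparse matrix is not an ndarray, so it is stored as a pickled object.
        np.save(path, matrix)
        # Assumption: recent NumPy refuses pickled data unless allow_pickle=True.
        wrapped = np.load(path, allow_pickle=True)
    # Indexing a 0-d array with an empty tuple returns the single element it holds.
    return wrapped, wrapped[()]


x = sparse.csr_matrix(np.arange(10).reshape(2, 5))
wrapped, recovered = roundtrip_with_np_save(x)
print(wrapped.shape, wrapped.dtype)
print(type(recovered).__name__)
```

Calling wrapped.item() is an equivalent way to pull the single object out of the 0-d array. Because the file contains pickled data, it should only be loaded from trusted sources.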

score 1 · Answer

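With all the upvotes for the mmwrite answer, I'm surprised no one has tried to answer the actual question. But since the question has been reactivated, I'll give it a try.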

This reproduces the OP's case:

In [90]: x=sparse.csr_matrix(np.arange(10).reshape(2,5))
In [91]: np.save('save_sparse.npy',x)
In [92]: X=np.load('save_sparse.npy')
In [95]: X
Out[95]: 
array(<2x5 sparse matrix of type '<type 'numpy.int32'>'
    with 9 stored elements in Compressed Sparse Row format>, dtype=object)
In [96]: X[()].A
Out[96]: 
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

In [93]: X[()].A
Out[93]: 
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])
In [94]: x
Out[94]: 
<2x5 sparse matrix of type '<type 'numpy.int32'>'
    with 9 stored elements in Compressed Sparse Row format>

The [()] that user4713166 gave us is not a "hard way" to extract the sparse array.

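np.save and np.load are designed to operate on ndarrays. But a sparse matrix is not such an array, nor is it a subclass (as np.matrix is). It appears that np.save wraps the non-array object in an object dtype array, and saves that array along with a pickled form of the object.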
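The wrapping step can be observed in memory, without writing a file. This is a minimal sketch, assuming np.save converts its argument roughly the way np.asanyarray does. The variable names are illustrative.

```python
import numpy as np
import scipy.sparse as sparse

x = sparse.csr_matrix(np.arange(10).reshape(2, 5))

# A sparse matrix is neither an ndarray nor an ndarray subclass.
print(isinstance(x, np.ndarray))

# Converting it yields a 0-d object array that holds the matrix as its only element.
wrapped = np.asanyarray(x)
print(wrapped.shape, wrapped.dtype)

# The empty-tuple index hands back the very same Python object.
print(wrapped[()] is x)
```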

When I try to save a different kind of object, one that can't be pickled, I get an error message at:

403  # We contain Python objects so we cannot write out the data directly.
404  # Instead, we will pickle it out with version 2 of the pickle protocol.

--> 405  pickle.dump(array, fp, protocol=2)

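So, in answer to "Is Scipy smart enough to understand that it has loaded a sparse array?": no. np.load does not know about sparse arrays. But np.save is smart enough to punt when given something that isn't an array, and np.load does what it can with what it finds in the file.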

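As for alternative methods of saving and loading sparse arrays, io.savemat, the MATLAB-compatible method, has been mentioned. It would be my first choice. But this example also shows that you can use regular Python pickling. That might be better if you need to preserve a particular sparse format. And np.save isn't bad if you can live with the [()] extraction step. :)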
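The plain pickling route could be sketched as follows. The in-memory round trip with pickle.dumps and pickle.loads and the small LIL test matrix are assumptions made for illustration. Writing to a file with pickle.dump and pickle.load works the same way.

```python
import pickle

import numpy as np
import scipy.sparse as sparse

# Hypothetical test matrix kept in LIL format.
x = sparse.lil_matrix((3, 4))
x[0, 1] = 5
x[2, 3] = 7

# Serialize and restore the matrix object itself.
payload = pickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)
y = pickle.loads(payload)

# The restored object keeps the sparse format it was pickled with.
print(y.format, y.shape)
```

As with any pickle, only load data that comes from a trusted source.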

https://github.com/scipy/scipy/blob/master/scipy/io/matlab/mio5.py write_sparse: sparse matrices are saved in csc format. Along with the headers, it saves A.indices.astype('i4'), A.indptr.astype('i4'), A.data.real, and optionally A.data.imag.
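A round trip through the MATLAB format might look like the sketch below. The variable name "A", the temporary file, and the float test data are illustrative assumptions. The sparse format of the object returned by loadmat can differ from the one that was saved, so the sketch compares dense contents only.

```python
import os
import tempfile

import numpy as np
import scipy.io
import scipy.sparse as sparse

# Float data is used here as an assumption; MATLAB sparse matrices hold doubles.
x = sparse.csr_matrix(np.arange(10, dtype=float).reshape(2, 5))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my_array.mat")
    # savemat takes a dict mapping variable names to values.
    scipy.io.savemat(path, {"A": x})
    contents = scipy.io.loadmat(path)

# loadmat returns a dict; the sparse variable comes back as a SciPy sparse object.
y = contents["A"]
print(sparse.issparse(y), y.shape)
```

If a specific format is needed after loading, convert explicitly, for example with y.tocsr().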

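In quick tests I find that np.save/load handles all sparse formats except dok, where the load complains about a missing shape. Otherwise I'm not finding any special pickling code in the sparse files.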
