python - Saving a large SciPy sparse matrix

Question

I am trying to cPickle a large scipy sparse matrix for later use. I get this error:

  File "tfidf_scikit.py", line 44, in <module>
    pickle.dump([trainID, trainX, trainY], fout, protocol=-1)
SystemError: error return without exception set

trainX is the large sparse matrix; the other two are lists about 6 million elements long.

In [1]: trainX
Out[1]:
<6034195x755258 sparse matrix of type '<type 'numpy.float64'>'
    with 286674296 stored elements in Compressed Sparse Row format>

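At this point, Python is using 4.6 GB of RAM, and my laptop has 16 GB of RAM.

I think I have run into a known cPickle memory bug where it fails on objects that are too large. I also tried marshal, but I don't think it works with scipy matrices. Can someone offer a solution, preferably with an example of how to save and load the matrix?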

Python 2.7.5

OS X 10.9

Thank you.

score 1 · Accepted Answer

I had this problem with a multi-gigabyte Numpy matrix (Ubuntu 12.04 with Python 2.7.3 - seems to be this issue: https://github.com/numpy/numpy/issues/2396 ).

I solved it using numpy.savetxt() / numpy.loadtxt(). The output is gzip-compressed if you give the filename a .gz extension when saving.
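As a rough sketch of that approach (the tiny array and the file name below are made up for illustration, and this applies to dense NumPy arrays only, not to scipy sparse matrices):

```python
import os
import tempfile

import numpy as np

# Small stand-in for the real multi-gigabyte dense matrix (illustrative only).
matrix = np.arange(12, dtype=np.float64).reshape(3, 4)

# Hypothetical file name; the ".gz" suffix is what triggers compression.
path = os.path.join(tempfile.mkdtemp(), "matrix.txt.gz")

# savetxt writes gzip-compressed text when the file name ends in ".gz".
np.savetxt(path, matrix)

# loadtxt decompresses transparently based on the same suffix.
restored = np.loadtxt(path)
```

Text output is much larger and slower to parse than a binary format, so this is probably only practical when the matrix fits comfortably in memory as a dense array.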

Since I too had just a single matrix I did not investigate the use of HDF5.

score 0 · Accepted Answer

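On Python 2.7, neither numpy.savetxt (which only works for arrays, not sparse matrices) nor sklearn.externals.joblib.dump (which pickles, is painfully slow, and uses a lot of memory) worked for me.

Instead, I used scipy.sparse.save_npz, and it worked well. Keep in mind that it only works for csc, csr, bsr, dia, or coo matrices.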
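A minimal sketch of saving and loading with save_npz / load_npz, assuming a SciPy version that provides them (0.19 or later). The small matrix and the file name are placeholders standing in for trainX:

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp

# Tiny CSR matrix standing in for the real trainX (illustrative values only).
rows = np.array([0, 1, 3])
cols = np.array([2, 0, 4])
vals = np.array([1.5, 2.0, 3.25])
train_x = sp.csr_matrix((vals, (rows, cols)), shape=(4, 5))

# Hypothetical file name in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "train_x.npz")

# Writes the matrix's underlying arrays to a (by default compressed) .npz file.
sp.save_npz(path, train_x)

# Reads the file back into a sparse matrix of the same format.
loaded = sp.load_npz(path)
```

The two accompanying lists from the question (trainID and trainY) are not covered by save_npz; one option, offered here as an assumption rather than something from the answer, would be to store them separately, for example as NumPy arrays with numpy.save.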
