
I am planning to use HDF to store a very large matrix, something like 1e6 x 1e6 of floats.

I would need to read the matrix in batches of consecutive rows or columns.

My question is, what would be the optimal way to structure/tweak the HDF file to maximize speed?

Some points:

  • I have estimated that reading/writing the full matrix uncompressed in HDF would take ~5 hours on my system. This is reasonable, but it is not reasonable to store the matrix uncompressed, since it will be several terabytes in size.

  • If the matrix is sparse, could compression cause reading speed to be comparable or even faster than reading an uncompressed dense matrix?

  • Breaking the matrix into separate submatrix datasets would be annoying, since it would complicate reading a row/column from the original matrix or doing things like matrix multiplication. So I would like to avoid this if possible (unless this gives a major speed advantage).

  • After reading the matrix once, I plan to read it many times. So read/decompression speed is more important than write/compression speed.

  • I am using Python h5py to interface with the HDF file (see the sketch after this list for the kind of access pattern I mean).
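For reference, a minimal sketch of the access pattern I have in mind. The sizes, chunk shape, and compression settings here are toy placeholders, not what I have settled on:

    import numpy as np
    import h5py

    # Toy dimensions standing in for the real 1e6 x 1e6 matrix.
    n_rows, n_cols, batch = 4096, 4096, 256

    with h5py.File("matrix.h5", "w") as f:
        dset = f.create_dataset(
            "matrix",
            shape=(n_rows, n_cols),
            dtype="float32",
            chunks=(batch, n_cols),   # chunk along rows so a row batch touches few chunks
            compression="gzip",       # placeholder; this is part of what I'm asking about
        )
        for start in range(0, n_rows, batch):
            dset[start:start + batch, :] = np.random.rand(batch, n_cols)

    # Later: read the matrix back in batches of consecutive rows.
    with h5py.File("matrix.h5", "r") as f:
        dset = f["matrix"]
        for start in range(0, n_rows, batch):
            rows = dset[start:start + batch, :]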


1 Answer


I assume you are already using some sparse representation, e.g. lil_matrix from scipy.sparse.

I see two reasonable options:

1) You can dump the binary content to a file with cPickle.dump; see Python: how to store a sparse matrix using python?
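A minimal sketch of option 1. The file name and matrix contents are placeholders, and on Python 3 cPickle is simply pickle:

    try:
        import cPickle as pickle   # Python 2, as in the original answer
    except ImportError:
        import pickle              # Python 3
    from scipy import sparse
    import numpy as np

    # Toy sparse matrix standing in for the real 1e6 x 1e6 one.
    m = sparse.lil_matrix((1000, 1000), dtype=np.float32)
    m[0, 1] = 2.5
    m[500, 999] = -1.0

    # Option 1: dump the sparse matrix directly to a binary file.
    with open("matrix.pkl", "wb") as f:
        pickle.dump(m, f, protocol=pickle.HIGHEST_PROTOCOL)

    with open("matrix.pkl", "rb") as f:
        m2 = pickle.load(f)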

2) You can dump the content to a string with cPickle.dumps and then store that string in an HDF5 file via h5py.
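A sketch of option 2, again with placeholder names. One way h5py accepts raw bytes is as an opaque scalar via np.void:

    try:
        import cPickle as pickle   # Python 2
    except ImportError:
        import pickle              # Python 3
    import numpy as np
    import h5py
    from scipy import sparse

    m = sparse.lil_matrix((1000, 1000), dtype=np.float32)
    m[0, 1] = 2.5

    # Pickle to a byte string, then store it as an opaque scalar dataset in HDF5.
    blob = pickle.dumps(m, protocol=pickle.HIGHEST_PROTOCOL)
    with h5py.File("matrix.h5", "w") as f:
        f.create_dataset("pickled_matrix", data=np.void(blob))

    # Read the bytes back and unpickle.
    with h5py.File("matrix.h5", "r") as f:
        m2 = pickle.loads(f["pickled_matrix"][()].tobytes())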

In general, handling large amounts of data is expensive. Operations on a lil_matrix, for example, are costly, and how long disk reads/writes take depends on how the data is stored. Storing strings in HDF5 adds essentially no overhead compared with a raw C file (if you turn compression off). I suggest turning compression off, since it will not reduce the size much anyway (the data is already sparse).

answered 2014-04-15 at 14:42