
I am planning to use HDF to store a very large matrix, something like 1e6 x 1e6 of floats.

I would need to read the matrix in batches of consecutive rows or columns.

My question is, what would be the optimal way to structure/tweak the HDF file to maximize speed?

Some points:

  • I have estimated that reading/writing the full matrix uncompressed in HDF would take ~5 hours on my system. This is reasonable, but it is not reasonable to store the matrix uncompressed, since it will be several terabytes in size.

  • If the matrix is sparse, could compression cause reading speed to be comparable or even faster than reading an uncompressed dense matrix?

  • Breaking the matrix into separate submatrix datasets would be annoying, since it would complicate reading a row/column from the original matrix or doing things like matrix multiplication. So I would like to avoid this if possible (unless this gives a major speed advantage).

  • After reading the matrix once, I plan to read it many times. So read/decompression speed is more important than write/compression speed.

  • I am using Python h5py to interface with the HDF file (see the sketch after this list for the kind of access pattern I mean).
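For reference, a minimal sketch of the access pattern I have in mind. The sizes, chunk shape, and compression settings here are toy placeholders, not what I have settled on:

    import numpy as np
    import h5py

    # Toy dimensions standing in for the real 1e6 x 1e6 matrix.
    n_rows, n_cols, batch = 4096, 4096, 256

    with h5py.File("matrix.h5", "w") as f:
        dset = f.create_dataset(
            "matrix",
            shape=(n_rows, n_cols),
            dtype="float32",
            chunks=(batch, n_cols),   # chunk along rows so a row batch touches few chunks
            compression="gzip",       # placeholder; this is part of what I'm asking about
        )
        for start in range(0, n_rows, batch):
            dset[start:start + batch, :] = np.random.rand(batch, n_cols)

    # Later: read the matrix back in batches of consecutive rows.
    with h5py.File("matrix.h5", "r") as f:
        dset = f["matrix"]
        for start in range(0, n_rows, batch):
            rows = dset[start:start + batch, :]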


1 Answer


I assume you are already using some sparse representation, e.g. lil_matrix from scipy.sparse.

I see two reasonable options:

1) You can dump the binary content to a file with cPickle.dump; see Python: how to store a sparse matrix using python?
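A minimal sketch of option 1. The file name and matrix contents are placeholders, and on Python 3 cPickle is simply pickle:

    try:
        import cPickle as pickle   # Python 2, as in the original answer
    except ImportError:
        import pickle              # Python 3
    from scipy import sparse
    import numpy as np

    # Toy sparse matrix standing in for the real 1e6 x 1e6 one.
    m = sparse.lil_matrix((1000, 1000), dtype=np.float32)
    m[0, 1] = 2.5
    m[500, 999] = -1.0

    # Option 1: dump the sparse matrix directly to a binary file.
    with open("matrix.pkl", "wb") as f:
        pickle.dump(m, f, protocol=pickle.HIGHEST_PROTOCOL)

    with open("matrix.pkl", "rb") as f:
        m2 = pickle.load(f)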

2) You can dump the content to a string with cPickle.dumps and then store that string in an HDF5 file via h5py.
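A sketch of option 2, again with placeholder names. One way h5py accepts raw bytes is as an opaque scalar via np.void:

    try:
        import cPickle as pickle   # Python 2
    except ImportError:
        import pickle              # Python 3
    import numpy as np
    import h5py
    from scipy import sparse

    m = sparse.lil_matrix((1000, 1000), dtype=np.float32)
    m[0, 1] = 2.5

    # Pickle to a byte string, then store it as an opaque scalar dataset in HDF5.
    blob = pickle.dumps(m, protocol=pickle.HIGHEST_PROTOCOL)
    with h5py.File("matrix.h5", "w") as f:
        f.create_dataset("pickled_matrix", data=np.void(blob))

    # Read the bytes back and unpickle.
    with h5py.File("matrix.h5", "r") as f:
        m2 = pickle.loads(f["pickled_matrix"][()].tobytes())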

In general, handling large amounts of data is expensive. Operations on a lil_matrix, for example, are costly, and how long disk reads/writes take depends on how the data is stored. Storing strings in HDF5 adds essentially no overhead compared with a raw C file (if you turn compression off). I suggest turning compression off, since it will not reduce the size much anyway (the data is already sparse).

answered 2014-04-15 at 14:42