I am planning to use HDF5 to store a very large matrix of floats, on the order of 1e6 x 1e6.
I would need to read the matrix in batches of consecutive rows or columns.
My question is, what would be the optimal way to structure/tweak the HDF file to maximize speed?
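To make the question concrete, this is roughly the setup I have in mind; the chunk shape below is just a placeholder guess on my part, and choosing it well is part of what I'm asking about:

```python
import h5py
import numpy as np

N = 1_000_000  # the matrix is N x N

with h5py.File("matrix.h5", "w") as f:
    # Square-ish chunks as a compromise between row-batch and column-batch
    # reads; each (1024, 1024) float32 chunk is 4 MiB. This shape is a guess,
    # not a tuned value.
    dset = f.create_dataset("matrix", shape=(N, N), dtype="float32",
                            chunks=(1024, 1024))

    # Writes would then go in chunk-aligned blocks; one block is written here
    # purely as an illustration.
    dset[0:1024, 0:1024] = np.random.default_rng(0).random(
        (1024, 1024), dtype=np.float32)
```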
Some points:
I have estimated that reading or writing the full matrix uncompressed via HDF5 would take ~5 hours on my system. That time is reasonable, but storing the matrix uncompressed is not, since at 1e12 entries it would be about 4 TB as float32 (8 TB as float64).
If the matrix is sparse, could compression make reads as fast as, or even faster than, reads of the uncompressed dense matrix?
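In case it matters, enabling compression in h5py would look something like the snippet below; whether gzip plus the shuffle filter or lzf is the right trade-off for a mostly-zero matrix is exactly what I'm unsure about:

```python
import h5py

N = 1_000_000

with h5py.File("matrix_compressed.h5", "w") as f:
    # gzip level 4 with the byte-shuffle filter; mostly-zero chunks should
    # compress very well. compression="lzf" would trade compression ratio
    # for faster decompression -- I don't know which wins for my access
    # pattern.
    dset = f.create_dataset("matrix", shape=(N, N), dtype="float32",
                            chunks=(1024, 1024),
                            compression="gzip", compression_opts=4,
                            shuffle=True)
```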
Breaking the matrix into separate submatrix datasets would be annoying, since it would complicate reading a row or column of the original matrix and doing things like matrix multiplication. So I would like to avoid that if possible (unless it gives a major speed advantage).
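With a single dataset, a batch of consecutive rows or columns is just a slice, which is the interface I would like to keep:

```python
import h5py

with h5py.File("matrix.h5", "r") as f:
    dset = f["matrix"]
    # Each of these blocks is ~4 GB of float32 for a 1e6-wide matrix.
    row_batch = dset[5000:6000, :]   # 1000 consecutive rows
    col_batch = dset[:, 5000:6000]   # 1000 consecutive columns
```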
I will create the matrix once and then read it many times, so read/decompression speed is more important to me than write/compression speed.
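To compare settings, I was planning to time repeated row-batch reads with something like the sketch below (the file name, dataset name, and batch size are placeholders):

```python
import time
import h5py

def row_read_throughput(path, name="matrix", batch=1024, n_batches=8):
    """Rough MB/s estimate for reading batches of consecutive rows."""
    with h5py.File(path, "r") as f:
        dset = f[name]
        t0 = time.perf_counter()
        nbytes = 0
        for k in range(n_batches):
            block = dset[k * batch:(k + 1) * batch, :]
            nbytes += block.nbytes
        elapsed = time.perf_counter() - t0
    return nbytes / elapsed / 1e6

# e.g. compare the uncompressed file against a compressed one:
# print(row_read_throughput("matrix.h5"),
#       row_read_throughput("matrix_compressed.h5"))
```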
I am using h5py from Python to interface with the HDF5 files.