
Question

Given a large series of DataFrames with a small variety of dtypes, what is the optimal design for Pandas DataFrame persistence/serialization if I care about compression ratio first, decompression speed second, and initial compression speed third?

Background:

I have roughly 200k dataframes of shape [2900,8] that I need to store in logical blocks of ~50 data frames per file. The data frame contains variables of type np.int8, np.float64. Most data frames are good candidates for sparse types, but sparse is not supported in HDF 'table' format stores (not that it would even help - see the size below for a sparse gzipped pickle). Data is generated daily and currently adds up to over 20GB. While I'm not bound to HDF, I have yet to find a better solution that allows for reads on individual dataframes within the persistent store, combined with top quality compression. Again, I'm willing to sacrifice a little speed for better compression ratios, especially since I will need to be sending this all over the wire.
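For clarity, the access pattern I need looks roughly like the sketch below; the file name and keys are placeholders that match the example further down, and any replacement format would have to support this kind of per-frame read.

# Minimal sketch of the access pattern I need (file name and keys are
# placeholders matching the example further down): ~50 frames per HDF file,
# each stored under its own key, read back individually.
import pandas as pd

one_df = pd.read_hdf('test_table_blosc-9.hdf', 'id_17')

# or, with the store held open while pulling several keys:
store = pd.HDFStore('test_table_blosc-9.hdf', mode='r')
df_a = store['id_17']
df_b = store['id_42']
store.close()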

There are a couple of other SO threads and other links that might be relevant for those in a similar position. However, most of what I've found doesn't focus on minimizing storage size as a priority:

“Large data” work flows using pandas

HDF5 and SQLite. Concurrency, compression & I/O performance [closed]

Environment:

OSX 10.9.5
Pandas 0.14.1
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
PyTables version:  3.1.1
HDF5 version:      1.8.13
NumPy version:     1.8.1
Numexpr version:   2.4 (not using Intel's VML/MKL)
Zlib version:      1.2.5 (in Python interpreter)
LZO version:       2.06 (Aug 12 2011)
BZIP2 version:     1.0.6 (6-Sept-2010)
Blosc version:     1.3.5 (2014-03-22)
Blosc compressors: ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib']
Cython version:    0.20.2
Python version:    2.7.8 (default, Jul  2 2014, 10:14:46)
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)]
Platform:          Darwin-13.4.0-x86_64-i386-64bit
Byte-ordering:     little
Detected cores:    8
Default encoding:  ascii
Default locale:    (en_US, UTF-8)
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Example:

import pandas as pd
import numpy as np
import random
import cPickle as pickle
import gzip


def generate_data():
    alldfs = {}
    n = 2800
    m = 8
    loops = 50
    idx = pd.date_range('1/1/1980',periods=n,freq='D')
    for x in xrange(loops):
        id = "id_%s" % x
        df = pd.DataFrame(np.random.randn(n,m) * 100,index=idx)
        # adjust data a bit..
        df.ix[:,0] = 0
        df.ix[:,1] = 0
        for y in xrange(100):
            i = random.randrange(n-1)
            j = random.randrange(n-1)
            df.ix[i,0] = 1
            df.ix[j,1] = 1
        df[0] = df[0].astype(np.int8)  # adjust dtype: replace the column so int8 actually sticks
        df[1] = df[1].astype(np.int8)
        alldfs[id] = df
    return alldfs

def store_all_hdf(x,format='table',complevel=9,complib='blosc'):
    fn = "test_%s_%s-%s.hdf" % (format,complib,complevel)
    hdfs = pd.HDFStore(fn,mode='w',format=format,complevel=complevel,complib=complib)
    for key in x.keys():
        df = x[key]
        hdfs.put(key,df,format=format,append=False)
    hdfs.close()

alldfs = generate_data()
for format in ['table','fixed']:
    for complib in ['blosc','zlib','bzip2','lzo',None]:
        store_all_hdf(alldfs,format=format,complib=complib,complevel=9)

# pickle, for comparison
with open('test_pickle.pkl','wb') as f:
    pickle.dump(alldfs,f)

with gzip.open('test_pickle_gzip.pklz','wb') as f:
    pickle.dump(alldfs,f)

with gzip.open('test_pickle_gzip_sparse.pklz','wb') as f:
    sparsedfs = {}
    for key in alldfs.keys():
        sdf = alldfs[key].to_sparse(fill_value=0)
        sparsedfs[key] = sdf
    pickle.dump(sparsedfs,f)

Results

-rw-r--r--   1 bazel  staff  10292760 Oct 17 14:31 test_fixed_None-9.hdf
-rw-r--r--   1 bazel  staff   9531607 Oct 17 14:31 test_fixed_blosc-9.hdf
-rw-r--r--   1 bazel  staff   7867786 Oct 17 14:31 test_fixed_bzip2-9.hdf
-rw-r--r--   1 bazel  staff   9506483 Oct 17 14:31 test_fixed_lzo-9.hdf
-rw-r--r--   1 bazel  staff   8036845 Oct 17 14:31 test_fixed_zlib-9.hdf
-rw-r--r--   1 bazel  staff  26627915 Oct 17 14:31 test_pickle.pkl
-rw-r--r--   1 bazel  staff   8752370 Oct 17 14:32 test_pickle_gzip.pklz
-rw-r--r--   1 bazel  staff   8407704 Oct 17 14:32 test_pickle_gzip_sparse.pklz
-rw-r--r--   1 bazel  staff  14464924 Oct 17 14:31 test_table_None-9.hdf
-rw-r--r--   1 bazel  staff   8619016 Oct 17 14:31 test_table_blosc-9.hdf
-rw-r--r--   1 bazel  staff   8154716 Oct 17 14:31 test_table_bzip2-9.hdf
-rw-r--r--   1 bazel  staff   8481631 Oct 17 14:31 test_table_lzo-9.hdf
-rw-r--r--   1 bazel  staff   8047125 Oct 17 14:31 test_table_zlib-9.hdf

Given the results above, the best 'compression-first' solution appears to be storing the data in HDF fixed format with bzip2. Is there a better way of organising the data, perhaps without HDF, that would allow me to save even more space?
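To make the question concrete, a layout along the lines of the sketch below (placeholder names, not benchmarked) would also be acceptable: appending a whole block of frames into a single table with an 'id' data column, so that individual frames remain selectable while the columns become longer and possibly more compressible.

# Sketch only (not benchmarked): one table per block of frames, with an 'id'
# data column so a single frame can still be selected without reading the rest.
import pandas as pd

def store_block_as_single_table(alldfs, fn='test_block_single_table.hdf'):
    store = pd.HDFStore(fn, mode='w', complevel=9, complib='blosc')
    for key in alldfs.keys():
        tmp = alldfs[key].copy()
        tmp.columns = ['c%d' % c for c in range(tmp.shape[1])]  # string column names for the table store
        tmp['id'] = key
        # min_itemsize keeps later, longer ids from overflowing the string column
        store.append('block', tmp, data_columns=['id'], min_itemsize={'id': 16}, index=False)
    store.close()

# reading one frame back out of the block table:
# one_df = pd.read_hdf('test_block_single_table.hdf', 'block', where="id == 'id_17'")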

Update 1

Per Jeff's comment below, I ran ptrepack on the table-format store that was written without initial compression, and then recompressed it with various codecs. Results are below:

-rw-r--r--   1 bazel  staff   8627220 Oct 18 08:40 test_table_repack-blocsc-9.hdf
-rw-r--r--   1 bazel  staff   8627620 Oct 18 09:07 test_table_repack-blocsc-blosclz-9.hdf
-rw-r--r--   1 bazel  staff   8409221 Oct 18 08:41 test_table_repack-blocsc-lz4-9.hdf
-rw-r--r--   1 bazel  staff   8104142 Oct 18 08:42 test_table_repack-blocsc-lz4hc-9.hdf
-rw-r--r--   1 bazel  staff  14475444 Oct 18 09:05 test_table_repack-blocsc-snappy-9.hdf
-rw-r--r--   1 bazel  staff   8059586 Oct 18 08:43 test_table_repack-blocsc-zlib-9.hdf
-rw-r--r--   1 bazel  staff   8161985 Oct 18 09:08 test_table_repack-bzip2-9.hdf

Oddly, recompressing with ptrepack appears to slightly increase total file size (at least in this case, using table format with comparable compressors).
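For reference, the recompression pass is driven by PyTables' ptrepack command-line tool, roughly as sketched below; the source is the table store written without compression from the example above, the output names are just placeholders, and blosc sub-compressors are selected via the complib string.

# Rough sketch of the repack step (output names are placeholders). Equivalent
# shell form: ptrepack --complevel=9 --complib=blosc:lz4hc src.hdf dst.hdf
import subprocess

src = 'test_table_None-9.hdf'  # the table store written without compression
for complib in ['blosc', 'blosc:blosclz', 'blosc:lz4', 'blosc:lz4hc',
                'blosc:snappy', 'blosc:zlib', 'bzip2']:
    dst = 'test_table_repack-%s-9.hdf' % complib.replace(':', '-')
    subprocess.check_call(['ptrepack', '--complevel=9',
                           '--complib=%s' % complib, src, dst])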
