As a test, I'm trying to read a small 25 MB csv file in chunks and write it to a pandas.HDFStore:
    import pandas as pd

    store = pd.HDFStore('file.h5', mode='w')
    for chunk in pd.read_csv('file.csv', chunksize=50000):
        store.append('df', chunk)
    store.close()
It causes my computer to thrash, and when it finally completes, file.h5 is 6.7 GB. I don't know what is causing the file size to balloon: when I look at the store afterwards, the only thing in there is the small dataframe. If I read the csv in without chunking and then add it to the store, I have no problems.
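For reference, the non-chunked version that works fine looks roughly like this (same file and key as above):

    import pandas as pd

    # Reading the whole csv at once and appending it in one shot
    # does not blow up the file size:
    df = pd.read_csv('file.csv')
    store = pd.HDFStore('file.h5', mode='w')
    store.append('df', df)
    store.close()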
Update 1: I'm running Anaconda with Python 2.7.6, HDF5 1.8.9, NumPy 1.8.0, PyTables 3.1.0, pandas 0.13.1, on Ubuntu 12.04. The data is proprietary, so I can't post the chunk contents online. I do have some mixed-type columns. It still crashes if I try to read everything in as dtype object.
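To be concrete, by "read everything in as object" I mean forcing the dtype at read time, roughly:

    import pandas as pd

    store = pd.HDFStore('file.h5', mode='w')
    # Force every column to be read as Python objects/strings;
    # the thrashing still happens with this variant:
    for chunk in pd.read_csv('file.csv', chunksize=50000, dtype=object):
        store.append('df', chunk)
    store.close()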
Update 2: I dropped all the columns with mixed types and I'm still getting the same issue. I have some very large text columns, if that makes any difference.
Update 3: The problem seems to be loading the dataframe into the HDFStore. I drastically reduced the size of my file but kept one of my very wide columns (1259 characters). Whereas the csv file is 878.6 KB, the HDFStore is 53 MB. Is PyTables unable to handle very wide columns? Is there a threshold above which I should truncate?
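Since PyTables stores strings in table format as fixed-width fields, I assume the relevant knob is something like min_itemsize on append. A minimal sketch of what I mean (the column name 'wide_col' is just a placeholder for my wide text column):

    import pandas as pd

    store = pd.HDFStore('file.h5', mode='w')
    for chunk in pd.read_csv('file.csv', chunksize=50000):
        # Reserve a fixed string width for the wide column up front,
        # so every chunk uses the same itemsize:
        store.append('df', chunk, min_itemsize={'wide_col': 1259})
    store.close()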