
As a test, I'm trying to read a small 25 MB csv file using pandas.HDFStore:

import pandas as pd

store = pd.HDFStore('file.h5', mode='w')
for chunk in pd.read_csv('file.csv', chunksize=50000):
    store.append('df', chunk)
store.close()

It causes my computer to thrash and when it finally completes, file.h5 is 6.7 gigs. I don't know what is causing the file size to balloon: when I look at the store afterwards, the only thing in there is the small dataframe. If I read the csv in without chunking and then add it to the store, I have no problems.
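For reference, the non-chunked path that works fine looks something like this (a minimal sketch, assuming the same file names as above):

import pandas as pd

df = pd.read_csv('file.csv')
store = pd.HDFStore('file.h5', mode='w')
store.put('df', df)   # write the whole dataframe in one shot instead of appending chunks
store.close()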

Update 1: I'm running Anaconda with Python 2.7.6, HDF5 1.8.9, numpy 1.8.0, PyTables 3.1.0, pandas 0.13.1, on Ubuntu 12.04. The data is proprietary, so I can't post the chunk contents online. I do have some mixed types. It still crashes if I try to read everything in as object.

Update 2: I dropped all the columns with mixed types and I'm still getting the same issue. I have some very large text columns, if that makes any difference.

Update 3: The problem seems to be loading the dataframe into the HDFStore. I drastically reduced the size of my file but kept one of my very wide columns (1259 characters). Whereas the csv file is 878.6 kB, the HDFStore is 53 MB. Is PyTables unable to handle very wide columns? Is there a threshold above which I should truncate?
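If the store really is padding every row of that column out to the widest string (HDFStore tables keep string columns as fixed-width fields), compression should reclaim most of that padding. A hedged sketch of the same loop with compression enabled on the store (complevel and complib are standard HDFStore options; I haven't measured the effect on this data):

import pandas as pd

# Same chunked loop as above, but with blosc compression turned on for the store.
store = pd.HDFStore('file.h5', mode='w', complevel=9, complib='blosc')
for chunk in pd.read_csv('file.csv', chunksize=50000):
    store.append('df', chunk)
store.close()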


1 Answer


The wide object columns were definitely the problem. My solution was to truncate the object columns as I read them in. If I truncate to a width of 20 characters, the h5 file is only about twice the size of the csv file. But if I truncate to 100 characters, the h5 file ends up roughly 6 times larger.

I've included my code below as my answer, but I'd appreciate it if anyone knows how to reduce this size disparity without having to truncate so much text.

import numpy as np
import pandas as pd

def truncateCol(ser, width=100):
    # Clip object (string) columns to at most `width` characters; leave other dtypes alone.
    if ser.dtype == np.object:
        ser = ser.str[:width] if ser.str.len().max() > width else ser
    return ser

store = pd.HDFStore(filepath, 'w')
for chunk in pd.read_csv(f, chunksize=5000, sep='\t',
                         na_values="null", error_bad_lines=False):
    # Truncate the wide text columns before appending so the table stays small.
    chunk = chunk.apply(truncateCol)
    store.append(table, chunk)
store.close()
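As a quick illustration of the truncation step (made-up data, not from my real file), applying truncateCol column-wise clips only the object column, here to the default 100 characters:

df = pd.DataFrame({'id': [1, 2], 'text': ['x' * 1259, 'short']})
df = df.apply(truncateCol)            # only the object column is clipped
print(df['text'].str.len().max())     # 100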
answered 2014-04-01T18:38:20.537