
I have a DataFrame in Pandas:

In [7]: my_df
Out[7]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 34 entries, 0 to 0
Columns: 2661 entries, airplane to zoo
dtypes: float64(2659), object(2)

When I try to save it to disk:

store = pd.HDFStore(p_full_h5)
store.append('my_df', my_df)

I get:

  File "H5A.c", line 254, in H5Acreate2
    unable to create attribute
  File "H5A.c", line 503, in H5A_create
    unable to create attribute in object header
  File "H5Oattribute.c", line 347, in H5O_attr_create
    unable to create new attribute in header
  File "H5Omessage.c", line 224, in H5O_msg_append_real
    unable to create new message
  File "H5Omessage.c", line 1945, in H5O_msg_alloc
    unable to allocate space for message
  File "H5Oalloc.c", line 1142, in H5O_alloc
    object header message is too large

End of HDF5 error back trace

Can't set attribute 'non_index_axes' in node:
 /my_df(Group) u''.

Why?

Note: in case it matters, the DataFrame column names are simple small strings:

In [12]: max([len(x) for x in list(my_df.columns)])
Out[12]: 47

This is all with Pandas 0.11 and the latest stable versions of IPython, Python, and HDF5.
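For reference, a minimal sketch that should trigger the same error (the 3000-column count and the repro.h5 file name are arbitrary):

import numpy as np
import pandas as pd

# ~3000 float columns is well past the point where the column
# metadata no longer fits in the HDF5 object header
wide = pd.DataFrame(np.random.randn(34, 3000),
                    columns=['c{}'.format(i) for i in range(3000)])

store = pd.HDFStore('repro.h5')
store.append('wide', wide)   # raises: object header message is too large
store.close()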


4 Answers


HDF5 has a header limit of 64kb for all of the columns' metadata. This includes the names, types, and so on. When you get to roughly 2000 columns, you run out of space to store all of the metadata. This is a fundamental limitation of pytables; I don't think they will make workarounds on their side any time soon. You will either have to split the table up or choose another storage format.
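For illustration, a minimal sketch of the splitting approach; store_in_chunks, load_chunks, and the 1000-column chunk size are made-up names and defaults, not pandas API:

import pandas as pd

def store_in_chunks(store, df, key, chunk=1000):
    # put each slice of up to `chunk` columns into its own table,
    # so no single table's metadata exceeds the header limit
    for i in range(0, df.shape[1], chunk):
        store.put('{}/part{:04d}'.format(key, i // chunk),
                  df.iloc[:, i:i + chunk], format='table')

def load_chunks(store, key):
    # read the parts back (zero-padded names keep them sorted)
    # and glue them together column-wise
    parts = [store[k] for k in sorted(store.keys())
             if k.startswith('/{}/part'.format(key))]
    return pd.concat(parts, axis=1)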

Answered 2014-02-11T16:44:09.097

Even though this thread is more than 5 years old, the problem is still relevant. It is still not possible to save a DataFrame with more than 2000 columns into an HDFStore as a single table. Using format='fixed' is not an option if you want to choose which columns to read from the HDFStore later.

Here is a function that splits the DataFrame into smaller ones and stores them as separate tables. Additionally, a pandas.Series is put into the HDFStore that holds the information about which table each column belongs to.

def wideDf_to_hdf(filename, data, columns=None, maxColSize=2000, **kwargs):
    """Write a `pandas.DataFrame` with a large number of columns
    to one HDFStore.

    Parameters
    ----------
    filename : str
        name of the HDFStore
    data : pandas.DataFrame
        data to save in the HDFStore
    columns: list
        a list of columns for storing. If set to `None`, all 
        columns are saved.
    maxColSize : int (default=2000)
        this number defines the maximum possible column size of 
        a table in the HDFStore.

    """
    import numpy as np
    import pandas as pd
    from collections import ChainMap
    store = pd.HDFStore(filename, **kwargs)
    if columns is None:
        columns = data.columns
    colSize = len(columns)
    if colSize > maxColSize:
        numOfSplits = np.ceil(colSize / maxColSize).astype(int)
        colsSplit = [
            columns[i * maxColSize:(i + 1) * maxColSize]
            for i in range(numOfSplits)
        ]
        _colsTabNum = ChainMap(*[
            dict.fromkeys(cols, 'data{}'.format(num))
            for num, cols in enumerate(colsSplit)
        ])
        colsTabNum = pd.Series(dict(_colsTabNum)).sort_index()
        for num, cols in enumerate(colsSplit):
            store.put('data{}'.format(num), data[cols], format='table')
        store.put('colsTabNum', colsTabNum, format='fixed')
    else:
        store.put('data', data[columns], format='table')
    store.close()

A DataFrame stored into the HDFStore with the function above can be read back with the following function.

def read_hdf_wideDf(filename, columns=None, **kwargs):
    """Read a `pandas.DataFrame` from an HDFStore.

    Parameters
    ----------
    filename : str
        name of the HDFStore
    columns : list
        the columns in this list are loaded. Load all columns,
        if set to `None`.

    Returns
    -------
    data : pandas.DataFrame
        loaded data.

    """
    import pandas as pd
    store = pd.HDFStore(filename)
    data = []
    if 'colsTabNum' in store:
        colsTabNum = store.select('colsTabNum')
        if columns is not None:
            # invert the mapping: the index becomes the table name,
            # the values become the requested column names
            tabNums = pd.Series(
                index=colsTabNum[columns].values,
                data=colsTabNum[columns].index).sort_index()
            for table in tabNums.index.unique():
                data.append(store.select(
                    table, columns=list(tabNums.loc[[table]]), **kwargs))
        else:
            for table in colsTabNum.unique():
                data.append(store.select(table, **kwargs))
        data = pd.concat(data, axis=1).sort_index(axis=1)
    else:
        data = store.select('data', columns=columns)
    store.close()
    return data
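
A round-trip with these two functions might look like this (file name, shape, and column names are invented for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(34, 5000),
                  columns=['col{:04d}'.format(i) for i in range(5000)])

wideDf_to_hdf('wide.h5', df, maxColSize=2000)   # stored as data0, data1, data2
sub = read_hdf_wideDf('wide.h5', columns=['col0000', 'col4999'])
full = read_hdf_wideDf('wide.h5')               # all 5000 columns, re-assembled
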
Answered 2018-08-20T12:45:12.733

As of 2014, the hdf docs have been updated:

If you are using HDF5 1.8.0 or earlier versions, there is a limit on the number
of fields you can have in a compound datatype.
This is due to the 64K limit on object header messages, in which datatypes are encoded.
(However, you can create a lot of fields before it will fail.
One user was able to create up to 1260 fields in a compound datatype before it failed.)

As for pandas, it can save a DataFrame with an arbitrary number of columns using the format='fixed' option; the 'table' format still raises the same error as in the topic. I also tried h5py and got the 'header is too big' error as well (even though my version was > 1.8.0).
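
A quick sketch checking that claim (the 5000-column frame and the file names are arbitrary):

import numpy as np
import pandas as pd

wide = pd.DataFrame(np.random.randn(10, 5000),
                    columns=['c{}'.format(i) for i in range(5000)])

wide.to_hdf('fixed.h5', key='wide', format='fixed')     # works: values are stored as one block
# wide.to_hdf('table.h5', key='wide', format='table')   # raises "object header message is too large"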

Answered 2014-12-29T14:24:18.763
###USE get_weights AND set_weights TO SAVE AND LOAD MODEL, RESPECTIVELY.

##############################################################################

import pickle
from keras.models import Sequential
from keras.layers import Conv2D, Activation, Flatten, Dense

#Assuming that this is your model architecture. However, you may use
#whatever architecture you want (big or small; any).
def mymodel():
    inputShape = (28, 28, 3)
    model = Sequential()
    model.add(Conv2D(20, 5, padding="same", input_shape=inputShape))
    model.add(Activation('relu'))
    model.add(Flatten())
    model.add(Dense(500))
    model.add(Activation('relu'))
    model.add(Dense(2, activation="softmax"))
    return model

model = mymodel()
model.fit(...)    #parameters to start training your model




################################################################################
################################################################################
#once your model has been trained, you want to save it to disk
#use the get_weights() command to get your model weights
weigh = model.get_weights()

#now, use pickle to save your model weights, instead of .h5
#for heavy model architectures, saving to an .h5 file can fail
pklfile = "D:/modelweights.pkl"
with open(pklfile, 'wb') as fpkl:    #binary mode works on Python 2 and 3
    pickle.dump(weigh, fpkl, protocol=pickle.HIGHEST_PROTOCOL)




################################################################################
################################################################################
#in the future, you may want to load your model back
#use pickle to load the model weights

pklfile = "D:/modelweights.pkl"
with open(pklfile, 'rb') as f:    #binary mode works on Python 2 and 3
    weigh = pickle.load(f)

restoredmodel = mymodel()
#use set_weights to load the weights into the model architecture
restoredmodel.set_weights(weigh)




################################################################################
################################################################################
#now you can do your testing and evaluation - predictions
y_pred = restoredmodel.predict(X)
Answered 2019-02-01T19:37:06.123