pandas - HDF5 string serialization details in pandas?

Question

I am the author of Saddle (saddle.github.io), which provides functionality similar in spirit to pandas (but in Scala on the JVM). I'm trying to ensure that the HDF5 serialization format of pandas' DataFrame is interoperable with that of Saddle. I'm currently implementing string array serialization in Saddle. So my question is how the pandas DataFrame serializes strings. If I create an HDF5 file in pandas as follows:

from pandas import *
h = HDFStore('tmp.h5')
f = DataFrame({0: [1,2,3], 1: ["a", "b", "c"], 2: [1.5, 2.5, 3.5]})
h.put("f1", f)
h.close()

And h5dump the resulting tmp.h5 file, I see that the string block (block2_values) is stored as datatype H5T_VLEN and attribute

 ATTRIBUTE "CLASS" {
    DATATYPE  H5T_STRING {
          STRSIZE 8;
          STRPAD H5T_STR_NULLTERM;
          CSET H5T_CSET_ASCII;
          CTYPE H5T_C_S1;
       }
    DATASPACE  SCALAR
    DATA {
    (0): "VLARRAY"
    }
 }

This hints at an ASCII character set; however, the bytes I see encoded do not seem to correspond to ASCII (i.e., "a", "b", "c"). Also, I'm curious where STRSIZE 8 comes from. Can anyone shed light on the implementation details of the string serialization that occurs via pandas -> pytables -> hdf5? (I'd also be happy with any pointers to code in pandas/pytables where I can start digging deeper myself :)

score 6 · Accepted Answer

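You have picked an example that looks simple on the surface but is actually quite complicated behind the scenes. It ends up storing 3 different blocks of data (one per dtype), and each block stores an index as well as the data.

The object you are storing is in what I call the Storer format, meaning that the numpy arrays are written in one shot and cannot be changed once written. See the docs here: http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables

The PyTables docs are here: http://pytables.github.io/usersguide/libref/declarative_classes.html#the-atom-class-and-its-descendants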

Unfortunately, in this particular storage format the strings are stored as Python pickles, so I don't know whether you can decode them across platforms.
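This would also account for both observations in the question. The H5T_STRING datatype in the h5dump output belongs to the CLASS attribute, whose value "VLARRAY" is 7 ASCII characters plus a NUL terminator, which gives STRSIZE 8. The dataset itself holds a pickle byte stream rather than the raw characters. A minimal sketch using only the standard library follows; the plain list and the pickle protocol are assumptions made to keep it dependency-free, since the real payload is presumably a pickled NumPy array.

```python
import pickle

# The H5T_STRING type in the h5dump output describes the CLASS attribute,
# whose value is "VLARRAY": 7 ASCII characters plus a NUL terminator.
class_attr = "VLARRAY"
strsize = len(class_attr.encode("ascii")) + 1
print(strsize)  # 8

# Stand-in for the object-dtype string block (assumption: a plain list is
# used here only so the sketch runs without NumPy or PyTables).
values = ["a", "b", "c"]
payload = pickle.dumps(values, protocol=2)  # protocol chosen for illustration

# The stored bytes are a pickle opcode stream, not the characters "abc".
print(payload.startswith(b"\x80\x02"))  # True: protocol-2 header
print(payload == b"abc")                # False

# Only a pickle-aware reader gets the original values back.
restored = pickle.loads(payload)
print(restored)
```

A reader on the JVM would therefore need a pickle decoder that also understands the NumPy reconstruction calls, which is why the Table format is the easier target.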

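You will have a much easier time reading a Table object, which is stored using more basic types that are easy to export (for example, there is a section in the docs about exporting to R).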
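As a rough illustration of what "more basic types" means for strings: in a table, a string column is a fixed-width byte-string column (the StringCol entry in the ptdump output further down). The sketch below mimics that layout with hypothetical helpers encode_fixed and decode_fixed. It assumes cells are UTF-8 bytes padded with trailing NUL bytes to a common width, as NumPy fixed-width byte strings are, and it is not the pandas or PyTables implementation.

```python
def encode_fixed(values, itemsize=None):
    """Encode strings as UTF-8 cells padded with NUL bytes to one width."""
    raw = [v.encode("utf-8") for v in values]
    # Default width: the longest encoded value (at least 1 byte).
    width = itemsize if itemsize is not None else max(
        (len(r) for r in raw), default=1
    )
    for r in raw:
        if len(r) > width:
            raise ValueError("value %r does not fit in %d bytes" % (r, width))
    return width, [r.ljust(width, b"\x00") for r in raw]


def decode_fixed(cells):
    """Strip the NUL padding and decode each cell as UTF-8."""
    return [c.rstrip(b"\x00").decode("utf-8") for c in cells]


width, cells = encode_fixed(["a", "b", "c"])
print(width, cells)  # 1 [b'a', b'b', b'c']

width2, cells2 = encode_fixed(["pandas", "é"])
print(width2, decode_fixed(cells2))
```

One consequence of this layout is that a value ending in NUL characters cannot be told apart from padding, so such values would not round-trip.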

Try reading this format:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({0: [1,2,3], 1: ["a", "b", "c"], 2: [1.5, 2.5, 3.5]})

In [4]: h = pd.HDFStore('tmp.h5')

In [6]: h.put('df', df, table=True)  # Table format; later pandas releases spell this format='table'

In [7]: h.close()

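Using the PyTables utility ptdump -avd tmp.h5, this yields the output below. If you are reading with PyTables 3.0.0 (just released) or under py3 (which we will support in 0.11.1), the strings are all UTF-8 encoded and written as bytes. Before PyTables 3.0.0, I believe the strings were written as ASCII.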

/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    PYTABLES_FORMAT_VERSION := '2.0',
    TITLE := '',
    VERSION := '1.0']
/df (Group) ''
  /df._v_attrs (AttributeSet), 12 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := [],
    index_cols := [(0, 'index')],
    levels := 1,
    nan_rep := b'nan',
    non_index_axes := b"(lp1\n(I1\n(lp2\ncnumpy.core.multiarray\nscalar\np3\n(cnumpy\ndtype\np4\n(S'i8'\nI0\nI1\ntRp5\n(I3\nS'<'\nNNNI-1\nI-1\nI0\ntbS'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\ntRp6\nag3\n(g5\nS'\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\ntRp7\nag3\n(g5\nS'\\x02\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\ntRp8\natp9\na.",
    pandas_type := b'frame_table',
    pandas_version := b'0.10.1',
    table_type := b'appendable_frame',
    values_cols := ['values_block_0', 'values_block_1', 'values_block_2']]
/df/table (Table(3,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "values_block_1": Int64Col(shape=(1,), dflt=0, pos=2),
  "values_block_2": StringCol(itemsize=1, shape=(1,), dflt=b'', pos=3)}
  byteorder := 'little'
  chunkshape := (2621,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /df/table._v_attrs (AttributeSet), 19 attributes:
   [CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0.0,
    FIELD_1_NAME := 'values_block_0',
    FIELD_2_FILL := 0,
    FIELD_2_NAME := 'values_block_1',
    FIELD_3_FILL := b'',
    FIELD_3_NAME := 'values_block_2',
    NROWS := 3,
    TITLE := '',
    VERSION := '2.6',
    index_kind := b'integer',
    values_block_0_dtype := b'float64',
    values_block_0_kind := b"(lp1\ncnumpy.core.multiarray\nscalar\np2\n(cnumpy\ndtype\np3\n(S'i8'\nI0\nI1\ntRp4\n(I3\nS'<'\nNNNI-1\nI-1\nI0\ntbS'\\x02\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\ntRp5\na.",
    values_block_1_dtype := b'int64',
    values_block_1_kind := b"(lp1\ncnumpy.core.multiarray\nscalar\np2\n(cnumpy\ndtype\np3\n(S'i8'\nI0\nI1\ntRp4\n(I3\nS'<'\nNNNI-1\nI-1\nI0\ntbS'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\ntRp5\na.",
    values_block_2_dtype := b'string8',
    values_block_2_kind := b"(lp1\ncnumpy.core.multiarray\nscalar\np2\n(cnumpy\ndtype\np3\n(S'i8'\nI0\nI1\ntRp4\n(I3\nS'<'\nNNNI-1\nI-1\nI0\ntbS'\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\ntRp5\na."]
  Data dump:
[0] (0, [1.5], [1], [b'a'])
[1] (1, [2.5], [2], [b'b'])
[2] (2, [3.5], [3], [b'c'])
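Note that even in this format a few attributes (non_index_axes and the values_block_N_kind entries) are still protocol-0 pickles. Judging from the dump, each kind attribute appears to hold the column labels of its block as NumPy int64 scalars. The sketch below walks the opcodes with the standard pickletools module, so nothing is unpickled and NumPy is not needed. The helper name int64_labels and the heuristic of treating every 8-byte string argument as a little-endian int64 are assumptions for illustration only.

```python
import pickletools
import struct

# values_block_2_kind, copied from the ptdump output above.
KIND_ATTR = (
    b"(lp1\ncnumpy.core.multiarray\nscalar\np2\n(cnumpy\ndtype\np3\n"
    b"(S'i8'\nI0\nI1\ntRp4\n(I3\nS'<'\nNNNI-1\nI-1\nI0\ntb"
    b"S'\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\ntRp5\na."
)


def int64_labels(payload):
    """Read every 8-byte STRING argument as a little-endian int64."""
    labels = []
    for opcode, arg, _pos in pickletools.genops(payload):
        if opcode.name != "STRING":
            continue
        # pickletools may hand the argument back as text; normalise to bytes.
        raw = arg if isinstance(arg, bytes) else arg.encode("latin-1")
        if len(raw) == 8:
            labels.append(struct.unpack("<q", raw)[0])
    return labels


print(int64_labels(KIND_ATTR))  # [1]
```

The result matches the original frame, where column 1 holds the strings. The same walk over values_block_0_kind and values_block_1_kind would point at columns 2 and 0, the float and integer columns.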

It is probably best to contact me offline to discuss this further.
