python - Storing a list of strings to an HDF5 dataset from Python

Question

I am trying to store a variable-length list of strings in an HDF5 dataset. The code is

import h5py
h5File=h5py.File('xxx.h5','w')
strList=['asas','asas','asas']  
h5File.create_dataset('xxx',(len(strList),1),'S10',strList)
h5File.flush() 
h5File.close()

I get an error message stating "TypeError: No conversion path for dtype: dtype('<U3')", where < is the actual less-than symbol.
How can I solve this problem?

score 32 · Accepted Answer

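You are reading in Unicode strings but specifying your datatype as ASCII. According to the h5py wiki, h5py does not currently support this conversion.

You will need to encode the strings in a format that h5py handles: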

asciiList = [n.encode("ascii", "ignore") for n in strList]
h5File.create_dataset('xxx', (len(asciiList),1),'S10', asciiList)

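Note: not everything that can be encoded in UTF-8 can be encoded in ASCII! With the "ignore" error handler, any character that has no ASCII representation is silently dropped.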
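To make that caveat concrete, here is a minimal sketch using only the standard library. The helper name to_ascii_bytes and the sample strings are made up for illustration; the point is only to contrast the "ignore" handler used above with the default "strict" handler.

```python
def to_ascii_bytes(strings, errors="strict"):
    """Encode each str to ASCII bytes (hypothetical helper, not part of h5py)."""
    return [s.encode("ascii", errors) for s in strings]


# "caf\u00e9" ends in a non-ASCII character (e with an acute accent).
str_list = ["asas", "caf\u00e9"]

# "ignore" silently drops characters that ASCII cannot represent.
lossy = to_ascii_bytes(str_list, errors="ignore")

# "strict" (the default) raises instead of losing data.
try:
    to_ascii_bytes(str_list)
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(lossy, strict_failed)
```

If silent data loss is not acceptable, encoding with the default strict handler and dealing with the exception is probably the safer choice.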

score 16 · Accepted Answer

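From https://docs.h5py.org/en/stable/special.html:

In HDF5, data in VL format is stored as arbitrary-length vectors of a base type. In particular, strings are stored C-style in null-terminated buffers. NumPy has no native mechanism to support this. Unfortunately, this is the de facto standard for representing strings in the HDF5 C API, and in many HDF5 applications.

Thankfully, NumPy has a generic pointer type in the form of the "object" ("O") dtype. In h5py, variable-length strings are mapped to object arrays. A small amount of metadata attached to an "O" dtype tells h5py that its contents should be converted to VL strings when stored in the file.

Existing VL strings can be read and written to with no additional effort; Python strings and fixed-length NumPy strings can be auto-converted to VL data and stored.

Example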

In [27]: dt = h5py.special_dtype(vlen=str)

In [28]: dset = h5File.create_dataset('vlen_str', (100,), dtype=dt)

In [29]: dset[0] = 'the change of water into water vapour'

In [30]: dset[0]
Out[30]: 'the change of water into water vapour'
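Applied to the list from the question, the same idea might look like the sketch below. It assumes h5py 2.10 or later, where h5py.string_dtype() is available as the newer spelling of special_dtype(vlen=str). The helper name roundtrip_strings and the in-memory file name are made up for illustration. Because h5py 3 returns bytes when reading variable-length strings, the sketch decodes the values explicitly.

```python
import h5py


def roundtrip_strings(strings):
    """Write strings to a variable-length string dataset and read them back.

    Hypothetical helper for illustration; uses an in-memory file so that
    nothing is written to disk.
    """
    dt = h5py.string_dtype(encoding="utf-8")  # assumes h5py >= 2.10
    with h5py.File("in_memory.h5", "w", driver="core", backing_store=False) as f:
        dset = f.create_dataset("xxx", shape=(len(strings),), dtype=dt)
        # Assign element by element, as in the documentation example above.
        for i, s in enumerate(strings):
            dset[i] = s
        raw = dset[:]
    # h5py 3 yields bytes here; older versions may yield str.
    return [v.decode("utf-8") if isinstance(v, bytes) else v for v in raw]


result = roundtrip_strings(["asas", "a much longer string than ten characters"])
print(result)
```

Unlike the fixed-width 'S10' dtype, a variable-length dtype does not impose a maximum length on the stored strings.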

score 5 · Accepted Answer

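I was in a similar situation, wanting to store the column names of a dataframe as a dataset in an HDF5 file. Assuming df.columns is what I want to store, I found that the following works: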

h5File = h5py.File('my_file.h5','w')
h5File['col_names'] = df.columns.values.astype('S')

This assumes the column names are "simple" strings that can be encoded in ASCII.
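As a rough illustration of what astype('S') does, the sketch below uses a plain NumPy array in place of df.columns.values. The column names are invented for the example, and no HDF5 file is involved.

```python
import numpy as np

# Stand-in for df.columns.values (hypothetical column names).
col_names = np.array(["alpha", "beta", "gamma"])

# astype('S') converts the Unicode array to fixed-width byte strings,
# sized to fit the longest entry.
as_bytes = col_names.astype("S")

# Converting back with astype('U') recovers Python-style strings.
restored = as_bytes.astype("U")

print(as_bytes.dtype, list(restored))
```

The byte-string array is the form h5py can store directly as a fixed-length string dataset; when reading it back, a conversion along the lines of astype('U') should recover the original names.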
