python - Merging HDF5 files

Question

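I have a number of HDF5 files, each of which contains a single dataset. The datasets are too large to hold in RAM. I would like to combine these files into a single file that contains all the datasets separately (i.e. without concatenating them into one dataset).

One way to do this is to create an HDF5 file and then copy the datasets one by one. That would be slow and complicated, because it would need to be a buffered copy.

Is there a simpler way to do this? It seems like there should be, since it is essentially just creating a container file.

I am using python/h5py.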

score 36 · Accepted Answer

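This is actually one of the use cases of HDF5. If you just want to be able to access all the datasets from a single file, and don't care how they are actually stored on disk, you can use external links. From the HDF5 website:

External links allow a group to include objects in another HDF5 file and enable the library to access those objects as if they are in the current file. In this manner, a group may appear to directly contain datasets, named datatypes, and even groups that are actually in a different file. This feature is implemented via a suite of functions that create and manage the links, define and retrieve paths to external objects, and interpret link names.

Here's how to do it in h5py: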

import h5py

myfile = h5py.File('foo.hdf5', 'a')
myfile['ext link'] = h5py.ExternalLink("otherfile.hdf5", "/path/to/resource")

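Note: if myfile is an existing file, open it with 'a'. If you open it with 'w', its contents will be erased.

This will be much faster than copying all the datasets into a new file. I don't know how fast access to otherfile.hdf5 will be, but operating on the datasets is transparent: h5py sees all of them as residing in foo.hdf5.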
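For the many-files case in the question, the same idea can be looped. The following is a minimal sketch, not a definitive implementation: the helper name `link_datasets` and the choice to group links under the source file's base name are assumptions made for illustration, and it only links objects found at the root of each source file.

```python
import os

import h5py


def link_datasets(container_path, source_paths):
    """Add one external link per root-level object of each source file."""
    with h5py.File(container_path, "a") as container:
        for src in source_paths:
            # Read the object names, then close the source again.
            with h5py.File(src, "r") as f:
                names = list(f.keys())
            # Hypothetical layout: one group per source file, named after it,
            # so equally named datasets from different files do not collide.
            prefix = os.path.splitext(os.path.basename(src))[0]
            for name in names:
                # Only the file name and object path are stored; no data is copied.
                container[f"{prefix}/{name}"] = h5py.ExternalLink(src, "/" + name)
```

Calling `link_datasets("all.h5", ["a.h5", "b.h5"])` would then expose `a/<name>` and `b/<name>` in `all.h5`. The source files have to stay where the links point, because the container holds references rather than data.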

score 16 · Accepted Answer

One solution is to use h5py's interface to the low-level H5Ocopy function of the HDF5 API, in particular the h5py.h5o.copy function:

In [1]: import h5py as h5

In [2]: hf1 = h5.File("f1.h5")

In [3]: hf2 = h5.File("f2.h5")

In [4]: hf1.create_dataset("val", data=35)
Out[4]: <HDF5 dataset "val": shape (), type "<i8">

In [5]: hf1.create_group("g1")
Out[5]: <HDF5 group "/g1" (0 members)>

In [6]: hf1.get("g1").create_dataset("val2", data="Thing")
Out[6]: <HDF5 dataset "val2": shape (), type "|O8">

In [7]: hf1.flush()

In [8]: h5.h5o.copy(hf1.id, "g1", hf2.id, "newg1")

In [9]: h5.h5o.copy(hf1.id, "val", hf2.id, "newval")

In [10]: hf2.values()
Out[10]: [<HDF5 group "/newg1" (1 members)>, <HDF5 dataset "newval": shape (), type "<i8">]

In [11]: hf2.get("newval").value
Out[11]: 35

In [12]: hf2.get("newg1").values()
Out[12]: [<HDF5 dataset "val2": shape (), type "|O8">]

In [13]: hf2.get("newg1").get("val2").value
Out[13]: 'Thing'

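The above was generated with h5py version 2.0.1-2+b1 and iPython version 0.13.1-2+deb7u1 on top of Python version 2.7.3-4+deb7u1, from a more-or-less vanilla install of Debian Wheezy. The files f1.h5 and f2.h5 did not exist before the above was executed. Note that, per salotz, for Python 3 the dataset/group names need to be bytes (e.g. b"val"), not str.

The hf1.flush() in command [7] is crucial, because the low-level interface apparently always draws from the version of the .h5 file stored on disk, not the one cached in memory. Copying datasets to or from groups that are not at the root of a File can be done by supplying the ID of that group, e.g. hf1.get("g1").id.

Note that h5py.h5o.copy will fail with an exception (no clobber) if an object of the indicated name already exists in the destination location.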
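The session above reflects an old h5py release; newer releases open files read-only by default and no longer provide `.value`. The same copy can also be written with the high-level `Group.copy` method, which wraps the same HDF5 copy routine. This is a rough sketch under a few assumptions: the helper name `copy_all` is made up for illustration, only root-level objects are copied, and names that already exist in the destination are skipped rather than raising.

```python
import h5py


def copy_all(src_path, dst_path):
    """Copy every root-level object from src_path into dst_path.

    Returns the names that were skipped because they already existed.
    """
    skipped = []
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "a") as dst:
        for name in src:
            if name in dst:
                # HDF5 refuses to overwrite an existing object, so skip it.
                skipped.append(name)
                continue
            # Groups are copied recursively together with their members.
            src.copy(name, dst, name=name)
    return skipped
```

Running `copy_all("f1.h5", "merged.h5")` once per input file would build up the combined file. The copy happens inside the HDF5 library, so the dataset does not have to be read into a Python array first.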

score 12 · Accepted Answer

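I found a non-python solution by using h5copy from the official HDF5 tools. h5copy can copy individual specified datasets from one HDF5 file into another existing HDF5 file.

If someone finds a python/h5py-based solution I would be glad to hear about it.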

score 2 · Accepted Answer

I usually use ipython and the h5copy tool together, which is much faster than a pure python solution, once h5copy is installed.

Console solution MWE

# PLEASE NOTE: THIS IS IPYTHON CONSOLE CODE, NOT PURE PYTHON

import h5py
# for every file Dn.h5 you want to merge into Output.h5
f = h5py.File('D1.h5', 'r')  # file to be merged (read-only access is enough)
h5_keys = list(f.keys())  # copy the keys into a list before closing (you can remove the keys you don't use)
f.close()  # close the file
for i in h5_keys:
    !h5copy -i 'D1.h5' -o 'Output.h5' -s {i} -d {i}

Automated console solution

To fully automate the process, assuming you are working in the folder where the files to be merged are stored:

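import os
import h5py

# only the .h5 inputs; skip the output file in case it already exists
d_names = [n for n in os.listdir(os.getcwd()) if n.endswith('.h5') and n != 'output.h5']
d_struct = {}  # here we will store the database structure
for i in d_names:
    f = h5py.File(i, 'r')
    d_struct[i] = list(f.keys())  # materialize the keys before the file is closed
    f.close()

# A) no extra groups: copy everything to the root of the new .h5 file
for i in d_names:
    for j in d_struct[i]:
        !h5copy -i '{i}' -o 'output.h5' -s {j} -d {j}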

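Create a new group for each .h5 file added

If you want to keep the datasets from each input file separate inside output.h5, you have to create the group first using the -p flag: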

# B) Create a new group in the output.h5 file for every input .h5 file
for i in d_names:
    keys = list(d_struct[i])
    dataset = keys[0]
    newgroup = '%s/%s' % (i[:-3], dataset)
    # -p creates the parent group the first time it is needed
    !h5copy -i '{i}' -o 'output.h5' -s {dataset} -d {newgroup} -p
    for j in keys[1:]:
        newgroup = '%s/%s' % (i[:-3], j)
        !h5copy -i '{i}' -o 'output.h5' -s {j} -d {newgroup}

score 2 · Accepted Answer

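As an update on this: HDF5 version 1.10 comes with a new feature called "Virtual Datasets" that might be useful in this context.
Here you can find a brief tutorial and some explanations: Virtual Datasets.
More complete and detailed explanations and documentation for the feature are here:
Virtual Datasets extra doc.
And here is the merged pull request that adds the virtual datasets API to h5py:
h5py Virtual Datasets PR, but I don't know whether it is already available in the current h5py version or will come later.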
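For readers on a recent h5py, the virtual dataset API is exposed as `h5py.VirtualLayout`, `h5py.VirtualSource` and `create_virtual_dataset`. The sketch below is only illustrative and rests on several assumptions: it needs an h5py build against HDF5 1.10 or later, every source file is assumed to hold a 1-D dataset of the same length under the same name, and the helper name `build_virtual_stack` and the output name "stacked" are invented here. Unlike the external-link approach, this presents the sources as rows of one dataset, which may or may not be what the question wants.

```python
import h5py


def build_virtual_stack(out_path, source_paths, dataset_name, length, dtype="i8"):
    """Map equally sized 1-D datasets from several files onto rows of one virtual dataset."""
    layout = h5py.VirtualLayout(shape=(len(source_paths), length), dtype=dtype)
    for row, path in enumerate(source_paths):
        # Each source is described by file name, dataset name and shape only;
        # the source file is not read at this point.
        layout[row] = h5py.VirtualSource(path, dataset_name, shape=(length,))
    with h5py.File(out_path, "w", libver="latest") as f:
        # fillvalue is what a reader sees if a source file is missing.
        f.create_virtual_dataset("stacked", layout, fillvalue=0)
```

No data is copied: reading `stacked` pulls the values from the source files on demand.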

score 1 · Accepted Answer

To use Python (rather than IPython) together with h5copy to merge HDF5 files, we can build on GM's answer:

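import h5py
import os

# only the .h5 inputs; skip the output file in case it already exists
d_names = [n for n in os.listdir(os.getcwd()) if n.endswith('.h5') and n != 'output.h5']
d_struct = {}  # here we will store the database structure
for i in d_names:
    f = h5py.File(i, 'r')
    d_struct[i] = list(f.keys())  # materialize the keys before the file is closed
    f.close()

for i in d_names:
    for j in d_struct[i]:
        os.system('h5copy -i %s -o output.h5 -s %s -d %s' % (i, j, j))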
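Building the command as a string breaks if a file or dataset name contains spaces or shell metacharacters. One possible variant, sketched here with hypothetical helper names (`h5copy_args`, `run_h5copy`), passes an argument list to `subprocess.run` instead and optionally places each input file's objects in their own group. It uses only the h5copy flags that already appear in this thread (-i, -o, -s, -d, -p).

```python
import os
import subprocess


def h5copy_args(src_file, obj_name, out_file="output.h5", group_per_file=False):
    """Build the h5copy argument list for copying one object."""
    dest = obj_name
    if group_per_file:
        # Destination becomes <input file name without extension>/<object name>.
        stem = os.path.splitext(os.path.basename(src_file))[0]
        dest = f"{stem}/{obj_name}"
    args = ["h5copy", "-i", src_file, "-o", out_file, "-s", obj_name, "-d", dest]
    if group_per_file:
        # -p asks h5copy to create the parent group if it is missing.
        args.append("-p")
    return args


def run_h5copy(src_file, obj_name, **kwargs):
    """Run h5copy without a shell; raises CalledProcessError on failure."""
    subprocess.run(h5copy_args(src_file, obj_name, **kwargs), check=True)
```

In the loop above, `os.system(...)` could then be swapped for `run_h5copy(i, j)` or `run_h5copy(i, j, group_per_file=True)`.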
