python - How can I merge different MATLAB .mat files, preserving metadata, for use in Python?

Question

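I have more than 1,000 very long MATLAB vectors (of varying lengths, ~10^8 samples each) representing data from different patients and sources. I want to organize them compactly in a single file so that they are convenient to access later in Python. I would like each sample to carry additional information in some way (patient ID, sampling frequency, etc.).

The layout should be: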

Hospital 1:
   Pat. 1:
      vector:sample 1
      vector:sample 2

   Pat. 2:
      vector:sample 1
      vector:sample 2


Hospital 2:
   Pat. 1:
      vector:sample 1
      vector:sample 2
    .
    .
    .

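I thought about converting the samples to HDF5 files, adding the metadata, and then merging the several HDF5 files into one, but I ran into difficulties.

Already tried:

MATLAB: the high-level HDF5 MATLAB functions.
MATLAB: saving the variables as v7.3 MAT-files (which are actually HDF5).
Python: sidekit_io.h5merge

Open to suggestions!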

score 1 · Accepted Answer

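Regarding the format you gave above, you may want to store the vectors in a matrix. For a patient sample with hospital: 2, pat_ID: 3455679, age: 34, high_blood_pressure: NO (binary 0), you could store the metadata under the columns "hospital number", "patient ID", "age", "high_blood_pressure", ... as 2, 3455679, 34, 0, ...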

a = [1:10]' %vector 1
b = [1:10]' %vector 2
c = [a,b]   %matrix holding vectors 1 and 2
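
Since the data will be used from Python, the "one metadata row per sample" idea can also be sketched with a NumPy structured array. This is only an illustration: the field names follow the example above, the first row is the example patient, and the other two rows are made up so that the filtering has something to select from.

```python
import numpy as np

# One record per sample; field names mirror the metadata columns above.
meta_dtype = np.dtype([
    ("hospital", np.int32),
    ("patient_id", np.int64),
    ("age", np.int32),
    ("high_blood_pressure", np.uint8),  # 0 = no, 1 = yes
])

meta = np.array(
    [
        (2, 3455679, 34, 0),  # the example row from the text above
        (2, 1000001, 58, 1),  # made-up row, for illustration only
        (1, 1000002, 41, 0),  # made-up row, for illustration only
    ],
    dtype=meta_dtype,
)

# Select all records from hospital 2, then look up one patient by ID.
hospital_2 = meta[meta["hospital"] == 2]
row = meta[meta["patient_id"] == 3455679][0]
```

Note that vectors of unequal length do not fit into one rectangular matrix without padding, so a table like this would describe the samples while the vectors themselves are stored separately.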

score 0 · Accepted Answer

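I see at least two ways to do this with HDF5. You can copy all of the data into a single file; gigabytes of data are not a problem for HDF5 (given sufficient resources). Alternatively, you can keep the patient data in separate files and use external links in a central HDF5 file to point to the data. Once the links are created, you can access the data "as if" it were in that file. Both methods shown below use small, simple "samples" generated randomly with NumPy. Each sample is a dataset with attributes for the hospital, patient, and sample IDs.

Method 1: All data in a single file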

import h5py
import numpy as np

num_h = 3  # number of hospitals
num_p = 5  # patients per hospital
num_s = 2  # samples per patient

with h5py.File('SO_59556149.h5', 'w') as h5f:

    for h_cnt in range(num_h):
        for p_cnt in range(num_p):
            for s_cnt in range(num_s):
                ds_name = 'H_' + str(h_cnt) + \
                          '_P_' + str(p_cnt) + \
                          '_S_' + str(s_cnt)
                # Create sample vector data and add to a dataset
                vec_arr = np.random.rand(1000, 1)
                dset = h5f.create_dataset(ds_name, data=vec_arr)
                # add attributes of Hospital, Patient and Sample ID
                dset.attrs['Hospital ID'] = h_cnt
                dset.attrs['Patient ID'] = p_cnt
                dset.attrs['Sample ID'] = s_cnt
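
The flat dataset names above could also be arranged as nested groups that mirror the Hospital / Patient / sample layout from the question. The following is a minimal sketch under assumptions, not a definitive implementation: the helper names `add_sample` and `find_samples`, the attribute name `fs_hz`, the sampling rate of 250.0, and the file name are all hypothetical, and random vectors stand in for the real recordings.

```python
import os
import tempfile

import h5py
import numpy as np


def add_sample(h5f, hospital, patient, sample, data, fs_hz):
    """Store one vector at Hospital_<h>/Pat_<p>/sample_<s> with metadata."""
    path = f"Hospital_{hospital}/Pat_{patient}/sample_{sample}"
    # h5py creates the intermediate groups in the path automatically.
    dset = h5f.create_dataset(path, data=data)
    dset.attrs["Hospital ID"] = hospital
    dset.attrs["Patient ID"] = patient
    dset.attrs["Sample ID"] = sample
    dset.attrs["fs_hz"] = fs_hz  # hypothetical name for the sampling frequency
    return dset


def find_samples(h5f, wanted):
    """Return sorted paths of datasets whose attributes match every item in wanted."""
    matches = []

    def visitor(name, obj):
        if isinstance(obj, h5py.Dataset):
            if all(obj.attrs.get(key) == value for key, value in wanted.items()):
                matches.append(name)

    h5f.visititems(visitor)
    return sorted(matches)


path = os.path.join(tempfile.mkdtemp(), "merged_sketch.h5")
rng = np.random.default_rng(0)

with h5py.File(path, "w") as h5f:
    for h in range(2):
        for p in range(3):
            for s in range(2):
                vec = rng.random(1000 + 10 * s)  # vector lengths may differ
                add_sample(h5f, h, p, s, vec, fs_hz=250.0)

# Read back: locate one patient's samples by metadata, without loading the data.
with h5py.File(path, "r") as h5f:
    hits = find_samples(h5f, {"Hospital ID": 1, "Patient ID": 2})
    lengths = [h5f[name].shape[0] for name in hits]
    fs = float(h5f[hits[0]].attrs["fs_hz"])
```

Because each sample is its own dataset, vectors of different lengths can sit side by side, and a dataset's values are only read from disk when it is indexed.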

Method 2: External links to patient data in separate files

import h5py
import numpy as np

num_h = 3  # number of hospitals
num_p = 5  # patients per hospital
num_s = 2  # samples per patient

with h5py.File('SO_59556149_link.h5', 'w') as h5f:

    for h_cnt in range(num_h):
        for p_cnt in range(num_p):
            # One separate file per patient
            fname = 'SO_59556149_' + 'H_' + str(h_cnt) + '_P_' + str(p_cnt) + '.h5'
            with h5py.File(fname, 'w') as h5f2:
                for s_cnt in range(num_s):
                    ds_name = 'H_' + str(h_cnt) + \
                              '_P_' + str(p_cnt) + \
                              '_S_' + str(s_cnt)
                    # Create sample vector data and add to a dataset
                    vec_arr = np.random.rand(1000, 1)
                    dset = h5f2.create_dataset(ds_name, data=vec_arr)
                    # add attributes of Hospital, Patient and Sample ID
                    dset.attrs['Hospital ID'] = h_cnt
                    dset.attrs['Patient ID'] = p_cnt
                    dset.attrs['Sample ID'] = s_cnt
                    # Link from the central file to the dataset in the patient file
                    h5f[ds_name] = h5py.ExternalLink(fname, ds_name)
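
To illustrate the "as if" access mentioned above, here is a small self-contained sketch that writes one patient file, links to it from a central file, and reads the data and attributes back through the link. The file names and the single `Patient ID` attribute are placeholders for this sketch; an absolute path is used for the link target so the example does not depend on the working directory.

```python
import os
import tempfile

import h5py
import numpy as np

workdir = tempfile.mkdtemp()
patient_file = os.path.join(workdir, "patient_sketch.h5")  # hypothetical name
central_file = os.path.join(workdir, "central_sketch.h5")  # hypothetical name
ds_name = "H_0_P_0_S_0"

# Write the per-patient file holding the actual data.
with h5py.File(patient_file, "w") as pf:
    dset = pf.create_dataset(ds_name, data=np.arange(10, dtype=np.float64))
    dset.attrs["Patient ID"] = 0

# The central file stores only a link, not a copy of the data.
with h5py.File(central_file, "w") as cf:
    cf[ds_name] = h5py.ExternalLink(patient_file, ds_name)

with h5py.File(central_file, "r") as cf:
    # getlink=True returns the link object itself instead of following it.
    link = cf.get(ds_name, getlink=True)
    link_path = link.path
    # Ordinary indexing follows the link into the patient file.
    linked = cf[ds_name]
    values = linked[...]
    patient_id = int(linked.attrs["Patient ID"])
```

The trade-off with this approach is that the central file stays small, but the links break if the patient files are moved or renamed without the links being updated.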
