python - Accessing data range with h5py

Question

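I have an h5 file that contains 62 different attributes. I would like to access the data range of each one of them.

To explain more about what I'm doing here: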

import h5py 
the_file =  h5py.File("myfile.h5","r")
data = the_file["data"]
att = data.keys()

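The code above gives me a list of attributes: "U", "T", "H", ... and so on.

Say I want to know the minimum and maximum value of "U". How can I do that?

This is the output of running "h5dump -H":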

HDF5 "myfile.h5" {
GROUP "/" {
   GROUP "data" {
      ATTRIBUTE "datafield_names" {
         DATATYPE  H5T_STRING {
               STRSIZE 8;
               STRPAD H5T_STR_SPACEPAD;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
         DATASPACE  SIMPLE { ( 62 ) / ( 62 ) }
      }
      ATTRIBUTE "dimensions" {
         DATATYPE  H5T_STD_I32BE
         DATASPACE  SIMPLE { ( 4 ) / ( 4 ) }
      }
      ATTRIBUTE "time_variables" {
         DATATYPE  H5T_IEEE_F64BE
         DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
      }
      DATASET "Temperature" {
         DATATYPE  H5T_IEEE_F64BE
         DATASPACE  SIMPLE { ( 256, 512, 1024 ) / ( 256, 512, 1024 ) }
      }

score 10 · Accepted Answer

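This may just be a difference in terminology, but hdf5 attributes are accessed via the attrs attribute of a Dataset object. I would call what you have variables or datasets. Anyway...

I'm guessing from your description that the attributes are just arrays. You should be able to do the following to get the data for each attribute and then calculate the min and max like for any numpy array: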

attr_data = data["U"][:]  # gets a copy of the array
# Named u_min/u_max so the builtins min() and max() are not shadowed
u_min = attr_data.min()
u_max = attr_data.max()

So if you want the min/max of every attribute, you can either write a for loop over the attribute names or use:

for attr_name, attr_value in data.items():
    attr_min = attr_value[:].min()
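
The loop above can be fleshed out into a small helper. This is only a sketch: the function name `dataset_ranges` and the demo file contents are made up for illustration, and it assumes every dataset in the group is numeric and non-empty. It skips sub-groups, since only datasets hold array data.

```python
import h5py
import numpy as np


def dataset_ranges(group):
    """Return {name: (min, max)} for every dataset directly inside `group`."""
    ranges = {}
    for name, item in group.items():
        # A group can also contain sub-groups; only datasets carry array data.
        if isinstance(item, h5py.Dataset):
            values = item[...]  # read the whole dataset into a NumPy array
            ranges[name] = (values.min(), values.max())
    return ranges


# Demo on a throwaway in-memory file (hypothetical data, nothing is written to disk).
with h5py.File("demo_ranges.h5", "w", driver="core", backing_store=False) as f:
    grp = f.create_group("data")
    grp.create_dataset("U", data=np.array([1.0, 5.0, 4.0]))
    grp.create_dataset("T", data=np.array([6.0, 2.0, 3.0]))
    for name, (lo, hi) in dataset_ranges(grp).items():
        print(name, lo, hi)
```

With a real file you would pass `the_file["data"]` instead of the demo group.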

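Edit to answer your first comment:

h5py's objects can be used like Python dictionaries. So when you use "keys()" you are not actually getting the data, you are getting the names (or keys) of that data. For example, if you run the_file.keys() you will get a list of every item (group or dataset) in the root path of that hdf5 file. If you keep going down a path you will eventually reach the dataset that holds the actual binary data. So, for example, you could start with (in an interpreter first):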

import h5py

the_file = h5py.File("myfile.h5", "r")
print(list(the_file.keys()))
# this will result in a list of keys maybe ["raw_data","meta_data"] or something
print(list(the_file["raw_data"].keys()))
# this will result in another list of keys maybe ["temperature","humidity"]
# eventually you'll get to the dataset that actually has the data or attributes you are looking for
# think of this process as going through a directory structure or a path to get to a file (or a dataset/variable in this case)
the_data_var = the_file["raw_data"]["temperature"]
the_data_array = the_data_var[:]

print(list(the_data_var.attrs.keys()))
# this will result in a list of attribute names/keys
# attrs[...] already returns the value itself, so no [:] is needed
an_attr_of_the_data = the_data_var.attrs["measurement_time"]

# So now you have "the_data_array" which is a numpy array and "an_attr_of_the_data" which is whatever it happened to be
# you can get the min/max of the data by doing like before
print(the_data_array.min())
print(the_data_array.max())
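
If you would rather not walk the hierarchy by hand, h5py can visit every object in the file for you. The sketch below uses `visititems`; the helper name `list_datasets` and the demo layout ("raw_data/temperature") are hypothetical and only mirror the made-up names used above.

```python
import h5py
import numpy as np


def list_datasets(h5obj):
    """Return (path, shape, dtype) for every dataset below a file or group."""
    found = []

    def visit(name, obj):
        # visititems calls this for every group and dataset; keep datasets only.
        if isinstance(obj, h5py.Dataset):
            found.append((name, obj.shape, obj.dtype))
        # Returning None tells h5py to keep visiting.

    h5obj.visititems(visit)
    return found


# Demo on a throwaway in-memory file (hypothetical layout).
with h5py.File("demo_walk.h5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("raw_data/temperature", data=np.zeros((2, 3)))
    f.create_dataset("raw_data/humidity", data=np.zeros(4))
    for path, shape, dtype in list_datasets(f):
        print(path, shape, dtype)
```

The paths it reports are relative to the object you started from, so they can be used directly as keys, for example `the_file["raw_data/temperature"]`.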

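Edit 2 - Why do people format their hdf files this way? It defeats the purpose.

I think you may have to talk to the person who made this file, if possible. If you made it, then you will be able to answer my questions yourself. First, are you sure that in your original example data.keys() returned "U", "T", etc.? Unless h5py is doing something magical, or you did not provide all of the h5dump output, that could not have been your output. I'll explain what the h5dump tells me, but please try to understand what I am doing rather than just copying and pasting into your terminal.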

# Get a handle to the "data" Group
data = the_file["data"]
# As you can see from the dump this data group has 3 attributes and 1 dataset
# The name of the attributes are "datafield_names","dimensions","time_variables"
# This should result in a list of those names:
print(list(data.attrs.keys()))

# The name of the dataset is "Temperature" and should be the only item in the list returned by:
print(list(data.keys()))

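As you can see from the h5dump, there are 62 datafield_names (strings), 4 dimensions (32-bit integers, I think), and 2 time_variables (64-bit floats). It also tells me that Temperature is a 3-dimensional array, 256 x 512 x 1024 (64-bit floats). Do you see where I got this information? Now comes the hard part: you need to figure out how the datafield_names match up with the Temperature array. That was decided by whoever made the file, so you will have to work out what each row/column of the Temperature array means. My first guess would be that each row in the Temperature array is one of the datafield_names, maybe with 2 more for each time? But that doesn't work, because there are too many rows in the array. Maybe the dimensions fit in there somehow? Lastly, here is how you get each of those pieces of information (continuing from before):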

# Get the temperature array ([:] reads the whole dataset; [:, :, :] is equivalent for a 3-D array)
temp_array = data["Temperature"][:]
# Get all of the datafield_names (array of 62 strings); attrs[...] already returns the values
datafields = data.attrs["datafield_names"]
# Get all of the dimensions (array of 4 integers)
dims = data.attrs["dimensions"]
# Get all of the time variables (array of 2 floats)
time_variables = data.attrs["time_variables"]

# If you want the min/max of the entire temperature array this should work:
print(temp_array.min())
print(temp_array.max())
# If you knew that row 0 of the array had the temperatures you wanted to analyze
# then this would work, but it all depends on how the creator organized the data/file:
print(temp_array[0].min())
print(temp_array[0].max())
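
The dump describes datafield_names as fixed-length, space-padded 8-character ASCII strings, so the values will most likely come back from h5py as padded bytes rather than clean Python strings. The sketch below shows one way to tidy them up. The helper name `read_field_names` is hypothetical, and the demo attribute only imitates the layout shown in the dump.

```python
import h5py
import numpy as np


def read_field_names(group, attr_name="datafield_names"):
    """Return the names stored in a string attribute as clean str values."""
    raw = group.attrs[attr_name]
    names = []
    for item in np.atleast_1d(raw):
        # Fixed-length HDF5 strings usually arrive as bytes, padded with spaces or NULs.
        text = item.decode("ascii") if isinstance(item, bytes) else str(item)
        names.append(text.strip(" \x00"))
    return names


# Demo on a throwaway in-memory file imitating the attribute from the dump.
with h5py.File("demo_names.h5", "w", driver="core", backing_store=False) as f:
    grp = f.create_group("data")
    grp.attrs["datafield_names"] = np.array([b"U       ", b"T       "], dtype="S8")
    print(read_field_names(grp))
```

Once the names are clean, matching them to positions in the Temperature array still depends on how the file's creator laid out the data.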

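I'm sorry I can't help more, but without actually having the file and knowing what each field means, this is about all I can do. Try to understand how I used h5py to read the information. Try to understand how I translated the header information (the h5dump output) into something I could actually use with h5py. If you know how the data is organized in the array, you should be able to do what you want. Good luck, and I'll help more if I can.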

score 0 · Accepted Answer

Since h5py arrays are closely related to numpy arrays, you can use the numpy.min and numpy.max functions to do this:

import numpy

maxItem = numpy.max(data['U'][:])  # Find the max of item 'U'
minItem = numpy.min(data['H'][:])  # Find the min of item 'H'

Note the ':', which is needed to convert the data to a numpy array.
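
Keep in mind that `[:]` reads the entire dataset into memory. For the 256 x 512 x 1024 array of 64-bit floats in the question, that is about 1 GiB. If that is too much, the range can be computed a slab at a time. The following is a minimal sketch, assuming the dataset is numeric, non-empty and free of NaNs; the name `range_in_slabs` and the `step` parameter are illustrative choices, not part of h5py.

```python
import h5py
import numpy as np


def range_in_slabs(dset, step=16):
    """Return (min, max) of a dataset, reading `step` slices of axis 0 at a time."""
    lo, hi = np.inf, -np.inf
    for start in range(0, dset.shape[0], step):
        slab = dset[start:start + step]  # only this slab is read into memory
        lo = min(lo, slab.min())
        hi = max(hi, slab.max())
    return lo, hi


# Demo on a small in-memory dataset standing in for the real file.
with h5py.File("demo_slabs.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("Temperature", data=np.arange(200.0).reshape(10, 4, 5))
    print(range_in_slabs(dset, step=3))
```

A larger `step` means fewer reads but more memory per read, so it is worth tuning to the machine at hand.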

score 0 · Accepted Answer

You can call min and max on a DataFrame (down the rows, so one result per column):

In [1]: df = pd.DataFrame([[1, 6], [5, 2], [4, 3]], columns=list('UT'))

In [2]: df
Out[2]: 
   U  T
0  1  6
1  5  2
2  4  3

In [3]: df.min(0)
Out[3]: 
U    1
T    2

In [4]: df.max(0)
Out[4]: 
U    5
T    6

score 0 · Accepted Answer

Do you mean data.attrs rather than data itself? If so,

import h5py

with h5py.File("myfile.h5", "w") as the_file:
    dset = the_file.create_dataset('MyDataset', (100, 100), 'i')
    dset.attrs['U'] = (0,1,2,3)
    dset.attrs['T'] = (2,3,4,5)    

with h5py.File("myfile.h5", "r") as the_file:
    data = the_file["MyDataset"]
    print({key:(min(value), max(value)) for key, value in data.attrs.items()})

yields

{u'U': (0, 3), u'T': (2, 5)}
