numpy - Fast way to fill a dataset with the same compound data value in h5py

Question

I have a large number of compound-data datasets in an HDF file. The compound type looks like this:

    numpy.dtype([('Image', h5py.special_dtype(ref=h5py.Reference)), 
                 ('NextLevel', h5py.special_dtype(ref=h5py.Reference))])

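With it I create datasets that hold, at each position, a reference to an image and a reference to another dataset. These datasets have dimensions n x n, where n is typically at least 256 but more likely > 2000. Initially I have to fill every position of each dataset with the same value: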

    [[(image.ref, dataset.ref)...(image.ref, dataset.ref)],
      .
      .
      .
     [(image.ref, dataset.ref)...(image.ref, dataset.ref)]]

I am trying to avoid filling it with two for loops such as:

    for i in xrange(0,n):
      for j in xrange(0,n):
         daset[i,j] =(image.ref, dataset.ref)

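because the performance is very poor. So I am looking for something like numpy.fill, numpy.shape, numpy.reshape, numpy.array, numpy.arange, [:] and so on. I have tried these functions in various ways, but they all seem to work only with numeric and string data types. Is there a way to fill these datasets faster than with the for loops?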

Thanks in advance.

score 0 · Accepted Answer

You can use either numpy broadcasting or a combination of numpy.repeat and numpy.reshape:

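    import h5py
    import numpy

    my_dtype = numpy.dtype([('Image', h5py.special_dtype(ref=h5py.Reference)),
                            ('NextLevel', h5py.special_dtype(ref=h5py.Reference))])
    # 0-d structured array holding the single (image, dataset) reference pair
    ref_array = numpy.array((image.ref, dataset.ref), dtype=my_dtype)
    # repeat returns a flat array of n*n records; reshape it to (n, n)
    filled = numpy.repeat(ref_array, n*n).reshape((n, n))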

Note that numpy.repeat returns a flattened array, hence the use of numpy.reshape. In this small 2x2 timing, repeat appears to be faster than plain broadcasting:

    %timeit empty_dataset=np.empty(2*2,dtype=my_dtype); empty_dataset[:]=ref_array
    100000 loops, best of 3: 9.09 us per loop

    %timeit repeat_dataset=np.repeat(ref_array, 2*2).reshape((2,2))
    100000 loops, best of 3: 5.92 us per loop
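
For comparison, the two approaches mentioned in this answer (broadcasting, and repeat followed by reshape) can be sketched side by side in plain numpy. This is only an illustration under stated assumptions: generic object fields stand in for the h5py reference fields, the two strings stand in for image.ref and dataset.ref, and the names stand_in_dtype, broadcast_grid and repeat_grid are made up for this sketch.

```python
import numpy as np

n = 4  # small grid for illustration; the question uses n >= 256

# Assumption: plain object fields stand in for the h5py reference fields,
# and the two strings stand in for image.ref and dataset.ref.
stand_in_dtype = np.dtype([("Image", object), ("NextLevel", object)])
ref_array = np.array(("image-ref", "next-level-ref"), dtype=stand_in_dtype)

# Approach 1: allocate the grid, then broadcast the single record into it.
broadcast_grid = np.empty((n, n), dtype=stand_in_dtype)
broadcast_grid[...] = ref_array

# Approach 2: repeat the record n*n times (flat), then reshape to (n, n).
repeat_grid = np.repeat(ref_array, n * n).reshape((n, n))

print(broadcast_grid.shape, repeat_grid.shape)
```

Both grids should end up with the same record in every cell. With real h5py references, the filled array would presumably be written to the dataset with one slice assignment instead of element by element; that step is not shown here.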
