
Given a large (10 GB) CSV file of mixed text and numbers, what is the fastest way to create an HDF5 file with the same contents while keeping memory usage reasonable?

I'd like to use the h5py module if possible.

In the toy example below, I've found an incredibly slow and an incredibly fast way to write data to HDF5. Would it be best practice to write to HDF5 in chunks of, say, 10,000 rows? Or is there a better way to write a massive amount of data to such a file?

import h5py

n = 10000000
f = h5py.File('foo.h5','w')
dset = f.create_dataset('int',(n,),'i')

# this is terribly slow: one HDF5 write call per element
for i in range(n):
  dset[i] = i

# instantaneous: a single bulk write
dset[...] = 42
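
The middle ground I have in mind, continuing the snippet above, would write one slab of rows per call with numpy. An untested sketch (the 10,000 block size is just a guess):

import numpy as np

# presumably fast: one HDF5 write call per 10,000-element slab
block = 10000
for start in range(0, n, block):
  stop = min(start + block, n)
  dset[start:stop] = np.arange(start, stop)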

3 Answers


I would avoid chunking the data and would store it as a series of single-array datasets (along the lines of what Benjamin suggested). I just finished loading the output of an enterprise app I've been working on into HDF5, and I was able to pack about 4.5 billion compound datatypes into 450,000 datasets, each holding an array of 10,000 values. Writes and reads now seem fairly instantaneous, but they were horribly slow when I initially tried to chunk the data.

Just a thought!

Update:

These are a couple of snippets grabbed from my actual code (I code in C rather than Python, but you should get the idea of what I'm doing), modified for clarity. I'm just writing long unsigned integers into arrays (10,000 values per array) and reading them back when I need an actual value.

This is my typical writer code. In this case, I'm simply writing a long unsigned integer sequence into a sequence of arrays and loading each array into hdf5 as it is created.

//Our dummy data: a rolling count of long unsigned integers
long unsigned int k = 0UL;
//We'll use this to store our dummy data, 10,000 at a time
long unsigned int kValues[NUMPERDATASET];
//Create the SS data file.
hid_t ssdb = H5Fcreate(SSHDF, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
//NUMPERDATASET = 10,000, so we get a 1 x 10,000 array
hsize_t dsDim[1] = {NUMPERDATASET};
//Create the data space.
hid_t dSpace = H5Screate_simple(1, dsDim, NULL);
//NUMDATASETS = MAXSSVALUE / NUMPERDATASET, where MAXSSVALUE = 4,500,000,000
for (unsigned long int i = 0UL; i < NUMDATASETS; i++){
    for (unsigned long int j = 0UL; j < NUMPERDATASET; j++){
        kValues[j] = k;
        k += 1UL;
    }
    //Create the data set.
    hid_t dssSet = H5Dcreate2(ssdb, g_strdup_printf("%lu", i), H5T_NATIVE_ULONG, dSpace, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    //Write data to the data set.
    H5Dwrite(dssSet, H5T_NATIVE_ULONG, H5S_ALL, H5S_ALL, H5P_DEFAULT, kValues);
    //Close the data set.
    H5Dclose(dssSet);
}
//Release the data space
H5Sclose(dSpace);
//Close the data files.
H5Fclose(ssdb);

And this is a slightly modified version of my reader code. There are more elegant ways of doing this (i.e., I could use hyperslabs to get the value), but this was the cleanest solution with respect to my fairly disciplined Agile/BDD development process.

unsigned long int getValueByIndex(unsigned long int nnValue){
    //NUMPERDATASET = 10,000
    unsigned long int ssValue[NUMPERDATASET];
    //MAXSSVALUE = 4,500,000,000; i takes the smaller value of MAXSSVALUE or nnValue
    //to avoid index out of range error 
    unsigned long int i = MIN(MAXSSVALUE-1,nnValue);
    //Open the data file in read-only mode.
    hid_t db = H5Fopen(_indexFilePath, H5F_ACC_RDONLY, H5P_DEFAULT);
    //Open the data set. Each dataset holds an array of 10,000 unsigned long
    //ints and is named after the integer division of i by the number per dataset.
    hid_t dSet = H5Dopen(db, g_strdup_printf("%lu", i / NUMPERDATASET), H5P_DEFAULT);
    //Read the data set array.
    H5Dread(dSet, H5T_NATIVE_ULONG, H5S_ALL, H5S_ALL, H5P_DEFAULT, ssValue);
    //Close the data set.
    H5Dclose(dSet);
    //Close the data file.
    H5Fclose(db);
    //Return the indexed value by using the modulus of i divided by the number per dataset
    return ssValue[i % NUMPERDATASET];
}

The main takeaways are the inner loop in the writer code and the integer division and mod operations that yield the index of the dataset array and the index of the desired value within that array. Let me know whether this is clear enough for you to put together something similar or better in h5py. In C this is dead simple, and it gives me significantly better read/write times than a chunked-dataset solution. Plus, since I can't use compression with my compound datasets anyway, the apparent upside of chunking is a moot point, so all my compounds are stored the same way.
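
If it helps, here is an untested h5py sketch of the same idea (the constants and the file name are placeholders, not my production values):

import h5py
import numpy as np

NUM_PER_DATASET = 10000    # values per single-array dataset
NUM_DATASETS = 450         # placeholder; mine is MAXSSVALUE / NUMPERDATASET

# Writer: one fixed-size dataset per block, named by its block index.
with h5py.File('ss.h5', 'w') as f:
    k = 0
    for i in range(NUM_DATASETS):
        f.create_dataset(str(i), data=np.arange(k, k + NUM_PER_DATASET,
                                                dtype=np.uint64))
        k += NUM_PER_DATASET

# Reader: integer division picks the dataset, modulus picks the element.
def get_value_by_index(n):
    with h5py.File('ss.h5', 'r') as f:
        return f[str(n // NUM_PER_DATASET)][n % NUM_PER_DATASET]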

Answered on 2011-04-05T21:59:30.120

Use the flexibility of numpy.loadtxt to get the data from the file into a numpy array, which in turn is perfect for initializing an hdf5 dataset.

import h5py
import numpy as np

d = np.loadtxt('data.txt')
h = h5py.File('data.hdf5', 'w')
dset = h.create_dataset('data', data=d)
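
Note that loadtxt pulls the whole file into memory at once; for a 10 GB CSV you could instead feed it blocks of lines and grow the dataset as you go. An untested sketch for numeric data (the block size and the resizable-dataset layout are my assumptions):

import itertools

import h5py
import numpy as np

block = 100000  # lines per block; tune to taste
with open('data.txt') as src, h5py.File('data.hdf5', 'w') as h:
    dset = None
    while True:
        lines = list(itertools.islice(src, block))
        if not lines:
            break
        d = np.loadtxt(lines, ndmin=2)  # loadtxt accepts a list of lines
        if dset is None:
            # the first block fixes the column count; rows stay resizable
            dset = h.create_dataset('data', data=d,
                                    maxshape=(None, d.shape[1]))
        else:
            dset.resize(dset.shape[0] + d.shape[0], axis=0)
            dset[-d.shape[0]:] = d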
Answered on 2014-06-27T09:27:15.027

I'm not sure if this is the most efficient way (and I've never used it; I'm just pulling together some tools I've used independently), but you could read the csv file into a numpy recarray using the matplotlib helper methods for csv.

You can probably find a way to read the csv file in chunks as well, to avoid loading the whole thing into memory. Then use the recarray (or slices of it) to write the whole thing (or large chunks of it) to the h5py dataset. I'm not exactly sure how h5py handles recarrays, but the documentation indicates that it should be ok.

Basically if possible, try to write big chunks of data at once instead of iterating over individual elements.

Another possibility for reading the csv file is just numpy.genfromtxt.

You can grab the columns you want using the keyword usecols, and then only read in a specified set of lines by properly setting the skip_header and skip_footer keywords.
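
To make that concrete, an untested sketch (the column names, dtypes, and one-row header are assumptions about the file layout):

import h5py
import numpy as np

# assumed layout: one header row, then name,x,y per line
rec = np.genfromtxt('data.csv', delimiter=',',
                    dtype=[('name', 'S16'), ('x', 'f8'), ('y', 'f8')],
                    usecols=(0, 1, 2), skip_header=1)

with h5py.File('data.hdf5', 'w') as h:
    # h5py stores a structured array as an HDF5 compound dataset
    h.create_dataset('data', data=rec)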

Answered on 2011-03-29T02:03:12.083