I want to apply a simple function to the datasets contained in an HDF5 file. I am using code similar to this:
import h5py
data_sums = []
with h5py.File(input_file, "r") as f:
    for (name, data) in f["group"].iteritems():
        print name
        # data_sums.append(data.sum(1))
        data[()]  # My goal is similar to the line above, but this line
                  # is enough to replicate the problem
It runs very fast at the beginning, and after a certain number of datasets (reproducible to some extent) it slows down dramatically. If I comment out the last line, it finishes almost instantly. It does not matter whether the data are stored (here, appended to a list) or not: something like data[:100] has a similar effect. The number of datasets that can be processed before the drop in performance depends on the size of the portion that is accessed at each iteration. Iterating over smaller chunks does not solve the issue.
I suppose I am filling up some memory space and that the process slows down when it is full, but I do not understand why.
How to circumvent this performance issue?
I am running Python 2.6.5 on Ubuntu 10.04.
Edit: The following code does not slow down if the second line of the loop is un-commented. It does slow down without it:
f = h5py.File(path to file, "r")
list_name = f["data"].keys()
f.close()
import numpy as np
for name in list_name:
    f = h5py.File(d.storage_path, "r")
    # name = list_name[0] # with this line the issue vanishes.
    data = f["data"][name]
    tag = get_tag(name)
    data[:, 1].sum()
    print "."
    f.close()
Edit: I found out that accessing along the first dimension of multidimensional datasets seems to run without issues. The problem occurs when higher dimensions are involved.
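For reference, here is a minimal self-contained sketch of the access pattern in question (the file layout, dataset names, and sizes are made up for illustration). It also shows the alternative I am considering: reading each dataset into memory in one contiguous read with `dset[()]` and slicing the resulting NumPy array, instead of asking h5py for a strided column slice like `dset[:, 1]` on every iteration. Both expressions should give the same result; the question is why their performance differs so much over many datasets.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical setup: a file with several small 2-D datasets under one
# group, mimicking the layout described above.
path = os.path.join(tempfile.mkdtemp(), "test.h5")
with h5py.File(path, "w") as f:
    grp = f.create_group("data")
    for i in range(20):
        grp.create_dataset("d%02d" % i, data=np.arange(200.0).reshape(50, 4))

with h5py.File(path, "r") as f:
    for name in f["data"]:
        dset = f["data"][name]
        # Column access: h5py turns this into a strided HDF5 hyperslab read.
        col_sum = dset[:, 1].sum()
        # Alternative: one contiguous read of the whole dataset, then a
        # NumPy slice entirely in memory.
        full_sum = dset[()][:, 1].sum()
        assert col_sum == full_sum
```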