I want to apply a simple function to the datasets contained in an HDF5 file. I am using code similar to this:
import h5py
data_sums = []
with h5py.File(input_file, "r") as f:
    for (name, data) in f["group"].iteritems():
        print name
        # data_sums.append(data.sum(1))
        data[()]  # My goal is similar to the line above, but this line
                  # is enough to replicate the problem
It runs very fast at the beginning, and after a certain number of datasets (reproducible to some extent) it slows down dramatically. If I comment out the last line, it finishes almost instantly. It does not matter whether the data are stored (here, appended to a list) or not: something like data[:100] has a similar effect. The number of datasets that can be processed before the drop in performance depends on the size of the portion that is accessed at each iteration. Iterating over smaller chunks does not solve the issue.
I suppose I am filling up some memory space and that the process slows down when it is full, but I do not understand why.
How to circumvent this performance issue?
I am running Python 2.6.5 on Ubuntu 10.04.
Edit: The following code does not slow down if the second line of the loop is un-commented. It does slow down without it:
f = h5py.File(path to file, "r")
list_name = f["data"].keys()
f.close()
import numpy as np
for name in list_name:
    f = h5py.File(d.storage_path, "r")
    # name = list_name[0] # with this line the issue vanishes.
    data = f["data"][name]
    tag = get_tag(name)
    data[:, 1].sum()
    print "."
    f.close()
Edit: I found out that accessing along the first dimension of multidimensional datasets seems to run without issues. The problem occurs when higher dimensions are involved.
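For reference, here is a minimal self-contained sketch of the access pattern in question (the file layout, dataset names, and sizes are made up for illustration). It also shows the alternative I am considering: reading each dataset into memory in one contiguous read with `dset[()]` and slicing the resulting NumPy array, instead of asking h5py for a strided column slice like `dset[:, 1]` on every iteration. Both expressions should give the same result; the question is why their performance differs so much over many datasets.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical setup: a file with several small 2-D datasets under one
# group, mimicking the layout described above.
path = os.path.join(tempfile.mkdtemp(), "test.h5")
with h5py.File(path, "w") as f:
    grp = f.create_group("data")
    for i in range(20):
        grp.create_dataset("d%02d" % i, data=np.arange(200.0).reshape(50, 4))

with h5py.File(path, "r") as f:
    for name in f["data"]:
        dset = f["data"][name]
        # Column access: h5py turns this into a strided HDF5 hyperslab read.
        col_sum = dset[:, 1].sum()
        # Alternative: one contiguous read of the whole dataset, then a
        # NumPy slice entirely in memory.
        full_sum = dset[()][:, 1].sum()
        assert col_sum == full_sum
```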