
When selecting from an hdf5 file in chunks, I would like to know how many chunks there are in the resulting selection.

The number of rows in the input data, nrows, can be up to 100 million and chunksize is 100k, but for most selections the number of rows in a chunk, nrows_chunk, is smaller, so depending on the where condition a selection can consist of one chunk or many. Before operating on the chunks, at the time of calling iteratorGenerator(), I would like to know how many chunks there will be. Intuitively, I want something like len(list(enumerate(iteratorGenerator()))), but this gives a length of 1 (I suppose because iteratorGenerator() considers only one chunk at a time).
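For comparison, counting the items of an ordinary Python generator by exhausting it does return the total, at the cost of producing every chunk once. Below is a minimal pure-Python sketch; iter_chunks is a hypothetical stand-in for the store iterator, not a pandas function, and whether the HDFStore iterator behaves the same way is part of what is being asked.

```python
def iter_chunks(rows, chunksize):
    """Yield successive slices of `rows` with at most `chunksize` items."""
    for start in range(0, len(rows), chunksize):
        yield rows[start:start + chunksize]


rows = list(range(250))

# Exhaust the generator once and count the chunks it produced.
n_chunks = sum(1 for _ in iter_chunks(rows, 100))
print(n_chunks)  # 3
```

The count is only known after the whole generator has run, so the chunks must be generated a second time to process them.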

I suspected there is no solution, since the whole idea of using a generator is not to perform all selections at once but to do them chunk by chunk. However, when I run the for loop below, the very first iteration takes a really long time while the following iterations take just a few seconds, which suggests that most of the information about the chunks is collected on the first iteration. This puzzles me, and I would appreciate any explanation of how selection by chunks works.

iteratorGenerator = lambda: inputStore.select(
    groupInInputStore,
    where=where,
    columns=columns,
    iterator=True,
    chunksize=args.chunksize,
)

nrows = inputStore.get_storer(groupInInputStore).nrows

# if there is more than one chunk in the selection:
for i, chunk in enumerate(iteratorGenerator()):
    # check the size of a chunk 
    nrows_chunk = len(chunk)
    # do stuff with chunks, mainly groupby operations

# if there is only one chunk do other stuff 
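
One possible way to get the count up front is to ask the store for the row coordinates that match the condition, using HDFStore.select_as_coordinates, and divide by the chunk size. This is a sketch under the unconfirmed assumption that chunksize counts rows remaining after the where filter. count_chunks is a hypothetical helper name, and FakeStore only stands in for a real store so the example runs without an HDF5 file.

```python
import math


def count_chunks(store, key, where, chunksize):
    """Estimate the number of chunks a chunked select would yield.

    Assumption: chunksize counts rows that survive the `where` filter.
    """
    # Row positions in the table that match `where`.
    coords = store.select_as_coordinates(key, where=where)
    return math.ceil(len(coords) / chunksize)


class FakeStore:
    """Stand-in for pandas.HDFStore so the sketch runs without a file."""

    def __init__(self, n_matching):
        self.n_matching = n_matching

    def select_as_coordinates(self, key, where=None):
        return list(range(self.n_matching))


print(count_chunks(FakeStore(250_000), "data", "x > 0", 100_000))  # 3
print(count_chunks(FakeStore(0), "data", "x > 0", 100_000))  # 0
```

With the real store the call would be count_chunks(inputStore, groupInInputStore, where, args.chunksize). The pandas documentation also shows passing such coordinates back as the where argument of select, which may avoid evaluating the condition twice.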

Moreover, I am not sure what chunksize in HDFStore.select refers to. In my experience, it is the maximum size of a selected chunk after the where condition has been applied. On the other hand, http://pandas.pydata.org/pandas-docs/stable/generated/pandas.HDFStore.select.html defines chunksize as "nrows to include in iteration", which to me sounds like the number of rows to read from the file. Which is correct?
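
The two readings of chunksize can be told apart by the chunk lengths they produce. The pure-Python sketch below simulates both on a plain list; the function names are hypothetical and nothing here calls pandas, so it illustrates the difference without settling which one HDFStore.select implements.

```python
def read_then_filter(rows, predicate, chunksize):
    """Reading A: chunksize counts rows read; the filter runs per chunk."""
    for start in range(0, len(rows), chunksize):
        yield [r for r in rows[start:start + chunksize] if predicate(r)]


def filter_then_chunk(rows, predicate, chunksize):
    """Reading B: chunksize counts rows kept after the filter."""
    kept = [r for r in rows if predicate(r)]
    for start in range(0, len(kept), chunksize):
        yield kept[start:start + chunksize]


def is_even(r):
    return r % 2 == 0


rows = list(range(20))

sizes_a = [len(c) for c in read_then_filter(rows, is_even, 8)]
sizes_b = [len(c) for c in filter_then_chunk(rows, is_even, 8)]
print(sizes_a)  # [4, 4, 2]
print(sizes_b)  # [8, 2]
```

Under reading A the chunk lengths vary with how many rows pass the filter in each block; under reading B every chunk except the last has exactly chunksize rows.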
