I am currently working with a fairly large dataset that barely fits into memory, so I use np.memmap. At some point I have to split the dataset into training and test sets. I ran into the following case when indexing an np.memmap with an index array (code and memory allocations below):

Line #    Mem usage    Increment   Line Contents
================================================
 7    29.340 MB     0.000 MB   def my_func2():
 8    29.340 MB     0.000 MB       ARR_SIZE = (1221508/4,430)
 9    29.379 MB     0.039 MB       big_mmap = np.memmap('big_mem_test.mmap',shape=ARR_SIZE, dtype=np.float64, mode='r')    
10    38.836 MB     9.457 MB       idx = range(ARR_SIZE[0])
11  2042.605 MB  2003.770 MB       sub = big_mmap[idx,:]
12  3046.766 MB  1004.160 MB       sub2 = big_mmap[idx,:]
13  3046.766 MB     0.000 MB       return  type(sub)

If I want a contiguous slice instead, I would use this code:

Line #    Mem usage    Increment   Line Contents
================================================
15    29.336 MB     0.000 MB   def my_func3():
16    29.336 MB     0.000 MB       ARR_SIZE = (1221508/4,430)
17    29.375 MB     0.039 MB       big_mmap = np.memmap('big_mem_test.mmap',shape=ARR_SIZE, dtype=np.float64, mode='r')    
18    29.457 MB     0.082 MB       sub = big_mmap[0:1221508/4,:]
19    29.457 MB     0.000 MB       sub2 = big_mmap[0:1221508/4,:]  

Notice that in the second example, in lines 18 and 19, there is no memory allocation and the whole operation is much faster.

In the first example there is an allocation in line 11, so the whole big_mmap matrix is read during indexing. More surprisingly, there is another allocation in line 12. Repeating such operations can easily exhaust memory.

When I split my dataset, the indices are essentially random and not contiguous, so I cannot use the big_mmap[start:end,:] notation.

My questions are:

Is there any other method that allows me to index a memmap without reading the whole dataset into memory?

Why is the whole matrix read into memory when indexing with an index list (first example)?

Why is the data read and allocated again (first example, line 12)?

1 Answer

The double-allocation you are seeing in your first example isn't due to memmap behaviour; rather, it is due to how __getitem__ is implemented for numpy's ndarray class. When an ndarray is indexed using a list (as in your first example), data are copied from the source array. When it is indexed using a slice object, a view is created into the source array (no data are copied). For example:

In [2]: x = np.arange(16).reshape((4,4))

In [3]: x
Out[3]: 
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [4]: y = x[[0, 2], :]

In [5]: y[:, :] = 100

In [6]: x
Out[6]: 
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

y is a copy of data from x so changing y had no effect on x. Now index the array via slicing:

In [7]: z = x[::2, :]

In [8]: z[:, :] = 100

In [9]: x
Out[9]: 
array([[100, 100, 100, 100],
       [  4,   5,   6,   7],
       [100, 100, 100, 100],
       [ 12,  13,  14,  15]])
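
If you want to check this without modifying any data, a minimal sketch along the following lines should work. It assumes a NumPy version that provides np.shares_memory; the variable names simply mirror the session above.

```python
import numpy as np

x = np.arange(16).reshape((4, 4))

y = x[[0, 2], :]   # indexed with a list: data are copied
z = x[::2, :]      # indexed with a slice: a view into x

# np.shares_memory reports whether two arrays overlap in memory
print(np.shares_memory(x, y))  # False
print(np.shares_memory(x, z))  # True
```

The same distinction applies to a memmap: every list-indexed selection allocates a fresh in-memory array, which is why line 12 in your first example allocates again.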

Regarding your first question, I'm not aware of a method that will let you take arbitrary (index-array) selections covering the entire array without reading the entire array into memory. Two options you might consider (in addition to something like HDF5/PyTables, which you already discussed):

  1. If you are accessing elements of your training & test sets sequentially (rather than operating on them as two entire arrays), you could easily write a small wrapper class whose __getitem__ method uses your index arrays to pull the appropriate sample from the memmap (i.e., training[i] returns big_mmap[training_ids[i]])

  2. Split your array into two separate files, which contain exclusively training or test values. Then you could use two separate memmap objects.
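
A rough sketch of both options follows. The names IndexedSubset, write_subset and chunk_rows are made up for illustration (they are not part of NumPy), and the code assumes the index array itself fits comfortably in memory and is non-empty.

```python
import numpy as np


class IndexedSubset:
    """Option 1 (hypothetical helper): lazy access to selected rows.

    Nothing is read from the source until an item is requested.
    """

    def __init__(self, source, row_ids):
        self.source = source                # e.g. an np.memmap
        self.row_ids = np.asarray(row_ids)  # rows belonging to this subset

    def __len__(self):
        return len(self.row_ids)

    def __getitem__(self, i):
        # i may be an int, a slice or an index array over the subset;
        # only the rows it selects are pulled from the source.
        return self.source[self.row_ids[i]]


def write_subset(source, row_ids, filename, chunk_rows=10000):
    """Option 2 (hypothetical helper): copy selected rows to a new file.

    Rows are copied chunk_rows at a time, so only one chunk is held
    in memory as a regular array at any moment.
    """
    row_ids = np.asarray(row_ids)
    out = np.memmap(filename, dtype=source.dtype, mode='w+',
                    shape=(len(row_ids),) + source.shape[1:])
    for start in range(0, len(row_ids), chunk_rows):
        chunk = row_ids[start:start + chunk_rows]
        out[start:start + len(chunk)] = source[chunk]  # copies one chunk
    out.flush()
    return out
```

With these, training = IndexedSubset(big_mmap, training_ids) gives you training[i] as described in option 1, while write_subset(big_mmap, training_ids, 'train.mmap') would produce a separate file for option 2 ('train.mmap' is just a placeholder name). A smaller chunk_rows lowers peak memory at the cost of more passes through the loop.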

Answered 2013-09-04T17:51:29.260