Currently I am working on quite a huge dataset which barely fits into my memory, so I use np.memmap. But at some point I have to split my dataset into training and test sets. I have found a case where I want to slice an np.memmap using an index array (below you can find the code and memory allocations):
Line # Mem usage Increment Line Contents
================================================
7 29.340 MB 0.000 MB def my_func2():
8 29.340 MB 0.000 MB ARR_SIZE = (1221508/4,430)
9 29.379 MB 0.039 MB big_mmap = np.memmap('big_mem_test.mmap',shape=ARR_SIZE, dtype=np.float64, mode='r')
10 38.836 MB 9.457 MB idx = range(ARR_SIZE[0])
11 2042.605 MB 2003.770 MB sub = big_mmap[idx,:]
12 3046.766 MB 1004.160 MB sub2 = big_mmap[idx,:]
13 3046.766 MB 0.000 MB return type(sub)
But if I want to take a continuous slice, I would rather use this code:
Line # Mem usage Increment Line Contents
================================================
15 29.336 MB 0.000 MB def my_func3():
16 29.336 MB 0.000 MB ARR_SIZE = (1221508/4,430)
17 29.375 MB 0.039 MB big_mmap = np.memmap('big_mem_test.mmap',shape=ARR_SIZE, dtype=np.float64, mode='r')
18 29.457 MB 0.082 MB sub = big_mmap[0:1221508/4,:]
19 29.457 MB 0.000 MB sub2 = big_mmap[0:1221508/4,:]
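For reference, here is a standalone reproduction of the two behaviors, using a much smaller array than my real one so it runs quickly. np.shares_memory confirms that fancy indexing produces an in-memory copy, while a basic slice is still backed by the file:

```python
import numpy as np

ARR_SIZE = (1000, 430)  # much smaller than my real data, just for the demo

# create a test file of the right size
tmp = np.memmap('big_mem_test.mmap', shape=ARR_SIZE, dtype=np.float64, mode='w+')
tmp.flush()
del tmp

big_mmap = np.memmap('big_mem_test.mmap', shape=ARR_SIZE, dtype=np.float64, mode='r')

idx = list(range(ARR_SIZE[0]))
sub = big_mmap[idx, :]             # fancy indexing: copies every row into RAM
view = big_mmap[0:ARR_SIZE[0], :]  # basic slice: a view, nothing is read yet

print(np.shares_memory(sub, big_mmap))   # False: sub is an in-memory copy
print(np.shares_memory(view, big_mmap))  # True: view still refers to the file
```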
Notice that in the second example, in lines 18 and 19, there is no memory allocation and the whole operation is a lot faster.
In the first example, in line 11, there is an allocation, so the whole big_mmap matrix is read into memory during slicing. But what is more surprising, in line 12 there is another allocation. Doing a few more such operations you can easily run out of memory.
When I split my dataset the indexes are rather random and not continuous, so I cannot use the big_mmap[start:end,:] notation.
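One workaround I am considering is to gather the random rows in fixed-size batches, so that at most one batch is materialized in RAM at a time instead of the whole selection. This is just a rough sketch (the iter_row_batches name and batch_size parameter are my own), demonstrated on a small in-memory array, though a memmap is indexed the same way:

```python
import numpy as np

def iter_row_batches(mmap_arr, indices, batch_size=1000):
    """Yield the requested rows in fixed-size batches so that only
    one batch (not the entire selection) is materialized in RAM."""
    indices = np.asarray(indices)
    for start in range(0, len(indices), batch_size):
        # fancy indexing on a small batch copies only batch_size rows
        yield mmap_arr[indices[start:start + batch_size], :]

# tiny demonstration on an in-memory array (a memmap works the same way)
arr = np.arange(20.0).reshape(10, 2)
random_idx = np.array([7, 2, 9, 0, 4])
batches = list(iter_row_batches(arr, random_idx, batch_size=2))
result = np.vstack(batches)
```

This still copies the selected rows eventually, but the peak extra allocation per step is one batch rather than the whole index array's worth of rows.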
My questions are:

Is there any other method which allows me to slice a memmap without reading the whole data into memory?
Why is the whole matrix read into memory when slicing with an index array (first example)?
Why is the data read and allocated again (first example, line 12)?