Problem description
I'm writing a Monte Carlo particle simulator (Brownian motion and photon emission) in python/numpy. I need to save the simulation output (>>10GB) to a file and process the data in a second step. Compatibility with both Windows and Linux is important.
The number of particles (n_particles) is 10-100. The number of time-steps (time_size) is ~10^9.
The simulation has 3 steps (the code below is for an all-in-RAM version):
1. Simulate (and store) an emission rate array (contains many almost-0 elements):
   - shape (n_particles x time_size), float32, size 80GB

2. Compute a counts array (random values from a Poisson process with previously computed rates):
   - shape (n_particles x time_size), uint8, size 20GB

   counts = np.random.poisson(lam=emission).astype(np.uint8)

3. Find timestamps (or indexes) of counts. Counts are almost always 0, so the timestamp arrays will fit in RAM.

   # Loop across the particles
   timestamps = [np.nonzero(c) for c in counts]
I do step 1 once, then repeat steps 2-3 many (~100) times. In the future I may need to pre-process emission (apply cumsum or other functions) before computing counts.
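In the current all-in-RAM version the repeated part looks roughly like this (the pre-processing line is only a placeholder for what I may add later):

import numpy as np

# emission: float32 array of shape (n_particles, time_size), computed once in step 1
for _ in range(100):
    rates = emission                                         # placeholder: e.g. np.cumsum(emission, axis=1)
    counts = np.random.poisson(lam=rates).astype(np.uint8)   # step 2
    timestamps = [np.nonzero(c) for c in counts]             # step 3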
Question
I have a working in-memory implementation and I'm trying to understand the best approach to implementing an out-of-core version that can scale to (much) longer simulations.
What I would like to exist
I need to save arrays to a file, and I would like to use a single file per simulation. I also need a "simple" way to store and recall a dictionary of simulation parameters (scalars).
Ideally, I would like a file-backed numpy array that I can preallocate and fill in chunks. Then, I would like the numpy array methods (max, cumsum, ...) to work transparently, requiring only a chunksize keyword to specify how much of the array to load at each iteration.
Even better, I would like a Numexpr that operates not between cache and RAM but between RAM and hard drive.
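The closest existing thing I know of is numpy.memmap, which gives a preallocated, file-backed array that can be filled in slices; but reductions like max or cumsum are not chunk-aware, so it covers only part of the wish. A minimal sketch (sizes are examples, compute_emission_chunk is a placeholder for the simulation step):

import numpy as np

n_particles, time_size, chunksize = 20, 10**9, 2**20   # example sizes

# preallocate a file-backed float32 array and fill it chunk by chunk
emission = np.memmap('emission.dat', dtype='float32', mode='w+',
                     shape=(n_particles, time_size))
for start in range(0, time_size, chunksize):
    stop = min(start + chunksize, time_size)
    emission[:, start:stop] = compute_emission_chunk(start, stop)  # placeholder
emission.flush()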
What are the practical options
As a first option I started experimenting with pyTables, but I'm not happy with its complexity and abstractions (so different from numpy). Moreover my current solution (read below) is UGLY and not very efficient.
So my options, for which I seek an answer, are:

1. implement a numpy array with the required functionality (how?)
2. use pytables in a smarter way (different data-structures/methods)
3. use another library: h5py, blaze, pandas... (I haven't tried any of them so far; see the untested sketch below).
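For example, if the third option turns out to be the way to go, I imagine something along these lines with h5py (an untested sketch; sizes, names and compute_emission_chunk are placeholders):

import h5py

with h5py.File('simulation.h5', 'w') as f:
    # scalar parameters stored as file attributes
    f.attrs['t_step'] = 5e-9
    f.attrs['n_particles'] = 20

    # preallocated, chunked dataset filled slice by slice
    emission = f.create_dataset('emission', shape=(20, 10**9),
                                dtype='float32', chunks=(1, 2**20))
    for start in range(0, 10**9, 2**20):
        stop = min(start + 2**20, 10**9)
        emission[:, start:stop] = compute_emission_chunk(start, stop)  # placeholder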
Tentative solution (pyTables)
I save the simulation parameters in a '/parameters' group: each parameter is converted to a numpy array scalar. A verbose solution, but it works.
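In code it is roughly this (parameter names and values here are just examples):

import numpy as np
import tables

params = dict(t_step=5e-9, t_max=1.0, n_particles=20)   # example values

data_file = tables.open_file('simulation.hdf5', mode='w')
params_group = data_file.create_group('/', 'parameters', title='Simulation parameters')
for name, value in params.items():
    # each scalar parameter is wrapped in a 0-d numpy array before saving
    data_file.create_array(params_group, name, obj=np.asarray(value))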
I save emission as an extensible array (EArray), because I generate the data in chunks and I need to append each new chunk (I know the final size, though).
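The create/append pattern I use for emission is roughly this (compute_emission_chunk is a placeholder; n_particles, time_size, chunksize and data_file are defined elsewhere):

import tables

# enlargeable along the time axis (the 0 marks the extensible dimension)
emission = data_file.create_earray('/', 'emission', atom=tables.Float32Atom(),
                                   shape=(n_particles, 0), expectedrows=time_size)
for start in range(0, time_size, chunksize):
    stop = min(start + chunksize, time_size)
    emission.append(compute_emission_chunk(start, stop))  # chunk of shape (n_particles, stop - start)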
Saving counts is more problematic. If I save it as a pytables array, it's difficult to perform queries like "counts >= 2". Therefore I saved counts as multiple tables (one per particle) [UGLY] and I query them with .get_where_list('counts >= 2'). I'm not sure this is space-efficient, and generating all these tables instead of using a single array clutters the HDF5 file significantly. Moreover, strangely enough, creating those tables requires defining a custom dtype (even for standard numpy dtypes):
# one table per particle, each with a single uint8 column named 'counts'
dt = np.dtype([('counts', 'u1')])
for ip in xrange(n_particles):
    name = "particle_%d" % ip
    data_file.create_table(
        group, name, description=dt, chunkshape=chunksize,
        expectedrows=time_size,
        title='Binned timetrace of emitted ph (bin = t_step)'
              ' - particle_%d' % ip)
Each particle-counts "table" has a different name (name = "particle_%d" % ip), and I need to put them in a python list for easy iteration.
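The iteration/query step then looks roughly like this:

# collect the per-particle tables in a list for easy iteration
counts_tables = [data_file.get_node(group, "particle_%d" % ip)
                 for ip in xrange(n_particles)]

# indexes (timestamps) where 2 or more photons were emitted in a bin
timestamps = [t.get_where_list('counts >= 2') for t in counts_tables]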
EDIT: The result of this question is a Brownian Motion simulator called PyBroMo.