
I need to generate several million arrays of random numbers in a replicable manner. The arrays will be generated on a cluster of machines connected with OpenMPI. Each 'task' requires an independent array containing several thousand random integers.

I want to be able to perform multiple runs such that any given run can be replicated. The code is currently in R, but I'm more interested in the general principles of parallel PRNG generation across multiple machines than in the OpenMPI or R specifics.

In theory I could generate all the random series on the 'master' and send each series to the relevant 'slave' along with its task, but this feels unwieldy. Instead, I'd like the random series to be generated on the slave after it receives the task.

Currently I provide a single random number seed on the command line, which the master uses to generate a series of seeds for the slaves. Each task is assigned a seed taken sequentially from this stream. This seed is sent to the slave along with the task details.

Master:

set.seed(commandLineArg)                  # seed the master RNG once from the command line
runParameters <- stuff
for (task in 1:numTasks) {
    # draw the next seed for this task from the master's single stream
    slaveSeed <- sample.int(.Machine$integer.max, 1)
    scheduleTask(slaveSeed, runParameters)
}

Slave(s):

set.seed(slaveSeed)                       # re-seed with the seed received from the master
data <- integer(numPoints)
for (n in 1:numPoints) {
    data[n] <- sample.int(.Machine$integer.max, 1)   # one random integer per point
}
doStuff(data)                             # the last expression is the slave's result
  1. Is this a safe approach? Are a million series, each generated by reading a thousand random numbers from its own seeded stream, as independent and random as a million series of a thousand numbers read sequentially from a single seeded stream? (One specific worry is seed collisions; a back-of-envelope check is sketched after this list.)

  2. Is it necessary to have the master generate a series of random slave seeds, or would it be equally effective to simply use the sequence 1..numTasks as the slave seeds? I'd rather not add a false sense of security if this step is just a charade.

  3. Is there an established best practice for reproducibly generating pseudo-random samples in this manner? I've seen references to SPRNG and to L'Ecuyer's stream-splitting approach. Do these have benefits over the method I've described? (A sketch of my understanding of the L'Ecuyer approach follows below.)
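
To make the worry in question 1 concrete: if slave seeds are drawn uniformly from m possible values, the birthday bound gives roughly n^2 / (2m) expected pairs of tasks receiving the same seed. Assuming a 31-bit seed space (typical of C's rand(), though only an assumption about any particular generator), a million tasks makes collisions near-certain:

n <- 1e6        # number of tasks
m <- 2^31       # seed space, assuming 31-bit seeds
n^2 / (2 * m)   # ~233 expected pairs of tasks sharing a seed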
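
For question 3, my possibly-wrong understanding of the L'Ecuyer approach, as exposed by R's parallel package, is sketched below; commandLineArg, numTasks, taskId and numPoints are the same placeholders as above:

library(parallel)

RNGkind("L'Ecuyer-CMRG")       # combined multiple-recursive generator with stream support
set.seed(commandLineArg)       # a single seed still reproduces the entire run

# Master: derive one stream state per task. Consecutive streams are 2^127
# draws apart, so they cannot overlap at a few thousand draws per task.
streams <- vector("list", numTasks)
streams[[1]] <- .Random.seed
for (i in 2:numTasks) {
    streams[[i]] <- nextRNGStream(streams[[i - 1]])
}
# ... send streams[[i]] to the slave along with task i ...

# Slave: install the received stream state, then draw as usual
assign(".Random.seed", streams[[taskId]], envir = .GlobalEnv)
data <- runif(numPoints)

If that is the intended usage, it would seem to remove the collision worry entirely, since the streams are non-overlapping by construction.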

Thanks!
