The style of iterating that you're using generally works well with the doMC backend because the workers can effectively share tt by the magic of fork. But with doSNOW, tt will be auto-exported to the workers, using lots of memory even though they only actually need a fraction of it. The suggestion made by @Beasterfield to iterate directly over tt resolves that issue, but it's possible to be even more memory efficient through the use of iterators and an appropriate parallel backend.
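To make the contrast concrete, here's a minimal sketch of the two styles with a doSNOW backend; the index-based loop is only a guess at the style you're using, and tt and somefun are stand-ins for your real data and function:

library(doSNOW)
cl <- makeCluster(2, type='SOCK')
registerDoSNOW(cl)
tt <- rep('some string', 1000)       # stand-in vector
somefun <- function(s) toupper(s)    # stand-in function
# index-style loop: tt is referenced in the body, so doSNOW auto-exports
# the entire vector to every worker
r1 <- foreach(i=seq_along(tt), .combine='c') %dopar% somefun(tt[i])
# iterating over tt itself: each task only carries the element it needs
r2 <- foreach(s=tt, .combine='c') %dopar% somefun(s)
stopCluster(cl)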
In cases like this, I use the isplitVector function from the itertools package. It splits a vector into a sequence of sub-vectors, allowing them to be processed in parallel without losing the benefits of vectorization. Unfortunately, with doSNOW, it will put these sub-vectors into a list in order to call the clusterApplyLB function in snow, since clusterApplyLB doesn't support iterators. However, the doMPI and doRedis backends will not do that. They will send the sub-vectors to the workers right from the iterator, using almost half as much memory.
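If you haven't used isplitVector before, here's a quick toy illustration of what the iterator yields:

library(itertools)
x <- letters[1:10]
it <- isplitVector(x, chunkSize=5)
nextElem(it)   # "a" "b" "c" "d" "e"
nextElem(it)   # "f" "g" "h" "i" "j"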
Here's a complete example using doMPI:
suppressMessages(library(doMPI))
library(itertools)
# start the MPI workers and register them with foreach
cl <- startMPIcluster()
registerDoMPI(cl)
n <- 20000000
chunkSize <- 1000000
x <- rep('some string', n)
somefun <- function(s) toupper(s)
# each task receives one sub-vector of x, so somefun operates on a whole chunk
ret <- foreach(s=isplitVector(x, chunkSize=chunkSize), .combine='c') %dopar% {
  somefun(s)
}
print(length(ret))
closeCluster(cl)
mpi.quit()
When I run this on my MacBook Pro with 4 GB of memory:
$ time mpirun -n 5 R --slave -f split.R
it takes about 16 seconds.
You have to be careful with the number of workers that you create on the same machine, although decreasing the value of chunkSize may allow you to start more.
You can decrease your memory usage even more if you're able to use an iterator that doesn't require all of the strings to be in memory at the same time. For example, if the strings are in a file named 'strings.txt', you can use s=ireadLines('strings.txt', n=chunkSize).
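Here's a rough sketch of that variant, assuming the strings sit one per line in 'strings.txt' and reusing the chunkSize and somefun from above:

suppressMessages(library(doMPI))
library(itertools)
cl <- startMPIcluster()
registerDoMPI(cl)
chunkSize <- 1000000
somefun <- function(s) toupper(s)
# ireadLines hands each task up to chunkSize lines read straight from the file,
# so the full set of strings never has to be held in memory on the master
ret <- foreach(s=ireadLines('strings.txt', n=chunkSize), .combine='c') %dopar% {
  somefun(s)
}
print(length(ret))
closeCluster(cl)
mpi.quit()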