
I have a very large vector of strings that I'd like to process in parallel using the foreach and doSNOW packages. I've noticed that foreach copies the vector to each worker process, which quickly exhausts system memory. I tried breaking the vector into smaller pieces in a list object, but I still don't see any reduction in memory usage. Does anyone have ideas on this? Below is some demo code:

library(foreach)
library(doSNOW)
library(snow)

x <- rep('some string', 200000000)
# split x into smaller pieces in a list object
# (getsplits() is a helper that returns start/end indices for each chunk)
splits <- getsplits(x, mode = 'bysize', size = 1000000)
tt <- vector('list', length(splits$start))
for (i in 1:length(tt)) tt[[i]] <- x[splits$start[i]:splits$end[i]]

ret <- foreach(i = 1:length(splits$start), .export = c('somefun'), .combine = c) %dopar%
  somefun(tt[[i]])

1 Answer


The style of iterating that you're using generally works well with the doMC backend because the workers can effectively share tt by the magic of fork. But with doSNOW, tt will be auto-exported to the workers, using lots of memory even though they only actually need a fraction of it. The suggestion made by @Beasterfield to iterate directly over tt resolves that issue, but it's possible to be even more memory efficient through the use of iterators and an appropriate parallel backend.
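For reference, that direct-iteration approach looks something like this (reusing tt and somefun from the question; since the loop variable is the chunk itself, tt as a whole is never auto-exported and each task only ships one chunk):

ret <- foreach(tti = tt, .export = 'somefun', .combine = c) %dopar% somefun(tti)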

In cases like this, I use the isplitVector function from the itertools package. It splits a vector into a sequence of sub-vectors, allowing them to be processed in parallel without losing the benefits of vectorization. Unfortunately, with doSNOW, it will put these sub-vectors into a list in order to call the clusterApplyLB function in snow since clusterApplyLB doesn't support iterators. However, the doMPI and doRedis backends will not do that. They will send the sub-vectors to the workers right from the iterator, using almost half as much memory.
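For comparison, here's a minimal sketch of that idea with the doSNOW backend the question uses (the 4-worker SOCK cluster is just an example; as noted above, doSNOW still gathers the sub-vectors into a list internally before dispatching them):

library(doSNOW)
library(itertools)
cl <- makeCluster(4, type = 'SOCK')
registerDoSNOW(cl)
somefun <- function(s) toupper(s)
# x is the large character vector from the question
ret <- foreach(s = isplitVector(x, chunkSize = 1000000), .combine = 'c') %dopar% somefun(s)
stopCluster(cl)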

Here's a complete example using doMPI:

suppressMessages(library(doMPI))
library(itertools)

# start MPI workers and register doMPI as the foreach backend
cl <- startMPIcluster()
registerDoMPI(cl)

n <- 20000000
chunkSize <- 1000000
x <- rep('some string', n)
somefun <- function(s) toupper(s)

# isplitVector yields one sub-vector of up to chunkSize elements per task
ret <- foreach(s=isplitVector(x, chunkSize=chunkSize), .combine='c') %dopar% {
  somefun(s)
}
print(length(ret))

closeCluster(cl)
mpi.quit()

When I run this on my MacBook Pro with 4 GB of memory:

$ time mpirun -n 5 R --slave -f split.R 

it takes about 16 seconds.

You have to be careful with the number of workers that you create on the same machine, although decreasing the value of chunkSize may allow you to start more.

You can decrease your memory usage even more if you're able to use an iterator that doesn't require all of the strings to be in memory at the same time. For example, if the strings are in a file named 'strings.txt', you can use s=ireadLines('strings.txt', n=chunkSize).
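Sticking with the doMPI example above, that version might look something like this (it assumes one string per line in 'strings.txt' and reuses chunkSize and somefun; ireadLines comes from the itertools package):

# read chunkSize lines from the file per task instead of holding x in memory
ret <- foreach(s = ireadLines('strings.txt', n = chunkSize), .combine = 'c') %dopar% {
  somefun(s)
}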

Answered 2013-05-05T14:41:27.163