
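I have a question about how parallelism works in R. As far as I can tell, when foreach's %dopar% is used with doMC, the main R process "seems" to be forked (actually, according to strace, it seems to be cloned). You can check what happens with strace: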

strace -ff -t -T -o log_file -p PID_of_the_R_process

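It looks like:

  • New processes are forked on every call to foreach!
  • Memory is probably not copied as long as the objects from the main process are not modified, but as soon as the child processes return something non-trivial, something happens in them that takes a lot of time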
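
The first observation can also be checked without strace. The following is a minimal sketch, assuming a Unix-like system and assuming that doMC hands its tasks to parallel::mclapply (which is why mclapply is called directly here instead of foreach). The helper name worker_pids is hypothetical. It records the process IDs of the workers in two consecutive calls.

```r
library(parallel)

# Hypothetical helper: run one trivial task per worker and
# return the PID of the process that executed each task.
worker_pids <- function(n_workers = 2L) {
  unlist(mclapply(seq_len(n_workers),
                  function(i) Sys.getpid(),
                  mc.cores = n_workers))
}

main_pid    <- Sys.getpid()
first_call  <- worker_pids()
second_call <- worker_pids()

print(main_pid)
print(first_call)
print(second_call)
```

If the worker PIDs differ between the two calls and never match the main PID, the workers were forked afresh for each call.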

Here is an example:

# create a table of roughly 2.7 GB (see the tables() output below)
library(data.table)                                                                                     
library(anytime)                                                                                        
dates <- seq.Date(from = anydate(20130101), to = anydate(20220101), by = 1L)                            
triplets <- apply(combn(x = LETTERS, m = 3), MARGIN = 2L, FUN = paste, collapse = "")                   
DT <- data.table(CJ(date = dates, id = triplets))                                                       
set.seed(1L)                                                                                            
for(i in 1:40) {                                                                                        
    DT[, sprintf("V%s", i) := rnorm(.N)]                                                                
}

tables()
#   NAME      NROW NCOL    MB                    COLS     KEY
#1:   DT 8,548,800   42 2,739 date,id,V1,V2,V3,V4,... date,id
#Total: 2,739MB

                                                                                    
library(doMC)                                                                                           
library(foreach)                                                                                        
Sys.getpid()                                                                                            
# [1] 2915                                                                                              
registerDoMC(4)                                                                                         

# test of doing "nothing"                                                                               
system.time({                       
    res <-                          
        foreach(i = 1:4) %dopar% {  
            NULL                    
        }                           
})                                  
# that's quite slow already but we should take this as a baseline
#  user  system elapsed 
#  0.000   0.458   0.493 

# doing a single operation
system.time({                                                          
    res <-                                                             
        foreach(i = 1:1) %dopar% {                                     
            DT[, lapply(.SD, FUN = sum), .SDcols = 3:10, by = sign(V1)]
        }                                                              
})                                                                     
#  user  system elapsed 
#  2.081   0.000   1.178 
                                                                                                        
system.time({                                                                                           
    res_list <-                                                                                         
        foreach(i = 1:4) %dopar% {                                                                      
            DT[, lapply(.SD, FUN = sum), .SDcols = 3:30, by = sign(V1)]                                 
        }                                                                                               
})                                                                                                      
#   user  system elapsed 
#  3.020   1.875   4.518 
                                                                                             
system.time({                                                           
    res_list <-                                                         
        foreach(i = 1:4) %dopar% {                                      
            DT[, lapply(.SD, FUN = sum), .SDcols = 3:30, by = sign(V1)] 
            DT[, test := 1L] # modification                                           
        }                                                               
})                                                                                                                          
# takes ages
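
One possible contributor to the slowness of the last call, offered as an assumption and not as a measured result: the value of a foreach body is its last expression, and data.table's := returns the modified table invisibly, so each worker may be sending the whole table back to the main process. A small sketch with a toy table (small_dt is a hypothetical stand-in for DT):

```r
library(data.table)

small_dt <- data.table(x = 1:5)

# `:=` adds the column by reference and returns the table itself (invisibly),
# so a block ending with this line evaluates to the full table.
returned <- small_dt[, test := 1L]

print(is.data.table(returned))
print(names(returned))

# Number of bytes that would have to be serialized to ship this value
# between processes; it grows with the size of the table.
print(length(serialize(returned, NULL)))
```

If that is the cause, ending the foreach block with NULL after the assignment would keep the returned value small. Either way, a modification made inside a forked worker is not visible in the main process.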

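The questions are:

  • Is there a way to avoid re-creating the processes every time?
  • What is taking several seconds when only a small amount of data is passed back from the workers to the master?
  • Is there an existing high-performance solution that uses threads instead of forks to achieve parallelism?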
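
Regarding the first question, one option that may be worth trying is a persistent cluster from the base parallel package, whose worker processes are started once and reused across calls. This is a minimal sketch, not a benchmarked recommendation. Note that these workers are separate R sessions, not forks, so they do not share the main process's memory, and a large table would have to be exported to or loaded on each worker.

```r
library(parallel)

cl <- makeCluster(2L)  # start two worker R sessions once

# Ask every worker for its PID in two separate calls.
pids_first  <- unlist(clusterEvalQ(cl, Sys.getpid()))
pids_second <- unlist(clusterEvalQ(cl, Sys.getpid()))

print(pids_first)
print(pids_second)
print(identical(pids_first, pids_second))

stopCluster(cl)  # shut the workers down when done
```

The same PIDs in both calls would indicate that the workers are reused and not re-created. With foreach, such a cluster can be registered through the doParallel package (registerDoParallel(cl)) in place of registerDoMC.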