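I have a question about R.
It seems to me that foreach::dopar, when used together with doMC, "seems" to fork (in fact, according to strace, the child processes appear to be created with clone). You can check what happens to the main R process with strace:
strace -ff -t -T -o log_file -p PID_of_the_R_process
It looks like: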
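- New processes are forked every time foreach is called!
- As long as objects from the main process are not modified, the memory is probably not copied. But as soon as a child returns something non-trivial, something happens in the child process that takes a lot of time.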
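The first observation can also be checked without strace by comparing worker PIDs across two separate parallel calls. This is only a minimal sketch: it uses parallel::mclapply instead of foreach, on the assumption that doMC delegates to it, and it needs a Unix-like system because forking is not available on Windows.

```r
library(parallel)

# Collect the worker PIDs from two separate parallel calls.
pids_first  <- unlist(mclapply(1:2, function(i) Sys.getpid(), mc.cores = 2L))
pids_second <- unlist(mclapply(1:2, function(i) Sys.getpid(), mc.cores = 2L))

# Workers are child processes, so none of them should share the main PID.
main_pid <- Sys.getpid()
print(main_pid %in% c(pids_first, pids_second))

# If workers were reused between calls, the two sets of PIDs would overlap.
print(intersect(pids_first, pids_second))
```

An empty intersection would be consistent with new processes being created on every call.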
Here is an example:
# create a 2G table
library(data.table)
library(anytime)
dates <- seq.Date(from = anydate(20130101), to = anydate(20220101), by = 1L)
triplets <- apply(combn(x = LETTERS, m = 3), MARGIN = 2L, FUN = paste, collapse = "")
DT <- data.table(CJ(date = dates, id = triplets))
set.seed(1L)
for (i in 1:40) {
  DT[, sprintf("V%s", i) := rnorm(.N)]
}
tables()
#    NAME      NROW NCOL    MB COLS                    KEY
# 1:   DT 8,548,800   42 2,739 date,id,V1,V2,V3,V4,... date,id
# Total: 2,739MB
library(doMC)
library(foreach)
Sys.getpid()
# [1] 2915
registerDoMC(4)
# test of doing "nothing"
system.time({
  res <-
    foreach(i = 1:4) %dopar% {
      NULL
    }
})
# that's quite slow already but we should take this as a baseline
#    user  system elapsed
#   0.000   0.458   0.493
# doing a single operation
system.time({
  res <-
    foreach(i = 1:1) %dopar% {
      DT[, lapply(.SD, FUN = sum), .SDcols = 3:10, by = sign(V1)]
    }
})
#    user  system elapsed
#   2.081   0.000   1.178
# a larger aggregation, run as 4 parallel tasks
system.time({
  res_list <-
    foreach(i = 1:4) %dopar% {
      DT[, lapply(.SD, FUN = sum), .SDcols = 3:30, by = sign(V1)]
    }
})
#    user  system elapsed
#   3.020   1.875   4.518
# same aggregation, but each task also modifies DT
system.time({
  res_list <-
    foreach(i = 1:4) %dopar% {
      DT[, lapply(.SD, FUN = sum), .SDcols = 3:30, by = sign(V1)]
      DT[, test := 1L] # modification
    }
})
# takes ages
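One thing that might be worth checking for this last case: in data.table, := returns the modified table invisibly, so when it is the last expression in the %dopar% body the whole table may become the task's return value and has to be serialized back to the parent. The sketch below illustrates this with a hypothetical three-row table (small is a stand-in, not the DT above) and compares serialized sizes. It is a possible diagnostic under that assumption, not a confirmed explanation of the timings.

```r
library(data.table)

# Hypothetical small table standing in for the 2.7 GB DT.
small <- data.table(a = 1:3)

# `:=` modifies the table in place and returns the table itself, invisibly.
value <- small[, test := 1L]
returns_whole_table <- identical(value, small)
print(returns_whole_table)

# Number of bytes needed to serialize each candidate return value.
bytes_table <- length(serialize(value, NULL))
bytes_null  <- length(serialize(NULL, NULL))
print(c(table = bytes_table, null = bytes_null))
```

If this is what happens, ending the worker block with an explicit NULL (or another small value) would be one way to test whether the return value is responsible for the slowdown.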
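The questions are:
- Is there a way to avoid re-creating the processes on every call?
- When a small amount of data is passed back from the workers to the master, what is happening during the several seconds it takes?
- Is there a ready-made, high-performance solution that achieves parallelism with threads instead of forking?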