r - 在 R 中使用 big.matrix 进行交叉验证的并行计算

Question

我正在尝试使用mclapply并行化交叉验证来为非常大的设计矩阵X（~10GB）和响应向量建模拟合过程y。假设X是维度n-by-p: n=1000, p=1,000,000。由于X非常庞大，它作为big.matrix对象支持，存储在磁盘中，并使用 R 包中的方法进行访问bigmemory。

4 折交叉验证的工作流程如下。

设置cv.ind长度为的索引向量n，它存储从 1 到 4 的数字序列，指示哪个观察值X属于 CV 的哪个折叠。
设置4个核心。在第 i 个核心中，为第 i 个折叠 CV 复制相应的训练和测试子矩阵。
为每个核心中的每个折叠拟合一个模型。
收集结果，返回。

交叉验证函数如下所示。

cv.ncvreg <- function(X, y, ncore, nfolds=5, seed, cv.dir = getwd(), 
                      cv.ind) {
  ## some more setup ...
  ## ...
  ## ...
  ## pass the descriptor info to each core ##
  xdesc <- describe(X)
  ## use mclapply instead of parLapply
  fold.results <- parallel::mclapply(X = 1:nfolds, FUN = cvf, XX=xdesc, y=y, 
                                     cv.dir = cv.dir, cv.ind=cv.ind, 
                                     cv.args=cv.args,
                                     mc.set.seed = seed, mc.silent = F, 
                                     mc.cores = ncore, mc.preschedule = F)

  ## return results
}

R 函数cvf在每个内核中运行。它将第 i 次折叠的训练/测试矩阵复制为两个big.matrix对象，拟合模型，计算一些统计数据并返回结果。

cvf <- function(i, XX, y, cv.dir, cv.ind, cv.args) {
  ## argument 'XX' is the descriptor for big.matrix
  # reference to the big.matrix by descriptor info
  XX <- attach.big.matrix(XX)

  cat("CV fold #", i, "\t--Copy training-- Start time: ", format(Sys.time()), "\n\n")
  ## physically copy sub big.matrix for training
  idx.train <- which(cv.ind != i) ## get row idx for i-th fold training
  deepcopy(XX, rows = idx.train, type = 'double',
           backingfile = paste0('x.cv.train_', i, '.bin'),
           descriptorfile = paste0('x.cv.train_', i, '.desc'),
           backingpath = cv.dir)
  cv.args$X <- attach.big.matrix(paste0(cv.dir, 'x.cv.train_', i, '.desc'))
  cat("CV fold #", i, "\t--Copy training-- End time: ", format(Sys.time()), "\n\n")

  cat("CV fold #", i, "\t--Copy test-- Start time: ", format(Sys.time()), "\n\n")
  ## physically copy remaining part of big.matrix for testing
  idx.test <- which(cv.ind == i) ## get row idx for i-th fold testing
  deepcopy(XX, rows = idx.test, type = 'double',
           backingfile = paste0('x.cv.test_', i, '.bin'),
           descriptorfile = paste0('x.cv.test_', i, '.desc'),
           backingpath = cv.dir)
  X2 <- attach.big.matrix(paste0(cv.dir, 'x.cv.test_', i, '.desc'))
  cat("CV fold #", i, "\t--Copy test-- End time: ", format(Sys.time()), "\n\n")

  # cv.args$X <- XX[cv.ind!=i, , drop=FALSE]
  cv.args$y <- y[cv.ind!=i]
  cv.args$warn <- FALSE

  cat("CV fold #", i, "\t--Fit ncvreg-- Start time: ", format(Sys.time()), "\n\n")
  ## call 'ncvreg' function, fit penalized regression model
  fit.i <- ncvreg(X=cv.args$X, y=cv.args$y, family=cv.args$family, 
                  penalty = cv.args$penalty,lambda = cv.args$lambda, convex = cv.args$convex)

  # fit.i <- do.call("ncvreg", cv.args)
  cat("CV fold #", i, "\t--Fit ncvreg-- End time: ", format(Sys.time()), "\n\n")

  y2 <- y[cv.ind==i]
  yhat <- matrix(predict(fit.i, X2, type="response"), length(y2))

  loss <- loss.ncvreg(y2, yhat, fit.i$family)
  pe <- if (fit.i$family=="binomial") {(yhat < 0.5) == y2} else NULL
  list(loss=loss, pe=pe, nl=length(fit.i$lambda), yhat=yhat)
}

到目前为止，当设计矩阵X不是太大时，代码工作得很好，比如n=1000, p=100,000大小约为 1GB。但是，如果p=1,000,000因此大小X变为 ~10GB，则每个内核中的模型拟合过程将永远运行！！！！！！（以下部分）：

#...
cat("CV fold #", i, "\t--Fit ncvreg-- Start time: ", format(Sys.time()), "\n\n")
## call 'ncvreg' function, fit penalized regression model
fit.i <- ncvreg(X=cv.args$X, y=cv.args$y, family=cv.args$family, 
                penalty = cv.args$penalty,lambda = cv.args$lambda, convex = cv.args$convex)
# fit.i <- do.call("ncvreg", cv.args)
cat("CV fold #", i, "\t--Fit ncvreg-- End time: ", format(Sys.time()), "\n\n")
#...

备注：

如果我在原始矩阵 (10GB) 上运行一次“ncvreg()”，大约需要 2.5 分钟。
for如果我使用loop 但 not顺序运行交叉验证mclapply，则代码运行良好，每个折叠 'ncvreg()' 的模型拟合也很好（约 2 分钟），尽管整个过程需要约 25 分钟。
起初我尝试了同样的问题“parLapply”。由于这里的原因，我切换到“mclapply” 。
每个核心中的数据复制步骤（即deepcopy部分）运行良好，大约需要 2 分钟才能将训练和测试数据集复制并备份到磁盘上。
我试图监控 CPU 使用率，下面是一张截图。如我们所见，在左图中，4 个 rsession 中的每一个都占用了 ~25% 的 CPU 使用率，而有一个进程时，kernel_task占用了 ~100% 的 CPU。随着时间的推移，kernel_task甚至会占用 150% 的 CPU。此外，CPU 历史面板（右下角）显示大部分 CPU 使用率来自系统，而不是用户，因为红色区域主导绿色区域。

我的问题：

为什么模型拟合过程在并行时需要很长时间？可能的原因是什么？
我是否在正确的轨道上并行化 big.matrix 的 CV 程序？有什么替代方法吗？

我感谢任何有助于解决我的问题的见解。提前致谢！！

r - 在 R 中使用 big.matrix 进行交叉验证的并行计算

0 回答 0

Related

Reference