r - Does R bigmemory always use a backing file?

Question

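We are trying to use the bigmemory library with foreach to parallelize our analysis. However, the as.big.matrix function always seems to use a backing file. Our workstations have enough memory. Is there a way to use bigmemory without a backing file?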

This code, x.big.desc <- describe(as.big.matrix(x)), is very slow because it writes the data to C:\ProgramData\boost_interprocess\. Somehow it is slower than saving x directly. Does as.big.matrix have slower I/O?

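This code, x.big.desc <- describe(as.big.matrix(x, backingfile = "")), is very fast. However, it also saves a copy of the data to the %TMP% directory. We think it is fast because R starts a background writing process rather than actually writing the data. (We can see the writing thread in Task Manager after the R prompt returns.)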

Is there a way to use bigmemory with RAM only, so that each worker in the foreach loop can access the data through RAM?

Thanks for your help.

score 0 · Accepted Answer

So, if you have enough RAM, just use standard R matrices. To pass only a part of the matrix to each cluster worker, use RDS files.
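
To illustrate the first point (plain matrices, no backing file at all), here is a minimal sketch that is not part of the original answer. It assumes the base `parallel` package and a hypothetical worker count of 2, and it sends in-memory column chunks straight to the workers instead of going through files. Each worker receives a serialized copy of its chunk, so this uses more RAM than a truly shared matrix would.

```r
library(parallel)

# Illustrative data: a plain in-memory R matrix (no bigmemory, no backing file)
bm <- matrix(1, 10e3, 1e3)
ncores <- 2  # assumed worker count for this sketch

# Split the column indices into contiguous groups, one per worker
idx <- splitIndices(ncol(bm), ncores)
# Materialize one sub-matrix per group; drop = FALSE keeps the matrix shape
chunks <- lapply(idx, function(j) bm[, j, drop = FALSE])

cl <- makeCluster(ncores)
# Each chunk is serialized and sent to a worker, which returns its column sums
res <- parLapply(cl, chunks, colSums)
stopCluster(cl)

# Concatenate the per-chunk results and compare with the serial computation
colsums_mem <- unlist(res)
all.equal(colsums_mem, colSums(bm))
```

Whether this is preferable to the RDS approach depends on how large the chunks are relative to the available memory.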

An example computing colSums with 3 cores:

# Functions for splitting
CutBySize <- function(m, nb) {
  int <- m / nb

  upper <- round(1:nb * int)
  lower <- c(1, upper[-nb] + 1)
  size <- c(upper[1], diff(upper))

  cbind(lower, upper, size)
}
seq2 <- function(lims) seq(lims[1], lims[2])

# The matrix
bm <- matrix(1, 10e3, 1e3)
ncores <- 3
intervals <- CutBySize(ncol(bm), ncores)
# Save each part in a different file
tmpfile <- tempfile()
for (ic in seq_len(ncores)) {
  # drop = FALSE keeps a matrix even if a part has a single column
  saveRDS(bm[, seq2(intervals[ic, ]), drop = FALSE],
          paste0(tmpfile, ic, ".rds"))
}
# Parallel computation with reading one part at the beginning
cl <- parallel::makeCluster(ncores)
doParallel::registerDoParallel(cl)
library(foreach)
colsums <- foreach(ic = seq_len(ncores), .combine = 'c') %dopar% {
  bm.part <- readRDS(paste0(tmpfile, ic, ".rds"))
  colSums(bm.part)
}
parallel::stopCluster(cl)
# Checking results
all.equal(colsums, colSums(bm))
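
To make the splitting step easier to follow, this small sketch repeats the CutBySize helper from the code above and prints the intervals it produces for 1000 columns and 3 parts. The call with the literal values 1000 and 3 is only an illustration of the same inputs used in the example.

```r
# Same helper as in the answer, repeated so this snippet runs on its own
CutBySize <- function(m, nb) {
  int <- m / nb
  upper <- round(1:nb * int)
  lower <- c(1, upper[-nb] + 1)
  size <- c(upper[1], diff(upper))
  cbind(lower, upper, size)
}

intervals <- CutBySize(1000, 3)
print(intervals)
#      lower upper size
# [1,]     1   333  333
# [2,]   334   667  334
# [3,]   668  1000  333

# The parts cover every column exactly once
stopifnot(sum(intervals[, "size"]) == 1000)
```

Each row gives the first column, the last column, and the number of columns of one part, so each worker handles a contiguous block of roughly equal size.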

You can even use rm(bm); gc() after writing the parts to disk.
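
A rough sketch of that cleanup, assuming a small stand-in matrix and hypothetical names (`parts`, `files`, `total`) that are not in the original answer. It also removes the temporary files at the end, which the original code leaves in place.

```r
# Small stand-in matrix so the sketch runs quickly
bm <- matrix(1, 100, 30)
ncores <- 3
tmpfile <- tempfile()

# Three groups of 10 column indices each (hypothetical split for this sketch)
parts <- split(seq_len(ncol(bm)), rep(seq_len(ncores), each = 10))
files <- paste0(tmpfile, seq_len(ncores), ".rds")
for (ic in seq_len(ncores)) {
  saveRDS(bm[, parts[[ic]], drop = FALSE], files[ic])
}

# The full matrix is no longer needed once the parts are on disk
rm(bm)
invisible(gc())

# The data can still be recovered from the files (here: sum of all entries)
total <- sum(vapply(files, function(f) sum(readRDS(f)), numeric(1)))

# Delete the temporary files when the computation is finished
unlink(files)
```

Freeing the matrix before starting the workers keeps the main R session from holding a full copy while the workers each load their own part.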
