
Question:

I am trying to run a correlation test on a large dataset: the data.table fits in memory, but operating on it with Hmisc::rcorr() or corrr::correlate() eventually hits the memory limit.

> Error: cannot allocate vector of size 1.1 Gb

So I switched to the file-backed disk.frame package to work around this, but I still hit the memory limit.

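Any suggestions on how to achieve this with disk.frame, or with another package for handling larger-than-memory data, would be much appreciated.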

Both rcorr() and correlate() operate on the whole dataset. The dataset contains NA values, so I need these functions because they allow the use of "pairwise.complete.obs".

Attempt:

# Packages ----
library(corrr)
library(Hmisc)
library(disk.frame)
library(data.table)


# Initialise parallel processing backend
setup_disk.frame()

# Enable large datasets to be transferred between sessions
options(future.globals.maxSize = Inf)


# test_DT is a data.table of ~18000 columns and ~800 rows
# of type `num` (`double`) 


# Create filebacked disk.frame ----
test_DT_df <- as.disk.frame(
  test_DT, 
  outdir = file.path(tempdir(), "test_tmp.df"),
  nchunks = recommend_nchunks(test_DT, conservatism = 4),
  overwrite = TRUE
)


# `Hmisc` correlation test by chunks ----
# DOES NOT WORK (memory limit issue)
test_cor <- write_disk.frame(
  cmap(
    .x = test_DT_df,
    .f = function(.x) {
      Hmisc::rcorr(
        x = as.matrix(.x),
        type = "pearson"
      )
    }
  ),
  overwrite = TRUE
)
# Bring into R (above code fails before this line is reached)
test_cor_collect <- disk.frame::collect(test_cor)


# `corrr` correlation test by chunks ----
# DOES NOT WORK (memory limit issue)
test_cor <- write_disk.frame(
  cmap(
    .x = test_DT_df,
    .f = function(.x) {
      corrr::correlate(
        x = .x,
        use = "pairwise.complete.obs",
        method = "pearson"
      )
    }
  ),
  overwrite = TRUE
)
# Bring into R (above code fails before this line is reached)
test_cor_collect <- disk.frame::collect(test_cor)


# Cleanup ----
delete(test_DT_df)
delete(test_cor)
rm(test_DT_df, test_cor, test_cor_collect)
gc()

1 Answer


An answer to expand on my comment, "You can then loop over all pairs of variables and store the results in an on-disk matrix.":

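# 4 x 4 file-backed matrix, one row/column per numeric iris variable
res <- bigstatsr::FBM(4, 4)
for (j in seq_len(4)) {
  for (i in seq_len(j - 1)) {
    # Correlation test for a single pair of variables
    corr <- Hmisc::rcorr(iris[[j]], iris[[i]])
    # Store the coefficient in both triangles of the on-disk matrix
    res[i, j] <- res[j, i] <- corr$r[1, 2]
  }
  res[j, j] <- 1
}
# Read the whole matrix back into memory
res[]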
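
With ~18000 columns, a pair-by-pair loop makes a very large number of calls, so one possible variation is to work on blocks of columns instead. The following is only a sketch under assumptions: it swaps `Hmisc::rcorr()` for base R's `stats::cor()` with `use = "pairwise.complete.obs"` (so it yields coefficients only, no p-values or pair counts), `X` is a small made-up matrix standing in for the real data, and `block_size` is a hypothetical tuning value.

```r
library(bigstatsr)

# Hypothetical stand-in for the real data: 50 rows x 12 numeric columns with NAs
set.seed(1)
X <- matrix(rnorm(50 * 12), nrow = 50, ncol = 12)
X[sample(length(X), 40)] <- NA

p <- ncol(X)
block_size <- 5  # assumed value: number of columns handled per step

# File-backed p x p matrix, so the full result does not have to sit in RAM
res <- FBM(p, p)

# Column index blocks, here 1:5, 6:10, 11:12
blocks <- split(seq_len(p), ceiling(seq_len(p) / block_size))

for (a in seq_along(blocks)) {
  for (b in seq_len(a)) {
    ia <- blocks[[a]]
    ib <- blocks[[b]]
    # Pearson correlations between two column blocks, dropping NAs pairwise
    r <- cor(X[, ia, drop = FALSE], X[, ib, drop = FALSE],
             use = "pairwise.complete.obs", method = "pearson")
    # Fill both triangles of the on-disk matrix
    res[ia, ib] <- r
    res[ib, ia] <- t(r)
  }
}

# Sanity check against the in-memory result (only feasible on small data)
all.equal(res[], cor(X, use = "pairwise.complete.obs"), check.attributes = FALSE)
```

Only one block-by-block slice of the result is held in memory at a time, so a smaller `block_size` trades speed for a lower peak memory footprint. If p-values are needed as well, they would have to be computed per block and stored in a second on-disk matrix.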
answered 2021-11-14T21:31:31.553