Question:

I am trying to run correlation tests on a large dataset. The data.table fits in memory, but both Hmisc::rcorr() and corrr::correlate() eventually hit the memory limit when operating on it:

> Error: cannot allocate vector of size 1.1 Gb
So I switched to the file-backed disk.frame package to work around this, but I still hit the memory limit. Any advice on how to achieve this with disk.frame, or with another package designed for larger-than-memory data, would be much appreciated.

Both rcorr() and correlate() operate on the entire dataset. The dataset contains NA values, so I need these functions because they support "pairwise.complete.obs".
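For context, the base-R equivalent of the pairwise NA handling I need looks like this on a toy matrix (toy data, not from the real test_DT):

```r
# With NAs present, "pairwise.complete.obs" computes each correlation
# from only the rows that are complete for that specific column pair.
m <- cbind(
  a = c(1, 2, 3, 4, NA),
  b = c(2, 4, 6, NA, 10),
  c = c(5, 3, NA, 1, 0)
)
# Base-R equivalent of what rcorr()/correlate() provide:
cor(m, use = "pairwise.complete.obs", method = "pearson")
```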
Attempt:
# Packages ----
library(corrr)
library(Hmisc)
library(disk.frame)
library(data.table)
# Initialise parallel processing backend
setup_disk.frame()
# Enable large datasets to be transferred between sessions
options(future.globals.maxSize = Inf)
# test_DT is a data.table of ~18000 columns and ~800 rows
# of type `num` (`double`)
# Create filebacked disk.frame ----
test_DT_df <- as.disk.frame(
  test_DT,
  outdir = file.path(tempdir(), "test_tmp.df"),
  nchunks = recommend_nchunks(test_DT, conservatism = 4),
  overwrite = TRUE
)
# `Hmisc` correlation test by chunks ----
# DOES NOT WORK (memory limit issue)
test_cor <- write_disk.frame(
  cmap(
    .x = test_DT_df,
    .f = function(.x) {
      Hmisc::rcorr(
        x = as.matrix(.x),
        type = "pearson"
      )
    }
  ),
  overwrite = TRUE
)
# Bring into R (above code fails before this line is reached)
test_cor_collect <- disk.frame::collect(test_cor)
# `corrr` correlation test by chunks ----
# DOES NOT WORK (memory limit issue)
test_cor <- write_disk.frame(
  cmap(
    .x = test_DT_df,
    .f = function(.x) {
      corrr::correlate(
        x = .x,
        use = "pairwise.complete.obs",
        method = "pearson"
      )
    }
  ),
  overwrite = TRUE
)
# Bring into R (above code fails before this line is reached)
test_cor_collect <- disk.frame::collect(test_cor)
# Cleanup ----
delete(test_DT_df)
delete(test_cor)
rm(test_DT_df, test_cor, test_cor_collect)
gc()
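One thing I have noticed: since disk.frame chunks by rows, a per-chunk rcorr() would only correlate columns within each chunk's rows, so even without the memory error the chunk results would not combine into the full pairwise matrix. An alternative I am considering is filling the correlation matrix in column blocks with base cor(), which also supports "pairwise.complete.obs". A rough sketch (block_size and the loop structure are my own illustrative choices, not tested on the full ~18000-column data):

```r
# Sketch: fill the correlation matrix in column blocks so that only
# one block-pair of correlations is computed at a time. Assumes the
# input fits in memory as a numeric matrix (as test_DT does).
blockwise_cor <- function(mat, block_size = 1000) {
  p <- ncol(mat)
  out <- matrix(NA_real_, p, p,
                dimnames = list(colnames(mat), colnames(mat)))
  starts <- seq(1, p, by = block_size)
  for (i in starts) {
    ii <- i:min(i + block_size - 1, p)
    for (j in starts) {
      if (j < i) next  # symmetry: compute the upper triangle only
      jj <- j:min(j + block_size - 1, p)
      r <- cor(mat[, ii, drop = FALSE], mat[, jj, drop = FALSE],
               use = "pairwise.complete.obs", method = "pearson")
      out[ii, jj] <- r
      out[jj, ii] <- t(r)
    }
  }
  out
}
```

Note that for ~18000 columns the full result is still an 18000 x 18000 double matrix (~2.6 GB), so each block might need to be written to disk (e.g. with saveRDS()) rather than assembled in memory.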