r - Constructing a model.matrix in R that doesn't fit in memory (tried all the memory-mapping packages)

Question

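I'm trying to estimate an lm() fit in R for a large sales dataset. The data itself isn't too big for R to handle; it takes about 250MB in memory. The problem is that when lm() is called with all variables and cross-terms included, the construction of the model.matrix() throws an error saying the machine has run out of memory and cannot allocate a vector of the required size (in this case about 47GB). Understandably, I don't have that much RAM. I've tried the ff, bigmemory, and filehash packages, all of which work fine for handling existing files outside of memory (I particularly like the database functionality of filehash). But I cannot, for the life of me, get the model.matrix to be created at all. I think the issue is that, despite mapping the output to the database I created, R still tries to build it in RAM and can't. Is there a way to avoid this using these packages, or am I doing something wrong? [Also, using biglm and other functions to do things chunk-wise doesn't even let me process one chunk at a time. Again, it seems R is trying to build the whole model.matrix first, before chunking it.]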

Any help would be greatly appreciated!

library(filehash)
library(ff)
library(ffbase)
library(bigmemory)
library(biganalytics)
library(dummies)
library(biglm)
library(dplyr)
library(lubridate)
library(data.table)



SID <- readRDS('C:\\JDA\\SID.rds')
SID <- as.data.frame(unclass(SID)) # to get characters as Factors

dbCreate('reg.db')
db <- dbInit('reg.db')
dbInsert(db, 'SID', SID)
rm(SID)
gc()

db$summary1 <-
  db$SID %>%
  group_by(District, Liable, TPN, mktYear, Month) %>%
  summarize(NV.sum = sum(NV))

start.time <- Sys.time()
# Here is where it throws the error:
db$fit <- lm(NV.sum ~ .^2, data = db$summary1)
Sys.time() - start.time
rm(start.time)
gc()

summary(db$fit)  # the fit was stored in the database, not in a local variable
anova(db$fit)

score 0 · Accepted Answer

Here is an example based on the help page for solve-methods in the Matrix package:

> library(Matrix)
> ?`solve-methods`
> n1 <- 7; n2 <- 3
> dd <- data.frame(a = gl(n1, n2), b = gl(n2, 1, n1 * n2))  # balanced 2-way
> X <- sparse.model.matrix(~ -1 + a + b, dd)  # no intercept --> even sparser
> Y <- rnorm(nrow(X))
> # Forming normal equations manually and solving for beta-hat
> solve(crossprod(X), crossprod(X, Y))
9 x 1 Matrix of class "dgeMatrix"
            [,1]
 [1,]  1.2384385
 [2,]  1.3313779
 [3,]  0.7497135
 [4,]  0.7840841
 [5,]  0.9586135
 [6,]  0.4667769
 [7,]  1.6648260
 [8,] -1.6669776
 [9,] -1.1142240
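
The same idea can be carried over to a model with all two-way interactions, which is the shape of the NV.sum ~ .^2 fit in the question. The following is only a minimal sketch under stated assumptions: the data frame d, its columns a, b, c and y, and all sizes are hypothetical stand-ins for the real sales data, and the design is fully crossed so that crossprod(X) is invertible. It cross-checks the sparse solution against lm() on data small enough for the dense fit to succeed.

```r
library(Matrix)

set.seed(42)  # reproducible toy data

# Hypothetical stand-in for the summarised sales table: a fully crossed design
# with two replicates, so every interaction cell is observed.
d <- expand.grid(a = gl(10, 1), b = gl(8, 1), c = gl(6, 1), rep = 1:2)
d$rep <- NULL
d$y <- rnorm(nrow(d))

# Main effects plus all two-way interactions.
f <- y ~ (a + b + c)^2

# Sparse design matrix: only the nonzero entries are stored.
X <- sparse.model.matrix(f, d)

# Normal equations X'X beta = X'y, solved with the Matrix package's methods.
beta_sparse <- solve(crossprod(X), crossprod(X, d$y))

# Cross-check against lm(), which builds the dense model matrix internally.
beta_lm <- coef(lm(f, data = d))
max_abs_diff <- max(abs(as.vector(beta_sparse) - unname(beta_lm)))
print(max_abs_diff)

# Compare the memory footprint of the sparse and dense design matrices.
print(object.size(X), units = "Kb")
print(object.size(model.matrix(f, d)), units = "Kb")
```

On real data some interaction cells are likely to be empty, in which case crossprod(X) is singular and solve() will fail. lm() copes with that by dropping aliased columns, so the sparse route would first need those columns removed or some other way of handling the rank deficiency.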
