r - Growing a data.frame in a memory-efficient manner

Question

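According to Creating an R dataframe row-by-row, appending to a data.frame using rbind is not ideal, because it creates a copy of the whole data.frame each time. How do I accumulate data in R so that I end up with a data.frame without incurring this penalty? The intermediate format doesn't need to be a data.frame.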

score 45 · Accepted Answer

First approach

I tried assigning to each element of a pre-allocated data.frame:

res <- data.frame(x=rep(NA,1000), y=rep(NA,1000))
tracemem(res)
for(i in 1:1000) {
  res[i,"x"] <- runif(1)
  res[i,"y"] <- rnorm(1)
}

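But tracemem goes crazy (i.e. the data.frame is copied to a new address on every assignment).

Alternative approach (doesn't work either)

One approach (not sure whether it's faster, as I haven't benchmarked it yet) is to create a list of data.frames and then stack them all together: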

makeRow <- function() data.frame(x=runif(1),y=rnorm(1))
res <- replicate(1000, makeRow(), simplify=FALSE ) # returns a list of data.frames
library(taRifx)
res.df <- stack(res)

Unfortunately, when creating the list I think you will be hard-pressed to pre-allocate. For instance:

> tracemem(res)
[1] "<0x79b98b0>"
> res[[2]] <- data.frame()
tracemem[0x79b98b0 -> 0x71da500]:

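In other words, replacing an element of the list causes the list to be copied. I assume it is the whole list, but it may be only that one element. I'm not intimately familiar with the details of R's memory management.

Probably the best approach

As with many speed- or memory-limited processes these days, the best approach may well be to use data.table instead of a data.frame. Since data.table has the := assign-by-reference operator, it can update without re-copying: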

library(data.table)
dt <- data.table(x=rep(0,1000), y=rep(0,1000))
tracemem(dt)
for(i in 1:1000) {
  dt[i,x := runif(1)]
  dt[i,y := rnorm(1)]
}
# note no message from tracemem

But as @MatthewDowle points out, set() is the appropriate way to do this inside a loop. Doing so makes it faster still:

library(data.table)
n <- 10^6
dt <- data.table(x=rep(0,n), y=rep(0,n))

dt.colon <- function(dt) {
  for(i in 1:n) {
    dt[i,x := runif(1)]
    dt[i,y := rnorm(1)]
  }
}

dt.set <- function(dt) {
  for(i in 1:n) {
    set(dt,i,1L, runif(1) )
    set(dt,i,2L, rnorm(1) )
  }
}

library(microbenchmark)
m <- microbenchmark(dt.colon(dt), dt.set(dt),times=2)

(Results shown below)

Benchmarking

With the loop run 10,000 times, data.table is almost a full order of magnitude faster:

Unit: seconds
          expr        min         lq     median         uq        max
1    test.df()  523.49057  523.49057  524.52408  525.55759  525.55759
2    test.dt()   62.06398   62.06398   62.98622   63.90845   63.90845
3 test.stack() 1196.30135 1196.30135 1258.79879 1321.29622 1321.29622

And the comparison of := with set():

> m
Unit: milliseconds
          expr       min        lq    median       uq      max
1 dt.colon(dt) 654.54996 654.54996 656.43429 658.3186 658.3186
2   dt.set(dt)  13.29612  13.29612  15.02891  16.7617  16.7617

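Note that n here is 10^6, not 10^5 as in the benchmarks above. So there is an order of magnitude more work, and the result is measured in milliseconds rather than seconds. Impressive indeed.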
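The claim that set() updates in place can also be checked without tracemem. The following is a minimal sketch, assuming the data.table package is installed; it compares the object's address before and after the loop using data.table's address() helper. The variable names addr_before and addr_after are illustrative only.

```r
library(data.table)

n <- 1000L
dt <- data.table(x = rep(0, n), y = rep(0, n))

# Record where the data.table lives before any update.
addr_before <- address(dt)

for (i in seq_len(n)) {
  # set(x, i, j, value): assign `value` to row i of column j, by reference.
  set(dt, i = i, j = "x", value = runif(1))
  set(dt, i = i, j = "y", value = rnorm(1))
}

# If nothing was copied, the object is still at the same address.
addr_after <- address(dt)
same_object <- identical(addr_before, addr_after)
```

If same_object is TRUE, the loop modified dt itself rather than a copy.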

score 9 · Accepted Answer

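You could also start with an empty list object and fill its elements with data frames, then collect the results at the end with sapply or similar. An example can be found here. This will not incur the penalty of growing an object.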
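A minimal sketch of this idea, in base R only. It assumes the number of rows n is known in advance so the list can be pre-allocated, and it combines the pieces with do.call(rbind, ...) rather than sapply; both choices are assumptions of this sketch, not part of the answer.

```r
n <- 1000

# Pre-allocate a list with one slot per future row.
rows <- vector("list", n)

for (i in seq_len(n)) {
  # Each slot holds a one-row data.frame; the data.frame being built
  # is never copied, because it does not exist yet.
  rows[[i]] <- data.frame(x = runif(1), y = rnorm(1))
}

# Bind all pieces together in a single step at the end.
res <- do.call(rbind, rows)
```

The only large allocation happens once, in the final rbind call, instead of once per appended row.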

score 7 · Accepted Answer

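Well, I am very surprised that nobody has mentioned converting to a matrix yet...

Compared with the dt.colon and dt.set functions defined by Ari B. Friedman, converting to a matrix has the best running time (faster even than dt.set in the results below). All assignments inside a matrix are done by reference, so no unnecessary memory copies are made in this code.

Code: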

library(data.table)
n <- 10^4
dt <- data.table(x=rep(0,n), y=rep(0,n))

use.matrix <- function(dt) {
  mat = as.matrix(dt)  # converting to matrix
  for(i in 1:n) {
    mat[i,1] = runif(1)
    mat[i,2] = rnorm(1)
  }
  return(as.data.frame(mat))  # converting back to a data.frame
}


dt.colon <- function(dt) { # same as Ari's function
  for(i in 1:n) {
    dt[i,x := runif(1)]
    dt[i,y := rnorm(1)]
  }
}

dt.set <- function(dt) { # same as Ari's function
  for(i in 1:n) {
    set(dt,i,1L, runif(1) )
    set(dt,i,2L, rnorm(1) )
  }
}

library(microbenchmark)
microbenchmark(dt.colon(dt), dt.set(dt), use.matrix(dt),times=10)

Results:

Unit: milliseconds
           expr        min         lq     median         uq        max neval
   dt.colon(dt) 7107.68494 7193.54792 7262.76720 7277.24841 7472.41726    10
     dt.set(dt)   93.25954   94.10291   95.07181   97.09725   99.18583    10
 use.matrix(dt)   48.15595   51.71100   52.39375   54.59252   55.04192    10

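Pros of using a matrix:

this is the fastest method so far
you don't have to learn/use data.table objects

Con of using a matrix:

you can only handle one data type in a matrix (in particular, if the columns of your data.frame have mixed types, they will all be converted to character by the line: mat = as.matrix(dt) # converting to matrix)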
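The type coercion mentioned above can be seen in a small base R sketch. The data frames mixed and numeric_only are made up for illustration.

```r
# A data.frame with an integer column and a character column.
mixed <- data.frame(id = 1:3,
                    label = c("a", "b", "c"),
                    stringsAsFactors = FALSE)

# Converting to a matrix forces a single type for all cells.
mat_mixed <- as.matrix(mixed)
type_mixed <- typeof(mat_mixed)      # "character"

# With only numeric columns, the matrix stays numeric.
numeric_only <- data.frame(x = c(0.5, 1.5), y = c(2, 3))
mat_numeric <- as.matrix(numeric_only)
type_numeric <- typeof(mat_numeric)  # "double"
```

So the matrix approach fits the benchmark above, where both columns are numeric, but it would silently turn numbers into strings for a data.frame with mixed column types.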

score 6 · Accepted Answer

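I like RSQLite for this: dbWriteTable(..., append=TRUE) statements while collecting, and a dbReadTable statement at the end.

If the data is small enough, you can use the ":memory:" file; if it is large, use the hard disk.

Of course, it cannot compete in terms of speed: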

makeRow <- function() data.frame(x=runif(1),y=rnorm(1))

library(RSQLite)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

collect1 <- function(n) {
  for (i in 1:n) dbWriteTable(con, "test", makeRow(), append=TRUE)
  dbReadTable(con, "test", row.names=NULL)
}

collect2 <- function(n) {
  res <- data.frame(x=rep(NA, n), y=rep(NA, n))
  for(i in 1:n) res[i,] <- makeRow()[1,]
  res
}

> system.time(collect1(1000))
   user  system elapsed 
   7.01    0.00    7.05 
> system.time(collect2(1000))
   user  system elapsed 
   0.80    0.01    0.81 

But it might look better if the data.frames have more than one row. And you don't need to know the number of rows in advance.

score 3 · Accepted Answer

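This post suggests stripping off the class attributes of a data.frame/tibble using as.list, assigning to the list elements in place the usual way, and then converting the result back to a data.frame/tibble. The running time of this method grows linearly, but at a very mild rate, less than 10e-6.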

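library(tibble)    # tibble(), as_tibble()
library(magrittr)  # %>%

in_place_list_bm <- function(n) {
    res <- tibble(x = rep(NA_real_, n))
    tracemem(res)
    res <- as.list(res)  # drop the tibble class; res is now a plain list
    for (i in 1:n) {
        res[['x']][[i]] <- i  # assign into the column vector
    }
    return(res %>% as_tibble())
}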

> system.time(in_place_list_bm(10000))[[3]]
tracemem[0xd87aa08 -> 0xd87aaf8]: as.list.data.frame as.list in_place_list_bm system.time 
tracemem[0xd87aaf8 -> 0xd87abb8]: in_place_list_bm system.time 
tracemem[0xd87abb8 -> 0xe045928]: check_tibble list_to_tibble as_tibble.list as_tibble <Anonymous> withVisible freduce _fseq eval eval withVisible %>% in_place_list_bm system.time 
tracemem[0xe045928 -> 0xe043488]: new_tibble list_to_tibble as_tibble.list as_tibble <Anonymous> withVisible freduce _fseq eval eval withVisible %>% in_place_list_bm system.time 
tracemem[0xe043488 -> 0xe043728]: set_tibble_class new_tibble list_to_tibble as_tibble.list as_tibble <Anonymous> withVisible freduce _fseq eval eval withVisible %>% in_place_list_bm system.time 
[1] 0.006
> system.time(in_place_list_bm(100000))[[3]]
tracemem[0xdf89f78 -> 0xdf891b8]: as.list.data.frame as.list in_place_list_bm system.time 
tracemem[0xdf891b8 -> 0xdf89278]: in_place_list_bm system.time 
tracemem[0xdf89278 -> 0x5e00fb8]: check_tibble list_to_tibble as_tibble.list as_tibble <Anonymous> withVisible freduce _fseq eval eval withVisible %>% in_place_list_bm system.time 
tracemem[0x5e00fb8 -> 0x5dd46b8]: new_tibble list_to_tibble as_tibble.list as_tibble <Anonymous> withVisible freduce _fseq eval eval withVisible %>% in_place_list_bm system.time 
tracemem[0x5dd46b8 -> 0x5dcec98]: set_tibble_class new_tibble list_to_tibble as_tibble.list as_tibble <Anonymous> withVisible freduce _fseq eval eval withVisible %>% in_place_list_bm system.time 
[1] 0.045
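The same idea can be sketched with a plain data.frame and base R only, without tibble or the pipe. The function name in_place_list_df and the two-column layout are assumptions made for this illustration.

```r
in_place_list_df <- function(n) {
  # Pre-allocate the columns as a data.frame, as in the question.
  res <- data.frame(x = rep(NA_real_, n), y = rep(NA_real_, n))

  # as.list() drops the data.frame class, leaving a named list of vectors.
  cols <- as.list(res)

  for (i in seq_len(n)) {
    # Ordinary list/vector assignment, no data.frame method dispatch.
    cols[["x"]][[i]] <- runif(1)
    cols[["y"]][[i]] <- rnorm(1)
  }

  # Convert back to a data.frame once, at the end.
  as.data.frame(cols)
}

res <- in_place_list_df(1000)
```

The conversions at the start and end each happen once, so their cost does not depend on how many individual assignments the loop performs.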

The original post includes a picture (not reproduced here).
