performance - Most efficient list to data.frame method?

Question

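I just had a conversation with coworkers about this, and we thought it would be worth seeing what people out in SO land had to say. Suppose I have a list with N elements, where each element is a vector of length X. Now suppose I want to transform that into a data.frame. As with most things in R, there are multiple ways of skinning the proverbial cat, such as as.data.frame, using the plyr package, combining do.call with cbind, pre-allocating the DF and filling it in, and others.

The question that came up was what happens when either N or X (in our case it is X) becomes extremely large. Is there one cat-skinning method that is notably superior when efficiency (particularly in terms of memory) is of the essence?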

score 29 · Accepted Answer

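Since a data.frame is already a list, and you know that each list element has the same length (X), the fastest approach is probably to just update the class and row.names attributes: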

set.seed(21)
n <- 1e6
x <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
x <- c(x,x,x,x,x,x)

system.time(a <- as.data.frame(x))
system.time(b <- do.call(data.frame,x))
system.time({
  d <- x  # Skip 'c' so Joris doesn't down-vote me! ;-)
  class(d) <- "data.frame"
  rownames(d) <- 1:n
  names(d) <- make.unique(names(d))
})

identical(a, b)  # TRUE
identical(b, d)  # TRUE
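
To see why setting a couple of attributes is enough, it can help to inspect a small data.frame. The sketch below is illustrative only (the names `small` and `plain` are not from the answer) and assumes nothing beyond base R: it shows that the object is stored as a list and that removing the class and row.names leaves an ordinary named list.

```r
# A toy data.frame; 'small' is an illustrative name, not from the answer.
small <- data.frame(x = c(1.5, 2.5, 3.5), y = c(10, 20, 30))

typeof(small)             # underlying storage type of the object
is.list(small)            # a data.frame passes the list check
names(attributes(small))  # the attributes that turn the list into a data.frame

# Dropping the class and the row.names leaves a plain named list
# holding the same column vectors.
plain <- unclass(small)
attr(plain, "row.names") <- NULL
identical(plain, list(x = c(1.5, 2.5, 3.5), y = c(10, 20, 30)))
```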

Update - this is ~2x faster than creating d:

system.time({
  e <- x
  attr(e, "row.names") <- c(NA_integer_,n)
  attr(e, "class") <- "data.frame"
  attr(e, "names") <- make.names(names(e), unique=TRUE)
})

identical(d, e)  # TRUE
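
The two-element row.names value used above is a compact encoding: the first element is NA and the second carries the row count, so no length-n vector of row names has to be stored. A minimal sketch of how base R reads that encoding back, using a toy list (`cols` and `n_rows` are illustrative names):

```r
# Toy input; 'cols' and 'n_rows' are illustrative names.
cols <- list(a = c(1, 2, 3, 4, 5), b = c(5, 4, 3, 2, 1))
n_rows <- length(cols[[1]])

# Same two attribute assignments as in the answer, on a small list.
attr(cols, "row.names") <- c(NA_integer_, n_rows)
attr(cols, "class") <- "data.frame"

nrow(cols)                 # row count derived from the compact form
.row_names_info(cols, 2L)  # number of rows as recorded in the attribute
attr(cols, "row.names")    # expanded to an integer sequence on access
```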

Update 2 - I forgot about memory consumption. The last update makes two copies of e. Using the attributes function reduces that to only one copy.

set.seed(21)
f <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
f <- c(f,f,f,f,f,f)
tracemem(f)
system.time({  # makes 2 copies
  attr(f, "row.names") <- c(NA_integer_,n)
  attr(f, "class") <- "data.frame"
  attr(f, "names") <- make.names(names(f), unique=TRUE)
})

set.seed(21)
g <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
g <- c(g,g,g,g,g,g)
tracemem(g)
system.time({  # only makes 1 copy
  attributes(g) <- list(row.names=c(NA_integer_,n),
    class="data.frame", names=make.names(names(g), unique=TRUE))
})

identical(f,g)  # TRUE
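
For reuse, the single `attributes<-` call could be wrapped in a small function. The helper below is a hypothetical sketch, not part of the answer: the name `list_to_df` and the added equal-length check are assumptions made for illustration, since the answer simply takes equal lengths as given.

```r
# Hypothetical helper (not from the answer): wraps the single
# attributes<- assignment and checks the equal-length assumption first.
list_to_df <- function(x) {
  stopifnot(is.list(x), length(x) > 0L, !is.null(names(x)))
  n_rows <- length(x[[1L]])
  if (any(lengths(x) != n_rows)) {
    stop("all list elements must have the same length")
  }
  attributes(x) <- list(
    row.names = c(NA_integer_, n_rows),  # compact row names, as in the answer
    class     = "data.frame",
    names     = make.names(names(x), unique = TRUE)
  )
  x
}

# Usage on a toy list with a duplicated name.
cols <- list(x = c(1, 2, 3), y = c(4, 5, 6), x = c(7, 8, 9))
df <- list_to_df(cols)
names(df)  # duplicated names are made unique by make.names
dim(df)
```

Because the function modifies its argument, R still makes one shallow copy of the list, which matches the "only makes 1 copy" behaviour described above; the column vectors themselves are not expected to be duplicated.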

score 10 · Accepted Answer

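Given the need for efficiency with large datasets, this seems to call for a data.table suggestion. It is worth noting that setattr sets attributes by reference and does not copy: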

library(data.table)
set.seed(21)
n <- 1e6
h <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
h <- c(h,h,h,h,h,h)
tracemem(h)

system.time({
  h <- as.data.table(h)
  setattr(h, 'names', make.names(names(h), unique=TRUE))
})

as.data.table, however, does make a copy.
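
If the target is a data.table rather than a data.frame, data.table's setDT() is documented as converting a list by reference, which may avoid the copy made by as.data.table. The sketch below is not from the original answer; it assumes the data.table package is installed, uses a toy list named `cols`, and compares column addresses with data.table's address() as a rough check.

```r
library(data.table)

# Toy input; 'cols' and 'before' are illustrative names.
cols <- list(x = c(1, 2, 3), y = c(4, 5, 6), x = c(7, 8, 9))
before <- address(cols[[1]])  # memory address of the first column vector

setDT(cols)                   # turn the list into a data.table by reference
setattr(cols, "names", make.names(names(cols), unique = TRUE))

is.data.table(cols)
identical(address(cols[[1]]), before)  # did the column keep its address?
```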

Edit - no-copy version

Using @MatthewDowle's suggestion of setattr(h, 'class', 'data.frame'), the list is converted to a data.frame by reference (no copies):

set.seed(21)
n <- 1e6
i <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
i <- c(i,i,i,i,i,i)
tracemem(i)

system.time({
  setattr(i, 'class', 'data.frame')
  setattr(i, "row.names", c(NA_integer_,n))
  setattr(i, "names", make.names(names(i), unique=TRUE))
})
