df %>% split(.$x)

becomes very slow for a large number of unique values of x. If we first manually partition the data frame into smaller subsets and then split each subset, the time drops by at least an order of magnitude.

library(dplyr)
library(microbenchmark)
library(caret)   # createFolds()
library(purrr)   # flatten()

N      <- 10^6
groups <- 10^5
df     <- data.frame(x = sample(1:groups, N, replace = TRUE),
                     y = sample(letters,  N, replace = TRUE))
# Partition the unique group ids into 10 and 100 folds for the manual pre-split
ids      <- df$x %>% unique
folds10  <- createFolds(ids, 10)
folds100 <- createFolds(ids, 100)

Running microbenchmark gives us:

## Unit: seconds

## expr                                                  mean
l1 <- df %>% split(.$x)                                # 242.11805

l2 <- lapply(folds10,  function(id) df %>% 
      filter(x %in% id) %>% split(.$x)) %>% flatten    # 50.45156  

l3 <- lapply(folds100, function(id) df %>% 
      filter(x %in% id) %>% split(.$x)) %>% flatten    # 12.83866  

Is split not designed for a large number of groups? Are there alternatives besides the manual initial subsetting?

My laptop is a late-2013 MacBook Pro, 2.4 GHz, 8 GB RAM.


3 Answers


More an explanation than an answer. Subsetting a large data frame is more costly than subsetting a small one:

> df100 = df[1:100,]
> idx = c(1, 10, 20)
> microbenchmark(df[idx,], df100[idx,], times=10)
Unit: microseconds
         expr     min      lq     mean  median      uq     max neval
    df[idx, ] 428.921 441.217 445.3281 442.893 448.022 475.364    10
 df100[idx, ]  32.082  32.307  35.2815  34.935  37.107  42.199    10

split() pays this cost once for every group.
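
To make the compounding visible, here is a hedged illustration of my own (a toy benchmark, not from the original post): with the row count held fixed, increasing the number of groups blows up the total cost of split().

# Same number of rows, different number of groups; timings vary by machine.
x_few  <- sample(1:10,   10^5, replace = TRUE)   # ~10 groups
x_many <- sample(1:10^4, 10^5, replace = TRUE)   # ~10^4 groups
d <- data.frame(y = runif(10^5))
system.time(split(d, x_few))    # cheap: the per-group cost is paid ~10 times
system.time(split(d, x_many))   # much slower: the same cost is paid ~10^4 times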

Running Rprof() shows why:

> Rprof(); for (i in 1:1000) df[idx,]; Rprof(NULL); summaryRprof()
$by.self
       self.time self.pct total.time total.pct
"attr"      1.26      100       1.26       100

$by.total
               total.time total.pct self.time self.pct
"attr"               1.26       100      1.26      100
"[.data.frame"       1.26       100      0.00        0
"["                  1.26       100      0.00        0

$sample.interval
[1] 0.02

$sampling.time
[1] 1.26

All of the time is spent in calls to attr(). Stepping through the code with debug("[.data.frame") shows that the pain involves a call like

attr(df, "row.names")

This small example shows the trick R uses to avoid representing row names that do not exist: it stores c(NA, -5L) rather than 1:5.

> dput(data.frame(x=1:5))
structure(list(x = 1:5), .Names = "x", row.names = c(NA, -5L), class = "data.frame")

Note that attr() nonetheless returns a full vector; the row.names are created on the fly, so for a large data.frame a large vector of row names is materialized on every subset.

> attr(data.frame(x=1:5), "row.names")
[1] 1 2 3 4 5
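
As an aside, base R's .row_names_info() lets you inspect whether row names are stored in this compact form; a minimal sketch, my own addition rather than part of the original answer:

# type = 0 returns the raw internal attribute; type = 1 returns the row
# count, negated when row names are "automatic" (i.e. stored compactly).
d5 <- data.frame(x = 1:5)
.row_names_info(d5, type = 0L)   # c(NA, -5L): compact representation
.row_names_info(d5, type = 1L)   # -5: automatic row names

rownames(d5) <- letters[1:5]
.row_names_info(d5, type = 0L)   # "a" "b" "c" "d" "e": fully materialized
.row_names_info(d5, type = 1L)   # 5: real row names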

So one might expect that even meaningless but explicit row.names would speed up the computation:

> dfns = df; rownames(dfns) = rev(seq_len(nrow(dfns)))
> system.time(split(dfns, dfns$x))
   user  system elapsed 
  4.048   0.000   4.048 
> system.time(split(df, df$x))
   user  system elapsed 
 87.772  16.312 104.100 

Splitting a vector or a matrix is also fast.
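
One way to exploit that observation, sketched here as my own hedged addition building on the question's df: split the row indices, which are a plain integer vector, and defer materializing the per-group data frames.

# Fast: split() on an integer vector avoids `[.data.frame` entirely.
idx_by_group <- split(seq_len(nrow(df)), df$x)
# Materializing every sub-data.frame still pays the per-group subsetting
# cost described above, so keep the index list and subset only on demand:
one_group <- df[idx_by_group[[1]], , drop = FALSE]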

Answered 2016-09-17T15:05:52.610

This isn't strictly a split.data.frame issue; there is a more general problem with the scalability of data.frame for many groups.
You can get a pretty nice speedup if you use split.data.table. I developed this method on top of the regular data.table methods, and it seems to scale pretty well here.

system.time(
    l1 <- df %>% split(.$x)   
)
#   user  system elapsed 
#200.936   0.000 217.496 
library(data.table)
dt = as.data.table(df)
system.time(
    l2 <- split(dt, by="x")   
)
#   user  system elapsed 
#  7.372   0.000   6.875 
system.time(
    l3 <- split(dt, by="x", sorted=TRUE)   
)
#   user  system elapsed 
#  9.068   0.000   8.200 

sorted=TRUE will return the list in the same order as the data.frame method; by default, the data.table method preserves the order present in the input data. If you want to stick with data.frame, you can use lapply(l2, setDF) at the end.
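
For completeness, a small hedged sketch of that last step (setDF converts by reference and returns the object invisibly, so lapply works):

# Convert each split data.table back to a plain data.frame in place.
l2_df <- lapply(l2, data.table::setDF)
class(l2_df[[1]])   # "data.frame"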

PS. split.data.table was added in 1.9.7; installing the devel version is pretty simple:

install.packages("data.table", type="source", repos="http://Rdatatable.github.io/data.table")

More about that is in the Installation wiki.

Answered 2016-09-17T15:25:46.557

A very nice cheat is to use group_split from dplyr 0.8.3 or later:

random_df <- tibble(colA = paste("A", 1:1200000, sep = "_"),
                    colB = as.character(paste("A", 1:1200000, sep = "_")),
                    colC = 1:1200000)

# Base R split(): takes minutes on this input
random_df_list <- split(random_df, random_df$colC)

# dplyr group_split(): takes seconds
random_df_list <- random_df %>% group_split(colC)

This cuts the operation from minutes down to seconds!
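
One caveat worth hedging: unlike split(), group_split() returns an unnamed list. A sketch of recovering the group names with group_keys(), my own addition assuming the random_df above:

grouped <- random_df %>% group_by(colC)
random_df_list <- grouped %>% group_split()
keys <- grouped %>% group_keys()     # one row per list element, same order
names(random_df_list) <- keys$colC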

Answered 2019-08-23T14:41:46.027