r - 通过变量拆分 data.frame

Question

我有来自多个主题的数据存储在一个 CSV 文件中。导入 CSV 文件后，我想将每个参与者的数据拆分到自己的 data.frame 中。

更确切地说，我想以下面的示例数据，并创建三个新的 data.frames；每个“subject_initials”值一个。

在此处输入图像描述

我该怎么做呢？到目前为止，我已经研究了使用plyr包 and的选项split()，但还没有找到解决方案。我知道我可能遗漏了一些明显的东西。

score 12 · Accepted Answer

split 在这里似乎很合适。

如果您从以下数据框开始：

df <- data.frame(ids=c(1,1,2,2,3),x=1:5,y=letters[1:5])

然后你可以这样做：

split(df, df$ids)

你会得到一个数据框列表：

R> split(df, df$ids)
$`1`
  ids x y
1   1 1 a
2   1 2 b

$`2`
  ids x y
3   2 3 c
4   2 4 d

$`3`
  ids x y
5   3 5 e

score 12 · Accepted Answer

split是通用的。虽然split.default速度非常快，split.data.frame但当要拆分的级别数量增加时会变得非常慢。

替代（更快）的解决方案是使用data.table. 我将在这里说明更大数据的差异：

样本数据（@Roland 在他的评论中指的是什么）

require(data.table)
set.seed(45)
DF <- data.frame(ids = sample(1e4, 1e6, TRUE), x = sample(letters, 1e6, TRUE), 
                  y = runif(1e6))
DT <- as.data.table(DF)

功能+基准测试

请注意，这里的数据顺序会有所不同，因为按“ids”进行拆分排序。如果你愿意，你可以先做setkey(DT, ids)然后运行f2。

f1 <- function() split(DF, DF$ids)
f2 <- function() {
    ans <- DT[, list(list(.SD)), by=ids]$V1
    setattr(ans, 'names', unique(DT$ids)) # sets names by reference, no copy here.
}

require(microbenchmark)
microbenchmark(ans1 <- f1(), ans2 <- f2(), times=10)

# Unit: milliseconds
#          expr        min         lq     median         uq       max neval
#  ans1 <- f1() 37015.9795 43994.6629 48132.3364 49086.0926 63829.592    10
#  ans2 <- f2()   332.6094   361.1902   409.2191   528.0674  1005.457    10

split.data.frame平均耗时48 秒，而平均data.table耗时0.41 秒

r - 通过变量拆分 data.frame

2 回答 2

样本数据（@Roland 在他的评论中指的是什么）

功能+基准测试

Related

Reference