r - Create a sequence number for blocks of records in an R data frame

Question

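I have a fairly large dataset (by my standards), and I want to create a sequence number for blocks of records. I can do this with the plyr package, but it runs slowly. The code below simulates a data frame of comparable size.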

## simulate an example of the size of a normal data frame
N <- 30000
id <- sample(1:17000, N, replace = TRUE)
term <- as.character(sample(c(9:12), N, replace = TRUE))
date <- sample(seq(as.Date("2012-08-01"), Sys.Date(), by = "day"), N, replace = TRUE)
char <- data.frame(matrix(sample(LETTERS, N * 50, replace = TRUE), N, 50))
val <- data.frame(matrix(rnorm(N * 50), N, 50))
df <- data.frame(id, term, date, char, val, stringsAsFactors = FALSE)
dim(df)

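This is actually a bit smaller than what I work with, since the values are usually larger... but it is close enough.

Here is the execution time on my machine: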

> system.time(test.plyr <- ddply(df, 
+                                .(id, term), 
+                                summarise, 
+                                seqnum = 1:length(id), 
+                                .progress="text"))
  |===============================================================================================| 100%
   user  system elapsed 
  63.52    0.03   63.85

Is there a better way? Unfortunately, I am on a Windows machine.

Thanks in advance.

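Edit: data.table is very fast, but I cannot get it to compute the sequence numbers correctly. Here is what my ddply version produces. Most groups have only one record, but some have 2 rows, 3 rows, and so on.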

> with(test.plyr, table(seqnum))
seqnum
    1     2     3     4     5 
24272  4950   681    88     9

With data.table, using the same approach as shown below, I get:

> with(test.dt, table(V1))
V1
    1 
24272

score 5 · Accepted Answer

Use data.table:

library(data.table)
dt = data.table(df)
# .N is the number of rows in each (id, term) group
test.dt = dt[, .N, by = "id,term"]

Here is a timing comparison. I used N = 3000 and replaced 17000 with 1700 when generating the dataset.

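library(plyr)
library(data.table)
library(rbenchmark)

# plyr version: one sequence number per row within each (id, term) group
f_plyr <- function() {
  test.plyr <- ddply(df, .(id, term), summarise, seqnum = 1:length(id),
                     .progress = "text")
}

# data.table version: the row count (.N) for each (id, term) group
f_dt <- function() {
  dt = data.table(df)
  test.dt = dt[, .N, "id,term"]
}

benchmark(f_plyr(), f_dt(), replications = 10,
  columns = c("test", "replications", "elapsed", "relative"))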

data.table is about 170 times faster:

      test replications elapsed relative
2   f_dt()           10   0.779    1.000
1 f_plyr()           10 132.572  170.182

Also have a look at Hadley's work on dplyr. I would not be surprised if dplyr provides an additional speedup, since a lot of the code has been rewritten in C.
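For reference, a per-group sequence number could be written in dplyr roughly as follows. This is only a sketch on a small made-up data frame (`toy` and its values are hypothetical, not the asker's data), and it was not part of the benchmark above, so no speed claim is attached to it.

```r
library(dplyr)

# Small illustrative data frame (hypothetical values)
toy <- data.frame(
  id   = c(1, 1, 2, 3, 3, 3),
  term = c("9", "9", "10", "11", "11", "12"),
  stringsAsFactors = FALSE
)

# row_number() counts rows 1, 2, ... within each (id, term) group
result <- toy %>%
  group_by(id, term) %>%
  mutate(seqnum = row_number()) %>%
  ungroup()

print(result)
```

Because `mutate` keeps every input row, the result has one sequence number per record rather than one row per group.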

Update: edited the code, changing length(id) to .N based on Matt's comment.
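Note that `.N` on its own returns one row per group holding the group size, not a running index within the group. Inside `j`, a grouping column such as `id` has length 1, which is why `length(id)` produced a column of 1s in the question. If the goal is a sequence number per record that matches the ddply output, one possible approach is `seq_len(.N)`. The sketch below uses a small made-up table (`toy` and its values are hypothetical) and has not been timed against the benchmark above.

```r
library(data.table)

# Small illustrative table (hypothetical values, not the asker's data)
toy <- data.table(
  id   = c(1, 1, 2, 3, 3, 3),
  term = c("9", "9", "10", "11", "11", "12")
)

# .N alone: one row per (id, term) group, holding the group size
counts <- toy[, .N, by = .(id, term)]

# seq_len(.N) expands to 1, 2, ..., .N within each group;
# := adds it as a new column by reference and keeps all other columns
toy[, seqnum := seq_len(.N), by = .(id, term)]

print(counts)
print(toy)
```

Using `:=` keeps every original row and column, so `with(toy, table(seqnum))` can be compared directly with the ddply table in the question.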
