Here is an example using some sample data: initially 10 million rows with 100 users and 100,000 time points each, then 140 million rows with 1,400 users, so the same number of time points per user. This transposes the time points into columns; I should imagine it would be quicker if you transposed the users into columns instead. I'm using @Arun's answer as a template here. Basically it shows that on a really big table you can do this on a single core (i7 2.6 GHz) in under 90 seconds (and that's using code which is possibly not fully optimised):
require(data.table)
## Smaller sample dataset - 10 million row, 100 users, 100,000 time points each
DT <- data.table( Date = sample(100, 1e7, replace=TRUE) , User = rep(1:100, each=1e5) )
## Size of table in memory
tables()
# NAME NROW MB COLS KEY
#[1,] DT 10,000,000 77 Date,User
#Total: 77MB
## Diff by user
dt.test <- quote({
DT2 <- DT[ , list(Diff=diff(c(0,Date))) , by=list(User) ]  ## long format: one diff per time point, per user
DT2 <- DT2[ , as.list(setattr(Diff, 'names', seq_along(Diff))) , by=list(User) ]  ## cast wide: one column per time point
})
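As a quick sanity check (not part of the benchmark, just made-up toy data) of what those two steps produce, here is the same idiom on a tiny table:

## Sanity check on a toy table
toy <- data.table( Date = c(1,4,9,2,3,10) , User = rep(1:2, each=3) )
toy2 <- toy[ , list(Diff=diff(c(0,Date))) , by=list(User) ]
toy2[ , as.list(setattr(Diff, 'names', seq_along(Diff))) , by=list(User) ]
#    User 1 2 3
# 1:    1 1 3 5
# 2:    2 2 1 7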
## Benchmark it
require(microbenchmark)
microbenchmark( eval(dt.test) , times = 5L )
#Unit: seconds
# expr min lq median uq max neval
# eval(dt.test) 5.788364 5.825788 5.9295 5.942959 6.109157 5
## And with 140 million rows...
DT <- data.table( Date = sample(100, 1.4e8, replace=TRUE) , User = rep(1:1400, each=1e5) )
tables()
# NAME NROW MB COLS KEY
#[1,] DT 140,000,000 1069 Date,User
microbenchmark( eval(dt.test) , times = 1L )
#Unit: seconds
# expr min lq median uq max neval
# eval(dt.test) 84.3689 84.3689 84.3689 84.3689 84.3689 1
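And for reference, a minimal sketch of the users-into-columns variant I suggested at the top. This is hypothetical and not benchmarked here; it assumes every user has the same number of time points, and it uses data.table's dcast (use reshape2::dcast or dcast.data.table on older data.table versions):

## Sketch: transpose users (not time points) into columns
dt.wide <- quote({
DT2 <- DT[ , list(Diff=diff(c(0,Date))) , by=list(User) ]
DT2[ , Time := seq_len(.N) , by=list(User) ]         ## row index within each user
DT3 <- dcast(DT2, Time ~ User, value.var = "Diff")   ## 100,000 rows x 1,400 user columns
})

The intuition is that allocating ~1,400 long columns should be cheaper than allocating 100,000 short ones.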