
I'm impressed by how much data.table speeds up tapply-like operations compared to data frames.

For example:

library(data.table)

df = data.frame(class = round(runif(1e6,1,1000)), x=rnorm(1e6))
DT = data.table(df)

# takes ages if somefun is complex
res1 = tapply(df$x, df$class, somefun) 

# much faster
setkey(DT, class)
res2 = DT[,somefun(x),by=class] 

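However, for apply-like operations (i.e. cases where a function has to be applied to each row), I haven't managed to make it run much faster than with data frames.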

df = data.frame(x1 = rnorm(1e6), x2=rnorm(1e6))
DT = data.table(df)

# takes ages if somefun is complex
res1 = apply(df, 1, somefun) 

# not much improvement, if at all 
DT[,rowid:=.I] # or: DT$rowid = 1:nrow(DT)
setkey(DT, rowid)
res2 = DT[,somefun1(x1,x2),by=rowid] 

Is this simply to be expected, or are there some tricks?


2 Answers


If you cannot vectorize your function (because of recursion, for example), then you are in Rcpp territory. The usual recipe for combining Rcpp and data.table is:

  1. shape your data.table accordingly (setkey...)
  2. write your C/C++ function, say f, that takes an Rcpp::DataFrame and returns an Rcpp::List
  3. update by reference: cppOutList <- f(DT), then DT[,names(cppOutList):=cppOutList]

Doing this usually saves orders of magnitude in run time.
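
A minimal sketch of the three steps above, assuming Rcpp and a working C++ toolchain are installed. The recurrence inside f (each output depends on the previous one) is a made-up stand-in for a function that is hard to vectorize, and the output column name y is likewise hypothetical:

```r
library(data.table)
library(Rcpp)

# Step 2: a C++ function f that takes a DataFrame and returns a List.
# Hypothetical recurrence: y[i] = 0.5 * y[i-1] + x1[i] * x2[i], with y[-1] = 0.
cppFunction('
List f(DataFrame df) {
  NumericVector x1 = df["x1"];
  NumericVector x2 = df["x2"];
  int n = x1.size();
  NumericVector y(n);
  double prev = 0.0;
  for (int i = 0; i < n; ++i) {
    prev = 0.5 * prev + x1[i] * x2[i];
    y[i] = prev;
  }
  return List::create(Named("y") = y);
}')

# Step 1: build the data.table (no key is needed for this example).
set.seed(1)
DT <- data.table(x1 = rnorm(1e5), x2 = rnorm(1e5))

# Step 3: call f once on the whole table, then add its output by reference.
cppOutList <- f(DT)
DT[, names(cppOutList) := cppOutList]
```

The point of the design is that the loop over rows happens once inside compiled code, instead of calling an R function one million times through by=rowid.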

Answered 2013-05-24T15:00:53.070

You may be able to use set. There is a good benchmark here: Row operations in data.table using `by = .I`
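
As a rough illustration of the set idea, here is a sketch that fills a preallocated column row by row with data.table's set(). The function somefun1 below is a hypothetical stand-in and the column name res is made up; whether this beats by=rowid for a given function is something to benchmark rather than assume:

```r
library(data.table)

set.seed(1)
DT <- data.table(x1 = rnorm(1e4), x2 = rnorm(1e4))

# Hypothetical row-wise function standing in for a complex somefun1.
somefun1 <- function(a, b) max(a, b) - min(a, b)

# Preallocate the result column so set() only overwrites existing cells.
DT[, res := NA_real_]

# Pull the input columns out once to avoid repeated lookups inside the loop.
x1 <- DT$x1
x2 <- DT$x2

# set() assigns by reference and skips the overhead of [.data.table per call.
for (i in seq_len(nrow(DT))) {
  set(DT, i = i, j = "res", value = somefun1(x1[i], x2[i]))
}
```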

Answered 2016-06-13T21:06:31.957