What I really like about data.table is the := idiom for modifying a table by reference, without the need for costly copies. From what I understand, this is one of the aspects that make data.table so fast compared to other methods.
Now I have started playing around with the dplyr package, which seems to be similarly performant. But since results still have to be assigned with the <- operator, I expected a performance penalty at that level. However, there seems to be none.
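One way to see why a plain <- can be cheap is R's copy-on-modify semantics: assignment only binds a name to a value, and a copy is made only when one of several bindings is then modified. A minimal sketch using base R's tracemem() (the toy data below is my own, not the benchmark data):

```r
library(data.table)

df <- data.frame(x = 1:5)
tracemem(df)        # report whenever df's memory is duplicated
df2 <- df           # plain assignment: no copy, just a second name for the data
df2$x <- 0L         # copy-on-modify: tracemem reports a duplication here

dt <- data.table(x = 1:5)
tracemem(dt)
dt[, x := 0L]       # := updates the column by reference: no duplication expected
```

So the assignment itself is not where a copy would happen; what matters is whether the expression on the right-hand side duplicates the underlying data.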
As an example:
library(dplyr)
library(data.table)   # needed for as.data.table() and :=
library(Lahman)
library(microbenchmark)
library(ggplot2)

df <- Batting[ c("yearID", "teamID", "G_batting") ]
mb <- microbenchmark(
  dplyr = {
    tb <- as_tibble( df )
    tb <- tb %>%
      group_by( yearID, teamID ) %>%
      mutate( G_batting = max(G_batting) )
  },
  data.table = {
    dt <- as.data.table( df )
    dt[ , G_batting := max(G_batting), by = list( yearID, teamID ) ]
  },
  times = 500
)
qplot( data = mb, x = expr, y = time * 1e-6, geom = "boxplot", ylab = "time [ms]", xlab = "approach" )
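As a diagnostic outside the benchmark, one can check directly that the data.table branch avoids a copy by comparing the object's address before and after the update; address() below comes from data.table, and this is only a sketch of that check:

```r
library(data.table)
library(Lahman)

df <- Batting[ c("yearID", "teamID", "G_batting") ]
dt <- as.data.table(df)

before <- address(dt)
dt[ , G_batting := max(G_batting), by = list( yearID, teamID ) ]
identical(address(dt), before)   # TRUE: := updated dt in place, no new object
```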
How is this possible? Is there a conceptual mistake in the way I benchmark, or is my understanding of <- wrong?