performance - 子集数据框的最有效方法

Question

SQL/indexing/data.table任何人都可以在不使用选项的情况下建议更有效的子集数据框方法吗？

我寻找了类似的问题，这个问题建议使用索引选项。

以下是使用时间进行子集化的方法。

#Dummy data
dat <- data.frame(x = runif(1000000, 1, 1000), y=runif(1000000, 1, 1000))

#Subset and time
system.time(x <- dat[dat$x > 500, ])
#   user  system elapsed 
#  0.092   0.000   0.090 
system.time(x <- dat[which(dat$x > 500), ])
#   user  system elapsed 
#  0.040   0.032   0.070 
system.time(x <- subset(dat, x > 500))
#   user  system elapsed 
#  0.108   0.004   0.109

编辑： 正如 Roland 建议的那样，我使用了 microbenchmark。它似乎which表现最好。

library("ggplot2")
library("microbenchmark")

#Dummy data
dat <- data.frame(x = runif(1000000, 1, 1000), y=runif(1000000, 1, 1000))

#Benchmark
res <- microbenchmark( dat[dat$x > 500, ],
                       dat[which(dat$x > 500), ],
                       subset(dat, x > 500))
#plot
autoplot.microbenchmark(res)

在此处输入图像描述

score 1 · Accepted Answer

正如 Roland 建议的那样，我使用了 microbenchmark。它似乎which表现最好。

library("ggplot2")
library("microbenchmark")

#Dummy data
dat <- data.frame(x = runif(1000000, 1, 1000), y=runif(1000000, 1, 1000))

#Benchmark
res <- microbenchmark( dat[dat$x > 500, ],
                       dat[which(dat$x > 500), ],
                       subset(dat, x > 500))
#plot
autoplot.microbenchmark(res)

在此处输入图像描述

performance - 子集数据框的最有效方法

1 回答 1

Related

Reference