Edit: added benchmarks on a larger data set.
Use tapply to compute the population of each district:
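# Index the tapply result by name (not by position), so the lookup stays
# correct even if district IDs are not the consecutive integers 1..n
districtdata$population <-
  tapply(towndata$population, towndata$district_ID, sum)[as.character(districtdata$district_ID)]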
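To make the lookup step concrete, here is a minimal sketch on made-up data. The values below are hypothetical and only the names towndata and districtdata come from the answer; pop_by_district is a helper name introduced for illustration. The district table is deliberately in a different order than the towns, and district 40 has no towns.

```r
# Hypothetical toy data (not from the question)
towndata <- data.frame(
  town        = 1:6,
  district_ID = c(10, 10, 20, 30, 30, 30),
  population  = c(100, 200, 50, 10, 20, 30)
)
districtdata <- data.frame(district_ID = c(30, 10, 20, 40))

# Sum the town populations per district; the result is named by district_ID
pop_by_district <- tapply(towndata$population, towndata$district_ID, sum)

# Look the totals up by name, so row order and gaps in the IDs do not matter;
# as.vector() drops the names and dim attributes before storing the column
districtdata$population <-
  as.vector(pop_by_district[as.character(districtdata$district_ID)])

print(districtdata)
```

A district without any towns, such as 40 here, ends up with NA rather than 0, which may or may not be what you want.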
Some benchmarks, just for fun:
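library(data.table)
library(microbenchmark)

# tapply approach: sum town populations per district, then look up by name
fn1 <- function(districts, towns)
{
  districts$population <-
    tapply(towns$population, towns$district_ID, sum)[as.character(districts$district_ID)]
  districts
}

# Roland's data.table approach: aggregate by key, then merge on district_ID
fn2 <- function(districts, towns)
{
  districts <- data.table(districts, key = "district_ID")
  towns <- data.table(towns, key = "district_ID")
  temp <- towns[, list(district_pop = sum(population)), by = district_ID]
  merge(districts, temp)
}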
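set.seed(42)
districts <- data.frame(district_ID = 1:300, whatever = rnorm(300))
# 300 districts x 300 towns each = 90000 rows. The column lengths as
# originally posted (100000, 90000 and 300000) did not match, so
# data.frame() would fail; they are aligned to 90000 here.
towns <- data.frame(town = 1:90000, district_ID = rep(1:300, each = 300),
                    population = rpois(90000, sample(c(1e3, 1e4, 1e5))))
microbenchmark(fn1(districts, towns), fn2(districts, towns))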
Unit: milliseconds
                  expr       min        lq    median        uq       max neval
 fn1(districts, towns) 215.29266 231.47103 243.72353 265.28280 355.43895   100
 fn2(districts, towns)  20.03636  27.51046  36.11116  58.56448  88.70766   100