performance - R loop runs too slowly

Question

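I have two very large data frames (50MM+ rows) that I need to run some calculations on. I wrote the following loop, but it runs too slowly. I tried using apply and other approaches, but I couldn't get them to work.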

#### Sample Data
df=data.frame(id=1:10,time=Sys.time()-1:10,within5=NA)
df2=data.frame(id2=c(1,1,1,5,5,10),time2=Sys.time()-c(9,5,2,3,4,6))

#### Loop counts how many rows of df2 are within 5 secs of the creation time of each ID in df
for (i in seq_along(df$id))
{
  # df2$id only worked through partial matching of `$`; use the full column name
  temp=df2[df2$id2==df$id[i],]
  df$within5[i]=sum(abs(as.numeric(difftime(temp$time2,df$time[i],units="secs")))<5)
}

score 3 · Accepted Answer

To measure the improvement, I made a larger sample data set.

df=data.frame(id=1:100,time=Sys.time()-1:100)
df2=data.frame(id2=sample(1:100,300000,replace=T),time2=Sys.time()-sample(1:5,300000,replace=T))

Use the function ddply() from the package plyr to split your data by the column id2, then apply your function to each subset.

library(plyr)
df3 <- ddply(df2,"id2",function(x){ 
    data.frame(within5=sum(abs(as.numeric(difftime(x$time2,df$time[df$id==x$id2[1]],units="secs")))<5))})

As a result we get a new data frame.

 head(df3)
  id2 within5
1   1    3129
2   2    3032
3   3    2935
4   4    3121
5   5    3042
6   6    2426

If you need the within5 column in the original data frame, you can use the function merge().

df4 <- merge(df,df3,by.x="id",by.y="id2",all=T)
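One detail worth checking after the merge: with all=TRUE, any id in df that has no rows in df2 comes back with NA in within5, whereas the original loop would give 0 for it. A minimal sketch of filling those in, using small made-up inputs (the values below are hypothetical, not taken from the sample data):

```r
# Hypothetical inputs: id 3 has no events, so it is absent from the counts
df  <- data.frame(id = 1:3)
df3 <- data.frame(id2 = c(1L, 2L), within5 = c(4L, 7L))

# all = TRUE keeps id 3, but its within5 is NA after the merge
df4 <- merge(df, df3, by.x = "id", by.y = "id2", all = TRUE)

# No matching events means a count of zero, as in the original loop
df4$within5[is.na(df4$within5)] <- 0L
df4
```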

With my sample data, the ddply() approach was about 10 times faster than the original loop.
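The speedup will vary with machine and data, so it may be worth timing both versions yourself. The sketch below is one way to do that. To stay free of package dependencies it uses base R split() and vapply() as a stand-in for ddply(), so it illustrates the same split-apply idea rather than the exact call above. The function names loop_count and split_count, the fixed timestamp, and the reduced row count are assumptions made for the example.

```r
set.seed(1)
now <- as.POSIXct("2020-01-01 00:00:00", tz = "UTC")  # fixed time so runs are reproducible
df  <- data.frame(id = 1:100, time = now - 1:100)
df2 <- data.frame(id2   = sample(1:100, 30000, replace = TRUE),
                  time2 = now - sample(1:5, 30000, replace = TRUE))

# Original approach: subset df2 once per row of df
loop_count <- function(df, df2) {
  out <- integer(nrow(df))
  for (i in seq_len(nrow(df))) {
    temp <- df2[df2$id2 == df$id[i], ]
    out[i] <- sum(abs(as.numeric(difftime(temp$time2, df$time[i], units = "secs"))) < 5)
  }
  out
}

# Split-apply approach: split the event times by id2 once, then count per group
split_count <- function(df, df2) {
  groups <- split(df2$time2, df2$id2)
  counts <- vapply(names(groups), function(k) {
    ref <- df$time[df$id == as.integer(k)]
    sum(abs(as.numeric(difftime(groups[[k]], ref, units = "secs"))) < 5)
  }, integer(1))
  out <- integer(nrow(df))  # ids with no events keep a count of 0
  out[match(as.integer(names(counts)), df$id)] <- counts
  out
}

# Both versions should agree before their timings are compared
stopifnot(all(loop_count(df, df2) == split_count(df, df2)))

system.time(loop_count(df, df2))
system.time(split_count(df, df2))
```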

score 1 · Accepted Answer

For the data above, use the second id to look up the reference time, and subtract it from the event time

dt <- df2$time2 - df$time[df2$id2]

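Then select the ids of the events whose absolute time difference is less than 5 seconds

# request seconds explicitly; the difference may otherwise be reported in minutes or hours
okIds <- df2$id2[abs(as.numeric(dt, units = "secs")) < 5]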

Tabulate these, and add the counts to your original data frame

df$within5 <- tabulate(okIds, max(df$id))

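This relies on the ids being sequential integers starting at 1 (if they are not, make them a factor() and use the integer codes of the result), and it is very fast.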
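For ids that are not sequential, the factor() suggestion could look roughly like the sketch below. The data are made up for illustration, and the name codes is an assumption; the idea is only that the factor's integer codes, with levels taken from df$id, give each event the row position of its id in df.

```r
now <- as.POSIXct("2020-01-01 00:00:00", tz = "UTC")  # fixed time for reproducibility

# Hypothetical data with non-sequential ids
df  <- data.frame(id = c(10, 25, 70), time = now - c(1, 20, 40))
df2 <- data.frame(id2   = c(10, 10, 25, 70, 70, 70),
                  time2 = now - c(3, 9, 22, 38, 41, 60))

# Integer codes: position of each event's id among df$id (NA if the id is not in df)
codes <- as.integer(factor(df2$id2, levels = df$id))

# Event time minus the reference time of the matching id, in seconds
dt <- as.numeric(difftime(df2$time2, df$time[codes], units = "secs"))
ok <- !is.na(codes) & abs(dt) < 5

# Count the qualifying events per row of df
df$within5 <- tabulate(codes[ok], nbins = nrow(df))
df
```

Because the codes index rows of df directly, the counts line up with df without a separate merge step.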
