r - Find the maximum from a combination of two tables (for loop too slow)

Question

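I have a data frame, "the.data", in which the first column identifies the measuring instrument and the remaining columns hold the different measurement data.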

instrument <- c(1,2,3,4,5,1,2,3,4,5)
hour <- c(1,1,1,1,1,2,2,2,2,2)
da <- c(12,14,11,14,10,19,15,16,13,11)
db <- c(21,23,22,29,28,26,24,27,26,22)
the.data <- data.frame(instrument,hour,da,db)

I have also defined groups of instruments; for example, group 1 (g1) refers to instruments 1 and 2.

g1 <- c(1,2)
g2 <- c(4,3,1)
g3 <- c(1,5,2)
g4 <- c(2,4)
g5 <- c(5,3,1,2,6)
groups <- c("g1","g2","g3","g4","g5")

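For each group and each data type, I need to find the hour in which the group's sum is at its maximum, along with that sum.

g1 hour 1: sum(da) = 12+14 = 26
g1 hour 2: sum(da) = 19+15 = 34

So for g1 and da, the answer is hour 2, with a sum of 34.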
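The arithmetic above can be checked with a few lines of base R. This is only a minimal sketch for the single group g1 and the da column, using the toy data from the question; the names `in.g1`, `byHour` and `best` are made up for illustration.

```r
# Toy data from the question (da only)
the.data <- data.frame(
  instrument = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  hour       = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  da         = c(12, 14, 11, 14, 10, 19, 15, 16, 13, 11)
)
g1 <- c(1, 2)

# Rows that belong to an instrument in g1
in.g1 <- the.data$instrument %in% g1

# Sum of da per hour for those rows: hour 1 gives 26, hour 2 gives 34
byHour <- tapply(the.data$da[in.g1], the.data$hour[in.g1], sum)

# Hour with the largest sum, and the sum itself
best <- which.max(byHour)
names(best)   # "2"
max(byHour)   # 34
```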

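I am doing this with a for loop inside a for loop, but it takes far too long (I interrupted it after a couple of hours). The problem is that the.data is about 100,000 rows long and there are about 5,000 groups, each with 2-50 instruments.

What is a good way to do this?

My sincere thanks to all contributors on Stack Overflow.

Update: the example now has only five groups.

/Chris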

score 4 · Accepted Answer

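The loop over groups will have to stay, or at best be replaced by something like lapply(). The loop over hour, however, can be replaced entirely by reshaping the data into an instrument x hour matrix and then just doing vectorized algebra. For example: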

library(reshape2)

groups = list(g1, g3)

the.data.a = dcast(the.data[,1:3], instrument ~ hour)

> sapply(groups, function(x) data.frame(max = max(colSums(the.data.a[x, -1])),
                                        ind = which.max(colSums(the.data.a[x, -1]))))
    [,1] [,2]
max 34   45  
ind 2    2
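Note that `the.data.a[x, -1]` selects rows by position, which matches the instrument ids here only because the instruments are numbered 1 to 5 and all of them are present. A variant that looks rows up by instrument id instead might look like the sketch below. It is not part of the answer: it uses base R's `xtabs()` in place of `dcast()`, and `best_hour` is a hypothetical helper name.

```r
# Toy data from the question (da only)
the.data <- data.frame(
  instrument = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  hour       = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  da         = c(12, 14, 11, 14, 10, 19, 15, 16, 13, 11)
)
groups <- list(g1 = c(1, 2), g3 = c(1, 5, 2))

# instrument x hour table of summed da; the dimnames carry the ids and hours
m <- xtabs(da ~ instrument + hour, data = the.data)

# Hypothetical helper: best hour and its sum for one group of instrument ids
best_hour <- function(g) {
  rows <- intersect(as.character(g), rownames(m))  # skip ids that have no data
  byHour <- colSums(m[rows, , drop = FALSE])
  c(max = max(byHour), hour = as.integer(names(which.max(byHour))))
}

res <- sapply(groups, best_hour)
res
```

For g1 and g3 this gives the same maxima (34 and 45, both in hour 2) as the output shown above.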

score 3 · Accepted Answer

This is a slightly modified version of John Colby's answer, with some example data.

set.seed(21)
instrument <- sample(100, 1e5, TRUE)
hour <- sample(24, 1e5, TRUE)
da <- trunc(runif(1e5)*10)
db <- trunc(runif(1e5)*10)
the.data <- data.frame(instrument,hour,da,db)
groups <- replicate(5000, sample(100, sample(50,1)))
names(groups) <- paste("g",1:length(groups),sep="")

library(reshape2)
system.time({    
the.data.a <- dcast(the.data[,1:3], instrument ~ hour, sum)
out <- t(sapply(groups, function(i) {
  byHour <- colSums(the.data.a[i,-1])
  c(max(byHour), which.max(byHour))
}))
colnames(out) <- c("max.sum","max.hour")  # column 1 is max(byHour), column 2 is which.max(byHour)
})
# Using da as value column: use value.var to override.
#    user  system elapsed 
#    3.80    0.00    3.81
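If the remaining `sapply()` loop over groups ever becomes the bottleneck, one option is to replace it with a single matrix product. The sketch below is an assumption-laden variant, not the answer's method: it builds a group x instrument indicator matrix `G` (a name made up here), multiplies it by the instrument x hour matrix of sums, and reads off the row-wise maximum with `max.col()`. It is shown on the toy data from the question and assumes the indicator matrix fits in memory.

```r
# Toy data from the question (da only)
the.data <- data.frame(
  instrument = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  hour       = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  da         = c(12, 14, 11, 14, 10, 19, 15, 16, 13, 11)
)
groups <- list(g1 = c(1, 2), g3 = c(1, 5, 2))

# instrument x hour matrix of summed da (0 where a combination has no data)
M <- unclass(xtabs(da ~ instrument + hour, data = the.data))

# group x instrument indicator matrix: 1 if the instrument belongs to the group
ids <- as.numeric(rownames(M))
G <- t(sapply(groups, function(g) as.numeric(ids %in% g)))

# group x hour matrix of sums, computed in one product
S <- G %*% M

# Column index of the row-wise maximum; the first hour wins ties
best <- max.col(S, ties.method = "first")

out <- data.frame(
  group    = names(groups),
  max.hour = as.integer(colnames(M))[best],
  max.sum  = S[cbind(seq_len(nrow(S)), best)]
)
out
```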

score 2 · Accepted Answer

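Here is one approach using plyr and reshape2, both from Hadley. First, we add some boolean columns to the.data indicating whether the instrument is in each group. Then we melt it into long format, subset out the rows we do not need, and do the group-by operation with either ddply or data.table.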

#add boolean columns
the.data <- transform(the.data, 
                      g1 = instrument %in% g1,
                      g2 = instrument %in% g2,
                      g3 = instrument %in% g3,
                      g4 = instrument %in% g4,
                      g5 = instrument %in% g5
                      )

#load library
library(reshape2)
#melt into long format
the.data.m <- melt(the.data, id.vars = 1:4)
#subset out data that has FALSE for the groupings
the.data.m <- subset(the.data.m, value == TRUE)

#load plyr and data.table
library(plyr)
library(data.table)

#plyr way
ddply(the.data.m, c("variable", "hour"), summarize, out = sum(da))
#data.table way
dt <- data.table(the.data.m)
dt[, list(out = sum(da)), by = "variable, hour"]

Some benchmarking to see which is faster:

library(rbenchmark)   
f1 <- function() ddply(the.data.m, c("variable", "hour"), summarize, out = sum(da))
f2 <- function() dt[, list(out = sum(da)), by = "variable, hour"]

> benchmark(f1(), f2(), replications=1000, order="elapsed", columns = c("test", "elapsed", "relative"))
  test elapsed relative
2 f2()    3.44 1.000000
1 f1()    6.82 1.982558

So, for this example, data.table is about 2x faster. Your mileage may vary.

And just to show that it gives the right values:

> dt[, list(out = sum(da)), by = "variable, hour"]
      variable hour out
 [1,]       g1    1  26
 [2,]       g1    2  34
 [3,]       g2    1  25
 [4,]       g2    2  29

...
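The `transform()` call above needs one hand-written column per group, which does not scale to thousands of groups, and the result still lists every hour rather than only the best one. A possible way around both, sketched here in base R rather than plyr or data.table (that substitution is an assumption of this example), is to build a long membership table with `stack()`, join it to the data with `merge()`, and keep the top row per group. The names `membership`, `joined`, `sums` and `best` are made up for illustration.

```r
# Toy data from the question (da only)
the.data <- data.frame(
  instrument = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  hour       = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  da         = c(12, 14, 11, 14, 10, 19, 15, 16, 13, 11)
)
groups <- list(g1 = c(1, 2), g3 = c(1, 5, 2))

# Long membership table: one row per (instrument, group) pair
membership <- stack(groups)
names(membership) <- c("instrument", "group")

# Each data row is repeated once for every group its instrument belongs to
joined <- merge(the.data, membership, by = "instrument")

# Sum of da per group and hour
sums <- aggregate(da ~ group + hour, data = joined, FUN = sum)

# Keep the row with the largest sum per group (the earliest hour wins ties)
sums <- sums[order(sums$group, -sums$da, sums$hour), ]
best <- sums[!duplicated(sums$group), ]
best
```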

score 2 · Accepted Answer

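You did not provide code (or a programmatic way of generating the groups, which seems necessary when there are 5000 of them), but this might be a more efficient use of R: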

groups <- list(g1,g2,g3,g4,g5)
gmax <- list()
# The "da" results
for( gitem in seq_along(groups) ) { 
       gmax[[gitem]] <- with( subset(the.data , instrument %in% groups[[gitem]]),  
                               tapply(da , hour, sum) ) }
damat <- matrix(c(sapply(gmax, which.max), 
                  sapply(gmax, max)) , ncol=2)

# The "db" results
for( gitem in seq_along(groups) ) { 
       gmax[[gitem]] <- with( subset(the.data , instrument %in% groups[[gitem]]),  
                               tapply(db , hour, sum) ) }
dbmat <- matrix(c(sapply(gmax, which.max), 
                  sapply(gmax, max)) , ncol=2)

#--------
> damat
     [,1] [,2]
[1,]    2   34
[2,]    2   29
[3,]    2   45
[4,]    1   14
[5,]    2   42
> dbmat
     [,1] [,2]
[1,]    2   50
[2,]    2   53
[3,]    1   72
[4,]    1   29
[5,]    1   73
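The da and db blocks above differ only in the column name, so the repetition could be folded into one function that is applied to each value column. The following is a sketch of that idea under the same toy data, restricted to g1 and g3; `best_by_group` is a hypothetical helper, not something from the answer.

```r
# Toy data from the question
the.data <- data.frame(
  instrument = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  hour       = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  da         = c(12, 14, 11, 14, 10, 19, 15, 16, 13, 11),
  db         = c(21, 23, 22, 29, 28, 26, 24, 27, 26, 22)
)
groups <- list(g1 = c(1, 2), g3 = c(1, 5, 2))

# Hypothetical helper: best hour and its sum per group, for one value column
best_by_group <- function(col) {
  t(sapply(groups, function(g) {
    rows <- the.data$instrument %in% g
    byHour <- tapply(the.data[[col]][rows], the.data$hour[rows], sum)
    # which.max returns the first maximum, so the earliest hour wins ties
    c(hour = as.integer(names(which.max(byHour))), sum = max(byHour))
  }))
}

# One result matrix per value column
results <- lapply(c(da = "da", db = "db"), best_by_group)
results$da
results$db
```

Reading the hour from `names(which.max(byHour))` rather than from the index keeps the result correct when the hours present for a group do not start at 1.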
