r - Slow loop in R, any suggestions to speed it up?

Question

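I have a data frame "m" as shown below:

I am trying to find the most active month (the one with the highest V1) for each account. For example, for account "2" it would be "month 6", for account 3 it would be "month 1", and so on.

I wrote the loop below. It works, but it takes a long time even though I only used 8,000 rows. The whole dataset has 250,000 rows, so the code below is not usable. Can anyone suggest a better way to achieve this?

Thanks a lot.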

score 3 · Accepted Answer

You can do this with plyr:

library(plyr)
ddply(m, "AccountID", subset, V1==max(V1))

Edit: To get the result by month instead, just change the "id" variable:

library(plyr)
ddply(m, "Month", subset, V1==max(V1))
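For readers without plyr installed, the same split-apply-combine idea can be sketched in base R. This is not the answer's method, only an illustration of what the ddply call does conceptually; the small data frame and the names `pieces`, `kept`, and `result` are made up for the example.

```r
# Toy data (hypothetical values, same column names as in the question)
m <- data.frame(AccountID = c(1, 1, 2, 2, 3),
                V1        = c(5, 7, 30, 10, 1),
                Month     = c(9, 2, 6, 4, 1))

# Split the rows by account, keep the max-V1 row(s) in each piece, then recombine
pieces <- split(m, m$AccountID)
kept   <- lapply(pieces, function(d) d[d$V1 == max(d$V1), ])
result <- do.call(rbind, kept)
rownames(result) <- NULL
print(result)
```

Like the ddply version, this keeps every tied row if an account reaches its maximum V1 in more than one month.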

score 2 · Accepted Answer

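I think Owe Jessen's comment is correct and this is not an answer to the question. So here is my attempt with data.table.

First, let's use an example that is easier to understand: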

library(data.table)
DT <- data.table(AccountID = rep(1:3, each=4),
                 V1        = sample(1:100, 12, replace=FALSE),
                 Month     = rep(1:4, times=3))
      AccountID V1 Month
 [1,]         1 81     1
 [2,]         1 23     2
 [3,]         1 72     3
 [4,]         1 36     4
 [5,]         2 22     1
 [6,]         2 13     2
 [7,]         2 50     3
 [8,]         2 40     4
 [9,]         3 74     1
[10,]         3 83     2
[11,]         3  4     3
[12,]         3  3     4

So here we have 3 accounts and four months, and for each account/month combination we have one V1. To find the maximum V1 for each account, I do the following:

setkey(DT, AccountID)
DT <- DT[, list(maxV1=max(V1)), by="AccountID"][DT]
DT[maxV1==V1]
     AccountID maxV1 V1 Month
[1,]         1    81 81     1
[2,]         2    50 50     3
[3,]         3    83 83     2

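This is a bit hard to follow, so let me try to explain. I set AccountID as the key of DT. Then DT[, list(maxV1=max(V1)), by="AccountID"][DT] does two things. First, DT[, list(maxV1=max(V1)), by="AccountID"] computes the maximum V1 for each account. Second, calling [DT] right after it joins that result back onto the old DT, which adds the new column maxV1 to every row. After that, I only need to keep the rows for which maxV1==V1 holds.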
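The two steps described above can also be written out separately, which may make the join easier to follow. This is only a sketch: the table uses fixed values copied from the printed example instead of sample(), the names `maxes`, `joined`, and `result` are introduced here for illustration, and the join uses the `on=` argument of more recent data.table versions rather than relying on the key.

```r
library(data.table)

# Fixed version of the small example table (values copied from the output above)
DT <- data.table(AccountID = rep(1:3, each = 4),
                 V1        = c(81, 23, 72, 36, 22, 13, 50, 40, 74, 83, 4, 3),
                 Month     = rep(1:4, times = 3))

# Step 1: one row per account holding that account's maximum V1
maxes <- DT[, list(maxV1 = max(V1)), by = "AccountID"]

# Step 2: join the per-account maxima back onto every row of DT
joined <- maxes[DT, on = "AccountID"]

# Step 3: keep only the rows where V1 equals the account's maximum
result <- joined[maxV1 == V1]
print(result)
```

Keeping the intermediate tables around makes it easy to inspect `maxes` and `joined` when the chained one-liner gives an unexpected result.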

Applying this solution to Nico's more advanced example also shows how to convert a data.frame to a data.table:

library(data.table)
DT <- as.data.table(m)
# Note that this line is only necessary if there is more than one row per Month/AccountID combination
# The summed column is named V1 explicitly so the later steps can refer to it
DT <- DT[, list(V1=sum(V1)), by="Month,AccountID"]
setkey(DT, AccountID)
DT <- DT[, list(maxV1=max(V1)), by="AccountID"][DT]
DT[maxV1==V1]
   AccountID maxV1 Month    V1
           1 24660     1 24660
           2 22643     2 22643
           3 23642     3 23642
           4 22766     5 22766
           5 22445    12 22445
...

This gives exactly 50 rows.

Edit:

Here is a base-R solution:

df <- data.frame(AccountID = rep(1:3, each=4),
                 V1        = sample(1:100, 12, replace=FALSE),
                 Month     = rep(1:4, times=3))
df$maxV1 <- ave(df$V1, df$AccountID, FUN = max)
df[df$maxV1==df$V1, ]

I got the inspiration from here.

score 1 · Accepted Answer

I guess this is basically the same as Tal's solution.

I get reasonable timings with the following loop:

# Generate some random data
AccountID <- sample(1:50, 250000, replace=T)
V1 <- sample(1:100, 250000, replace=T)
Month <- sample(1:12, 250000, replace=T)

m <- data.frame(AccountID, V1, Month)

# Find the most active month for each account

ac = as.numeric(levels(as.factor(m$AccountID)))
active.month = rep(NA, length(ac))
names(active.month) = ac

system.time(
{
  for(i in ac)
  {
    subm = subset(m, AccountID == i)
    # Index by name so this also works when the IDs are not exactly 1..n
    active.month[as.character(i)] = subm[which.max(subm[,"V1"]),"Month"]
  }
})
   user  system elapsed 
   0.78    0.14    0.92

score 1 · Accepted Answer

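I can't see a way to vectorize this algorithm (if someone else does, I'd love to know how).

Here is how I would code it (P.S.: please include self-contained code in the future. Also have a look at ?dput for help):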

make.data <- function(n = 100) # 250000
{
  # Generate some random data
  AccountID <- sample(1:50, n, replace=TRUE)
  V1 <- sample(1:100, n, replace=TRUE)
  Month <- sample(1:12, n, replace=TRUE)

  m <- data.frame(AccountID, V1, Month)
  m
}

fo <- function(X)
{
  unique_ID <- unique(X$AccountID)
  M_max <- numeric(length(unique_ID))

  for(i in seq_along(unique_ID))
  {
    # Logical index of the rows belonging to the current account
    ss <- X$AccountID == unique_ID[i]
    M_max[i] <- X[ss,"Month"][which.max(X[ss,"V1"])]
  }

  # results:
  # M_max
  data.frame(unique_ID, M_max)
}


X <- make.data(1000000)
system.time(fo(X))
#   user  system elapsed 
#   2.32    0.33    2.70

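I suspect some of these functions may be faster than the ones you used (but it is worth testing the timings).

Edit: R's new JIT might help you (you can read more about it here: Speed up your R code using a just-in-time (JIT) compiler). I also tried using the JIT, but it did not speed things up.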
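As a rough illustration of what byte-compiling a function looks like, here is a sketch using the compiler package that ships with R. The function `slow_sum` is a made-up example, not code from this thread. In R 3.4.0 and later the byte-code JIT is enabled by default, so an explicit call like this usually changes little.

```r
library(compiler)

# Plain R function with an explicit loop (illustrative only)
slow_sum <- function(x) {
  total <- 0
  for (v in x) total <- total + v
  total
}

# Byte-compile it explicitly; recent R versions do this automatically via the JIT
fast_sum <- cmpfun(slow_sum)

x <- as.numeric(1:1000)
# Both versions must return the same value; only the speed may differ
same <- identical(slow_sum(x), fast_sum(x))
print(same)
```

Timing both versions with system.time() on the real data is the only reliable way to see whether compilation helps in a given case.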

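Parallelizing your loop might also be worthwhile (but I won't go into that now).

If the timings are unrealistic, you could try the data.table package (though I have no experience with it), or even do it in SQL...

Good luck, Tal

Update: I used nico's example and wrapped the solution in a function. The timing is definitely fine, so no more advanced solution is needed...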
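On the vectorization question raised at the start of this answer, one possible formulation is to sort the rows so that each account's largest V1 comes first and then keep the first row per account. This is a sketch rather than the answer's own method; `most_active_month` and the toy data are hypothetical names and values.

```r
# Hypothetical helper: most active month per account without an explicit loop
most_active_month <- function(X) {
  # Sort by account, then by V1 from largest to smallest
  ord <- order(X$AccountID, -X$V1)
  sorted <- X[ord, ]
  # The first row of each account is now its maximum-V1 row
  first <- !duplicated(sorted$AccountID)
  data.frame(unique_ID = sorted$AccountID[first],
             M_max     = sorted$Month[first])
}

# Toy data with the same column names as in the question
X <- data.frame(AccountID = c(2, 1, 2, 1, 3),
                V1        = c(10, 5, 30, 7, 1),
                Month     = c(4, 9, 6, 2, 1))
res <- most_active_month(X)
print(res)
```

Unlike fo(), which returns accounts in order of first appearance, this version returns them sorted by AccountID. When an account has tied maxima, both approaches pick the row that appears first in the input.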

score 1 · Accepted Answer

This runs almost instantly on my laptop with 250,000 rows (and it is cleaner):

# Generate some random data
AccountID <- sample(1:50, 250000, replace=T)
V1 <- sample(1:100, 250000, replace=T)
Month <- sample(1:12, 250000, replace=T)

m <- data.frame(AccountID, V1, Month)

# Aggregate the data by month
V1.per.month <- aggregate(m$V1, sum, by=list(Month = m$Month))

Edit: On re-reading, I realized I forgot to take the accounts into account (pun intended).

This, however, should do:

V1.per.month <- aggregate(m$V1, sum, 
             by=list(Month = m$Month, Account= m$AccountID))
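The aggregate call above yields one total per month and account. To answer the original question, the month with the largest total still has to be picked for each account. A minimal sketch of that last step, assuming the default value column name `x` that aggregate produces for a vector input; the seed, the smaller data size, and the names `is.max` and `most.active` are choices made for this example.

```r
set.seed(1)  # fixed seed so the sketch is reproducible
m <- data.frame(AccountID = sample(1:5, 200, replace = TRUE),
                V1        = sample(1:100, 200, replace = TRUE),
                Month     = sample(1:12, 200, replace = TRUE))

# Total V1 per (Month, Account) pair; aggregate() names the value column "x"
V1.per.month <- aggregate(m$V1,
                          by  = list(Month = m$Month, Account = m$AccountID),
                          FUN = sum)

# For each account, keep the row(s) whose total equals that account's maximum
is.max <- V1.per.month$x == ave(V1.per.month$x, V1.per.month$Account, FUN = max)
most.active <- V1.per.month[is.max, ]
print(most.active)
```

If two months tie for an account's largest total, both rows are kept.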

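Timing plot (error bars are SD). As you can see, it takes about 2.5 seconds per 1 million rows, which I think is quite acceptable.

[Figure: elapsed time versus number of rows]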
