performance - How can I replace a for loop that runs thousands of times per data.frame row with vectorization?

Question

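I'm still wet behind the ears when it comes to R and, more importantly, vectorization, and I can't work out how to speed up the code below.

The for loop calculates the number of seeds falling onto a road for several road segments with different densities of seed-producing plants, by applying a random probability to every seed. Since my real data frame has about 200k rows and seed counts of up to 300k per segment, scaling up the example below would take several hours on my current machine.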

#Example data.frame
df <- data.frame(Density=c(0,0,0,3,0,120,300,120,0,0))
#Example SeedRain vector
SeedRainDists <- c(7.72,-43.11,16.80,-9.04,1.22,0.70,16.48,75.06,42.64,-5.50)

#Calculating the number of seeds from plant densities
df$Seeds <- df$Density * 500

#Applying a probability of reaching the road for every seed
df$SeedsOnRoad <- apply(as.matrix(df$Seeds),1,function(x){
    SeedsOut <- 0
    if(x>0){
        #Summing up the number of seeds reaching a certain distance
        for(i in 1:x){
            SeedsOut <- SeedsOut +
                ifelse(sample(SeedRainDists,1,replace=T)>40,1,0)
        }
    }
    return(SeedsOut)
})

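If anyone could give me a hint on how to replace the loop with vectorization, or on how to organize the data better in the first place to improve performance, I would be very grateful!

Edit: Roland's answer showed that I may have oversimplified the question. In the for loop I draw a random value from a distribution of distances recorded by another author (which is why I can't provide the data here). I have added an example vector with plausible values for SeedRain distances.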

score 5 · Accepted Answer

This should do the same simulation:

df$SeedsOnRoad2 <- sapply(df$Seeds,function(x){
  rbinom(1,x,0.6)
})



#   Density  Seeds SeedsOnRoad SeedsOnRoad2
#1        0      0           0            0
#2        0      0           0            0
#3        0      0           0            0
#4        3   1500         892          877
#5        0      0           0            0
#6      120  60000       36048        36158
#7      300 150000       90031        89875
#8      120  60000       35985        35773
#9        0      0           0            0
#10       0      0           0            0
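The same binomial idea can be adapted to the example SeedRainDists vector that was added to the question. The sketch below assumes each seed is an independent uniform draw (with replacement) from SeedRainDists, so the chance of landing beyond 40 is simply the share of values above 40. The names p_road and SeedsOnRoadBinom are illustrative and not part of the original code. The counts follow the same distribution as the loop but will not match it draw for draw.

```r
# Example data from the question
df <- data.frame(Density = c(0, 0, 0, 3, 0, 120, 300, 120, 0, 0))
SeedRainDists <- c(7.72, -43.11, 16.80, -9.04, 1.22, 0.70, 16.48, 75.06, 42.64, -5.50)
df$Seeds <- df$Density * 500

# Share of recorded distances beyond 40 (assumption: every recorded
# distance is equally likely to be drawn for a given seed)
p_road <- mean(SeedRainDists > 40)

set.seed(1)
# rbinom() is vectorised over `size`, so a single call covers every row
df$SeedsOnRoadBinom <- rbinom(nrow(df), size = df$Seeds, prob = p_road)
df
```

Because rbinom() accepts a vector for size, no sapply() wrapper is needed here.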

score 4 · Accepted Answer

One option is to generate the sample() for all the Seeds of each row of df in one go.

Using set.seed(1) before the loop-based code I get:

> df
   Density  Seeds SeedsOnRoad
1        0      0           0
2        0      0           0
3        0      0           0
4        3   1500         289
5        0      0           0
6      120  60000       12044
7      300 150000       29984
8      120  60000       12079
9        0      0           0
10       0      0           0

If I do this instead, I get the same answer in far less time:

set.seed(1)
tmp <- sapply(df$Seeds, 
              function(x) sum(sample(SeedRainDists, x, replace = TRUE) > 40))

> tmp
 [1]     0     0     0   289     0 12044 29984 12079     0     0

For comparison:

df <- transform(df, GavSeedsOnRoad = tmp)
df

> df
   Density  Seeds SeedsOnRoad GavSeedsOnRoad
1        0      0           0              0
2        0      0           0              0
3        0      0           0              0
4        3   1500         289            289
5        0      0           0              0
6      120  60000       12044          12044
7      300 150000       29984          29984
8      120  60000       12079          12079
9        0      0           0              0
10       0      0           0              0
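To get a rough feel for the speed difference on your own machine, a small timing sketch along these lines may help. The helper names loop_count and vec_count and the size n_seeds are made up for illustration, and the timings will vary between machines, so no expected numbers are shown.

```r
SeedRainDists <- c(7.72, -43.11, 16.80, -9.04, 1.22, 0.70, 16.48, 75.06, 42.64, -5.50)
n_seeds <- 20000  # hypothetical seed count for a single row

# One sample() call per seed, as in the question
loop_count <- function(x) {
  SeedsOut <- 0
  for (i in seq_len(x)) {
    SeedsOut <- SeedsOut + (sample(SeedRainDists, 1, replace = TRUE) > 40)
  }
  SeedsOut
}

# One sample() call per row, as in this answer
vec_count <- function(x) sum(sample(SeedRainDists, x, replace = TRUE) > 40)

set.seed(1)
t_loop <- system.time(res_loop <- loop_count(n_seeds))
set.seed(1)
t_vec <- system.time(res_vec <- vec_count(n_seeds))

# Elapsed wall-clock seconds for each approach
print(c(loop = t_loop[["elapsed"]], vectorised = t_vec[["elapsed"]]))
```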

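The points to note here are:

Try to avoid calling a function repeatedly in a loop if the function is vectorised or can generate the entire end result in a single call. Here you were calling sample() Seeds times for each row of df, with each call returning a single sample from SeedRainDists. I make one sample() call per row of df, asking for a sample of size Seeds, so I call sample() 10 times where your code called it 271500 times.
Even if you do have to call a function repeatedly in a loop, move out of the loop any vectorised operation that could be applied to the entire result once the loop has finished. An example here is your accumulation of SeedsOut, which calls +() a large number of times.

It would be better to collect each SeedsOut value in a vector and then sum() that vector outside the loop. For example: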
```
SeedsOut <- numeric(length = x)
for(i in seq_len(x)) {
  SeedsOut[i] <- ifelse(sample(SeedRainDists,1,replace=TRUE)>40,1,0)
}
sum(SeedsOut)
```
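Note that R treats logicals as the numeric values 0 and 1 when they are used in any mathematical function. Hence
```
sum(ifelse(sample(SeedRainDists, 100, replace=TRUE)>40,1,0))
```
and
```
sum(sample(SeedRainDists, 100, replace=TRUE)>40)
```
give the same result when run with the same set.seed().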

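There may be a fancier way of doing the sampling that needs fewer calls to sample(). Indeed there is: sample(SeedRainDists, sum(Seeds), replace = TRUE) > 40, but then you need to take care to select the right elements of that vector for each row of df (not hard, just a bit cumbersome). Still, what I show may well be efficient enough?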
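A minimal sketch of that single-call variant, using a cumulative sum to pick out each row's block of draws. The names hits, cs, ends, starts and SeedsOnRoadOneCall are illustrative only. This builds one vector of length sum(Seeds), so for data on the scale described in the question memory may become the limiting factor.

```r
# Example data from the question
df <- data.frame(Density = c(0, 0, 0, 3, 0, 120, 300, 120, 0, 0))
SeedRainDists <- c(7.72, -43.11, 16.80, -9.04, 1.22, 0.70, 16.48, 75.06, 42.64, -5.50)
df$Seeds <- df$Density * 500

set.seed(1)
# One sample() call for all seeds of all rows
hits <- sample(SeedRainDists, sum(df$Seeds), replace = TRUE) > 40

# Running total of hits, with a leading 0 so that block boundaries
# can be looked up directly
cs <- c(0, cumsum(hits))

# Row i owns draws (starts[i] + 1) to ends[i]; rows with zero seeds
# have starts[i] == ends[i] and therefore get a count of 0
ends <- cumsum(df$Seeds)
starts <- ends - df$Seeds
df$SeedsOnRoadOneCall <- cs[ends + 1] - cs[starts + 1]
df
```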
