performance - Using sapply vs. for to efficiently write to preallocated data structures

Question

Assume I have a preallocated data structure that I want to write into for the sake of performance vs. growing the data structure over time. First I tried this using sapply:

set.seed(1)
count <- 5
pre <- numeric(count)

sapply(1:count, function(i) {
  pre[i] <- rnorm(1)
})
pre
# [1] 0 0 0 0 0


for(i in 1:count) {
  pre[i] <- rnorm(1)
}
pre
# [1] -0.8204684  0.4874291  0.7383247  0.5757814 -0.3053884

I assume this is because the anonymous function in sapply is in a different scope (or is it environment in R?) and as a result the pre object isn't the same. The for loop exists in the same scope/environment and so it works as expected.

I've generally tried to adopt the R mechanisms for iteration with apply functions vs. for, but I don't see a way around it here. Is there something different I should be doing or a better idiom for this type of operation?

As noted, my example is highly contrived, I have no interested in generaring normal deviates. Instead my actual code is dealing with a 4 column 1.5 million row dataframe. Previously I was relying on growing and merging to get a final dataframe and decided to try to avoid merges and preallocate based on benchmarking.

score 7 · Accepted Answer

sapply不应该那样使用。它已经预先分配了结果。

无论如何，for 循环不太可能是导致性能下降的原因。这可能是因为您反复对 data.frame 进行子集化。例如：

set.seed(21)
N <- 1e4
d <- data.frame(n=1:N, s=sample(letters, N, TRUE))
l <- as.list(d)
set.seed(21)
system.time(for(i in 1:N) { d$n[i] <- rnorm(1); d$s <- sample(letters,1) })
#   user  system elapsed 
#   6.12    0.00    6.17 
set.seed(21)
system.time(for(i in 1:N) { l$n[i] <- rnorm(1); l$s <- sample(letters,1) })
#   user  system elapsed 
#   0.14    0.00    0.14 
D <- as.data.frame(l, stringsAsFactors=FALSE)
identical(d,D)
# [1] TRUE

因此，您应该遍历各个向量并在循环后将它们组合成一个 data.frame 。

score 3 · Accepted Answer

该apply系列不适用于产生副作用的任务，例如更改变量的状态。这些函数旨在简单地返回值，然后将其分配给变量。这与 R 部分赞同的功能范式是一致的。如果您按预期使用这些功能，预分配不会出现太多，这是它们吸引力的一部分。您可以轻松地做到这一点，而无需预先分配：p <- sapply(1:count, function(i) rnorm(1))。但是这个例子有点人为——p <- rnorm(5)是你会使用的。

如果您的实际问题与此不同，并且您遇到效率问题，请查看vapply. 它就像sapply，但允许您指定生成的数据类型，从而使其具有速度优势。如果这没有帮助，请查看软件包data.table或ff.

score 2 · Accepted Answer

是的，您实际上是在更改pre匿名函数的局部变量 a，该函数本身将返回最后一次评估的结果（长度为 1 的向量），因此sapply()将正确的解作为向量返回（因为它累积了单个长度为 1 的向量）但它不会改变pre全局工作区中的。

您可以使用<<-运算符解决此问题：

set.seed(1)
count <- 5
pre <- numeric(count)

sapply(1:count, function(i) {
  pre[i] <<- rnorm(1)
})
> pre
[1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078

这已经改变了pre，但出于各种原因，我会避免这样做。

在这种情况下，我认为在这种情况下预分配不会有太多pre好处sapply()。

此外，对于这个例子，两者都非常低效。只需rnorm()生成count随机数。但我想这个例子只是为了说明这一点？

score 1 · Accepted Answer

我不确定你在问什么。在这种情况下，sapply 的传统习语是

pre <- sapply( 1:count, function(x) rnorm(1) )

在那里，您根本不必预先分配，但您不受限制使用预先分配的变量。

我猜如果你提出你想要改变的实际循环，事情会更清楚。你说你遇到了性能问题，你可能会在这里得到一个可以真正优化事情的答案。有一些回答者喜欢这样的挑战。

听起来你有一个很长的函数或循环。apply 系列函数主要用于表达性，并允许您在混合矢量化函数和无法实现的事物时使其更加清晰。与矢量化函数混合的多个小sapply调用比 R 中的一个大循环快得多。

performance - Using sapply vs. for to efficiently write to preallocated data structures

4 回答 4

Related

Reference