
What is the best R idiom to compute sums of elements within a sliding window?

Conceptually I want the following:

for (i in 1:(length(input) - lag + 1))
  output[i] <- sum(input[i:(i + lag - 1)])

In other words, every output element should be the sum of a fixed number of input elements (called lag here), resulting in an appropriately shorter result vector. I know that I can theoretically write this as

output = diff(cumsum(c(0, input)), lag = lag)

but I am worried about the precision here. I have a setup in mind where all the values would have the same sign, and the vectors would be pretty large. Summing the values up front might lead to pretty large numbers, so there won't be many significant digits left for the individual differences. This feels bad.

I would imagine that it should be possible to do better than that, at least by using a single function instead of two. An implementation could maintain the current sum, always adding one element and subtracting another at each iteration. Since that would still accumulate rounding errors along the way, one could perform the computations separately from both ends, and if the results at the center were too far off, compute a fresh result from the center and thus increase precision in a divide-and-conquer approach.
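A minimal sketch of the running-sum part of that idea (my own illustration, not an existing implementation; the refresh interval is an arbitrary choice, and the two-ended divide-and-conquer refinement is omitted):

running_window_sum <- function(input, lag, refresh = 1024L) {
  # Sliding-window sum that maintains one running total, adding the
  # incoming element and subtracting the outgoing one at each step.
  # Every `refresh` steps it recomputes the sum from scratch so that
  # accumulated rounding error is discarded rather than carried along.
  n <- length(input) - lag + 1
  output <- numeric(n)
  s <- sum(input[1:lag])
  output[1] <- s
  if (n > 1) for (i in 2:n) {
    if (i %% refresh == 0) {
      s <- sum(input[i:(i + lag - 1)])  # fresh sum, drops accumulated drift
    } else {
      s <- s - input[i - 1] + input[i + lag - 1]
    }
    output[i] <- s
  }
  output
}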

Do you know of any implementation which does anything like this?
Or is there a reason why this won't work as I think it should?
Or perhaps a reason why the diff(cumsum(…)) approach isn't as bad as it seems?


Edit: I had some off-by-one mistakes in my above formulations, making them inconsistent. Now they seem to agree on test data. lag should be the number of elements summed, and I'd expect a shorter vector as a result. I'm not dealing with time series objects, so absolute time alignment is not that relevant to me.

I had seen some noisy-looking things in my real data, which I had assumed to be due to such numeric problems. Since several different approaches to compute these values, using different suggestions from answers and comments, still led to similar results, it might be that the strangeness of my data is not in fact due to numeric issues.

So in order to evaluate answers, I used the following setup:

library(Rmpfr)
library(caTools)

len <- 1024*1024*8
lag <- 3
precBits <- 128
taillen <- 6

set.seed(42) # reproducible
input <- runif(len)
input <- input + runif(len, min=-1e-9, max=1e-9) # use >32 bits

options(digits = 22)

# Reference: sum everything separately using high precision.
output <- mpfr(rep(0, taillen), precBits = precBits)
for (i in 1:taillen)
  output[i] <- sum(mpfr(input[(len-taillen+i-lag+1):(len-taillen+i)],
                        precBits=precBits))
output

addResult <- function(data, name) {
  n <- c(rownames(resmat), name)
  r <- rbind(resmat, as.numeric(tail(data, taillen)))
  rownames(r) <- n
  assign("resmat", r, parent.frame())
}

# reference solution, rounded to nearest double, assumed to be correct
resmat <- matrix(as.numeric(output), nrow=1)
rownames(resmat) <- "Reference"

# my original solution
addResult(diff(cumsum(c(0, input)), lag=lag), "diff+cumsum")

# filter as suggested by Matthew Plourde
addResult(filter(input, rep(1, lag), sides=1)[lag:length(input)], "filter")

# caTools as suggested by Joshua Ulrich
addResult(lag*runmean(input, lag, alg="exact", endrule="trim"), "caTools")

The result for this looks as follows:

                               [,1]                    [,2]
Reference   2.380384891521345469556 2.036472557725210297264
diff+cumsum 2.380384892225265502930 2.036472558043897151947
filter      2.380384891521345469556 2.036472557725210741353
caTools     2.380384891521345469556 2.036472557725210741353
                               [,3]                    [,4]
Reference   1.999147923481302324689 1.998499369297661143463
diff+cumsum 1.999147923663258552551 1.998499369248747825623
filter      1.999147923481302324689 1.998499369297661143463
caTools     1.999147923481302324689 1.998499369297661143463
                               [,5]                    [,6]
Reference   2.363071143676507723796 1.939272651346203080180
diff+cumsum 2.363071143627166748047 1.939272651448845863342
filter      2.363071143676507723796 1.939272651346203080180
caTools     2.363071143676507723796 1.939272651346203080180

The result indicates that diff+cumsum is still surprisingly accurate. (It appeared even more accurate before I thought of adding that second runif vector.) filter and caTools both are almost indistinguishable from the perfect result. As for performance, I haven't tested that (yet). I only know that the Rmpfr cumsum with 128 bits was slow enough that I didn't feel like waiting on its completion. Feel free to edit this question if you have a performance benchmark, or new suggestions to add to the comparison.
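In the absence of a proper benchmark, a crude single-run timing sketch along these lines (using only base system.time on the same inputs; the microbenchmark package would give more reliable numbers) could serve as a starting point:

library(caTools)

len <- 1024 * 1024 * 8
lag <- 3
set.seed(42)
input <- runif(len)

# One run each; results will vary between machines and runs.
system.time(diff(cumsum(c(0, input)), lag = lag))
system.time(filter(input, rep(1, lag), sides = 1)[lag:len])
system.time(lag * runmean(input, lag, alg = "exact", endrule = "trim"))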


2 Answers


This answer is based on a comment by Joshua Ulrich.

caTools provides a function runmean which computes my partial sums, divided by the window size (or rather by the number of non-NA elements in the relevant window). Quoting from its documentation:

In case of the runmean(..., alg="exact") function, a special algorithm is used (see the references section) to ensure that round-off errors do not accumulate. As a result, runmean is more accurate than the filter(x, rep(1/k,k)) and runmean(..., alg="C") functions.

Note

The function runmean(..., alg="exact") is based on code by Vadim Ogranovich, which in turn is based on Python code (see the last reference), pointed out by Gabor Grothendieck.

References

The code stores the sum of the current window as a sequence of double precision floating point values, where the smaller values represent the round-off errors incurred by the larger elements. So even though the input data is processed in a single pass, adding one element and removing another at each step, there should not be any accumulation of rounding errors. The final result should be as precise as double precision arithmetic can represent.
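The core building block of such a multi-term representation is an error-free addition step, as in Shewchuk-style partials summation. An illustrative R sketch (my own, not the package's actual code) of that step:

two_sum <- function(a, b) {
  # Error-free transformation: hi is the rounded double sum of a and b,
  # lo is the exact residual, so that hi + lo == a + b in real arithmetic.
  hi <- a + b
  bv <- hi - a
  lo <- (a - (hi - bv)) + (b - bv)
  c(hi = hi, lo = lo)
}

two_sum(1e16, 1)  # lo recovers the 1 that the rounded sum hi drops

Keeping a short vector of such non-overlapping terms instead of a single double is what lets the window sum stay exact while elements are added and removed.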

The algorithms other than exact, however, seem to produce somewhat different results, so I probably would not recommend those.

It seems a bit of a pity that the source code contains a function runsum_exact, but commented out. The division to obtain the mean, together with the multiplication to get back to sums, introduces rounding errors that could have been avoided. On this, the CHANGES file says:

11) caTools-1.11 (Dec 2010)

  • Fully retired runsum.exact, which had not worked for a while; use runmean with the "exact" option instead.

At the moment (caTools version 1.14, dated 2012-05-22), the package appears to be orphaned.

Answered 2013-06-06T09:40:18.513

I cannot say whether this is such an implementation, but there is

filter(input, sides=2, filter=rep(1, lag+1))

Looking at the body of filter, it appears the hard work gets passed off to the C routine C_rfilter, so perhaps you could examine that to see whether it satisfies your precision requirements. Otherwise, @JoshuaUlrich's suggestion sounds promising.
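One practical note on this suggestion: filter pads with NA wherever the window extends past either end of the input, so to get the shorter result vector the question asks for, those entries need to be dropped. A small sketch (assuming, as in the question's edit, that the window length is the number of elements summed):

lag <- 3
input <- runif(16)

# sides=2 centers the window; NA entries appear where it overhangs the ends.
out <- filter(input, sides = 2, filter = rep(1, lag))
out <- as.numeric(out[!is.na(out)])  # drop the NA padding at both ends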

Answered 2013-05-30T19:33:47.157