r - 计算大数据的分位数

Question

我有大约 300 个文件，每个文件包含 1000 个时间序列实现（每个文件约 76 MB）。

我想从完整的 300000 个实现中计算每个时间步的分位数（0.05、0.50、0.95）。

我无法将实现合并到一个文件中，因为它会变得太大。

这样做最有效的方法是什么？

每个矩阵都是通过运行模型生成的，但是这里是一个包含随机数的示例：

x <- matrix(rexp(10000000, rate=.1), nrow=1000)

score 4 · Accepted Answer

至少有三个选项：

你确定它必须来自全套？10% 的样本在这里应该是一个非常非常好的近似值。
300k 个元素并不是向量那么大，但是 300k x 100+ 列矩阵很大。仅将您需要的列拉入内存而不是整个矩阵（如有必要，可以在每一列上重复）。
按顺序进行，可能与较小的样本结合使用，以帮助您在正确的范围内开始。对于第 5 个百分位数，您只需要知道有多少项目高于当前猜测值，有多少项目低于当前猜测值。所以像：
1. 取一个 1% 的样本，找出它的第 5 个百分位数。上下跳跃一些公差，这样您就可以确定确切的第 5 个百分位在该范围内。
2. 分块读入矩阵。对于每个块，计算范围以上和范围以下的观察次数。然后保留该范围内的所有观测值。
3. 当您阅读最后一个块时，您现在拥有三条信息（上面的计数，下面的计数，内部的观察向量）。获取分位数的一种方法是对整个向量进行排序并找到第 n 个观测值，您可以使用上述信息来做到这一点：对范围内的观测值进行排序，然后找到第 (n-count_below) 个。

编辑：（3）的示例。

请注意，我不是算法设计冠军，几乎可以肯定有人为此设计了更好的算法。此外，这种实现并不是特别有效。如果速度对您很重要，请考虑 Rcpp，或者甚至为此进行更优化的 R。制作一堆列表然后从中提取值并不是那么聪明，但是这种方式很容易制作原型，所以我就采用了。

library(plyr)

set.seed(1)

# -- Configuration -- #
desiredQuantile <- .25

# -- Generate sample data -- #

# Use some algorithm (sampling, iteration, or something else to come up with a range you're sure the true value lies within)
guessedrange <- c( .2, .3 )
# Group the observations to correspond to the OP's files
dat <- data.frame( group = rep( seq(100), each=100 ), value = runif(10000) )

# -- Apply the algorithm -- #

# Count the number above/below and return the values within the range, by group
res <- dlply( dat, .( group ), function( x, guessedrange ) {
  above <- x$value > guessedrange[2]
  below <- x$value < guessedrange[1]
  list(
    aboveCount  = sum( above ),
    belowCount = sum( below ),
    withinValues = x$value[ !above & !below ]
  )
}, guessedrange = guessedrange )
# Exract the count of values below and the values within the range
belowCount <- sum( sapply( res, function(x) x$belowCount ) )
belowCount
withinValues <- do.call( c, sapply( res, function(x) x$withinValues ) )
str(withinValues)
# Count up until we find the within value we want
desiredQuantileCount <- floor( desiredQuantile * nrow(dat) ) #! Should fix this so it averages when there's a tie
sort(withinValues)[ desiredQuantileCount - belowCount + 1 ]
# Compare to exact value
quantile( dat$value, desiredQuantile )

最后，该值与确切版本略有不同。我怀疑我被一个或一些同样愚蠢的解释所转移，但也许我错过了一些基本的东西。

r - 计算大数据的分位数

1 回答 1

Related

Reference