r - Different percentiles using ecdf() and dplyr::percent_rank()

Question

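I have been trying to compute percentiles for a large number of observations, and I came across two different ways of doing it. Because I am working with a panel data set, I want the percentiles computed within each time period. For that I relied on two earlier questions: one on using dplyr::percent_rank() to compute percentile ranks within groups, and "Percentile for Each Observation w/r/t Grouping Variable".

The problem is that the two commands clearly return different percentiles, and I would like to know whether both are "correct". To demonstrate: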

library(data.table)
library(plyr)   # ddply()
library(dplyr)  # percent_rank() comes from dplyr, not plyr

years  <- c(2006, 2006, 2006, 2006, 2001, 2001, 2001, 2001, 2001)
scores <- c(13, 65, 23, 34, 78, 56, 89, 98, 100)

dt <- data.table(years, scores)

# plyr: empirical CDF evaluated within each year
ddply(dt, .(years), transform, percentile = ecdf(scores)(scores))
# plyr: percent rank within each year
ddply(dt, .(years), transform, percentile = round(percent_rank(scores), 4))

# data.table: both measures side by side, grouped by year
dt[, .(scores,
       ecdf.percentile   = ecdf(scores)(scores),
       p.rank.percentile = round(percent_rank(scores), 4)),
   by = list(years)][order(years), ]

As you can see, the two columns are similar but not identical:

   years scores ecdf.percentile p.rank.percentile
1:  2001     78            0.40            0.2500
2:  2001     56            0.20            0.0000
3:  2001     89            0.60            0.5000
4:  2001     98            0.80            0.7500
5:  2001    100            1.00            1.0000
6:  2006     13            0.25            0.0000
7:  2006     65            1.00            1.0000
8:  2006     23            0.50            0.3333
9:  2006     34            0.75            0.6667
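
For reference, here is a minimal sketch in base R that recomputes both columns by hand for the 2001 group. It assumes the documented definitions: ecdf() returns the share of values less than or equal to x, while dplyr::percent_rank() rescales the minimum rank to [0, 1] as (rank - 1) / (n - 1). The names ecdf_by_hand and prank_by_hand are made up for illustration.

```r
# Scores of the 2001 group from the example above
scores <- c(78, 56, 89, 98, 100)
n <- length(scores)

# ecdf-style percentile: share of observations less than or equal to each score
ecdf_by_hand <- sapply(scores, function(s) sum(scores <= s)) / n

# percent_rank-style percentile (assumed definition): minimum rank rescaled to [0, 1]
prank_by_hand <- (rank(scores, ties.method = "min") - 1) / (n - 1)

print(data.frame(scores, ecdf_by_hand, prank_by_hand))
```

Under these assumptions the smallest score gets 1/n from the first formula and 0 from the second, which would account for the gap between the two columns in the table.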
