r - Assigning values based on a "deficient" lookup table

Question

I have to look up some scores and assign percentile values based on a fixed lookup table.

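I've been trying to solve this for a while, and I've read this and this SO thread, but neither solved my problem. The issue is that a raw score may be greater than every value in the lookup table, in which case the largest percentile value should be assigned.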

I have a lookup table like this,

lookup <- structure(list(Percentile = c(99, 95, 90, 85, 80, 75, 70, 65, 60, 55, 50, 45, 40, 35, 30, 25, 20, 15, 10, 5, 1), ACB = c(24, 19, 18, 17, 16, NA, 15, NA, 14, NA, NA, 13, NA, NA, NA, 12, NA, 11, 10, 9, 7), DFG = c(49, 39, 36, 33, 31, 30, 29, 28, 27, 26, 25, NA, 24, 23, 22, 21, 20, 19, 17, 14, 12), EIH = c(35, 30, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, NA, 14, NA, 13, 12, NA), GKJ = c(49, 39, 36, 33, 31, 30, 29, 28, 27, 26, 25, NA, 24, 23, 22, 21, 19, 18, 17, 15, 14), Total = c(112, 99, 91, 86, 82, 79, 76, 75, 73, 71, 69, 67, 66, 65, 63, 61, 59, 55, 51, 46, 39)), .Names = c("Percentile", "ACB", "DFG", "EIH", "GKJ", "Total"), row.names = c("99+", "95", "90", "85", "80", "75", "70", "65", "60", "55", "50", "45", "40", "35", "30", "25", "20", "15", "10", "5", "1"), class = "data.frame")
lookup
    Percentile ACB DFG EIH GKJ Total
99+         99  24  49  35  49   112
95          95  19  39  30  39    99
90          90  18  36  27  36    91
85          85  17  33  26  33    86
80          80  16  31  25  31    82
75          75  NA  30  24  30    79
70          70  15  29  23  29    76
65          65  NA  28  22  28    75
60          60  14  27  21  27    73
55          55  NA  26  20  26    71
50          50  NA  25  19  25    69
45          45  13  NA  18  NA    67
40          40  NA  24  17  24    66
35          35  NA  23  16  23    65
30          30  NA  22  15  22    63
25          25  12  21  NA  21    61
20          20  NA  20  14  19    59
15          15  11  19  NA  18    55
10          10  10  17  13  17    51
5            5   9  14  12  15    46
1            1   7  12  NA  14    39

And some raw data that looks like this,

rawS_1 <- structure(list(ACB = 28, DFG = 39, EIH = 31, GKJ = NA_real_, Total = NA_real_), .Names = c("ACB", "DFG", "EIH", "GKJ", "Total"), row.names = "RawScore for ID 1", class = "data.frame")
rawS_1
                  ACB DFG EIH GKJ Total
RawScore for ID 1  28  39  31  NA    NA

rawS_2 <- structure(list(ACB = 29, DFG = 51, EIH = 56, GKJ = 60, Total = 169), .Names = c("ACB", "DFG", "EIH", "GKJ", "Total"), row.names = "RawScore for ID 2", class = "data.frame")
rawS_2
                  ACB DFG EIH GKJ Total
RawScore for ID 2  29  51  56  60   169

And this is what I would like to get,

                  ACB DFG EIH GKJ Total
RawScore for ID 1  12  39  19  NA    NA
Percentile, ID 1   25  95  50  NA    NA
                  ACB DFG EIH GKJ Total
RawScore for ID 2  29  51  56  60   169
Percentile, ID 2   99  99  99  99    99

I tried merge() with all.x = TRUE and suffixes = c(".x", ".y"), but I keep getting results I don't want. Any help would be appreciated.

score 2 · Accepted Answer

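Rather than treating this as a merge, I think you're better off treating it as a problem of creating a function: you want a function that returns the percentile when given a raw value for (say) ACB. Fortunately, R has a function designed to build a function from a table of numbers: approxfun.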

The following code uses lapply to create one such function per column, then shows how to call the new functions:

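vars <- names(lookup)[-1]  # every column except Percentile
lookup_funs <- lapply(vars, function(var) {
  df <- data.frame(x = lookup[[var]], y = lookup$Percentile)
  df <- df[complete.cases(df), ]  # drop percentiles with no raw score in this column
  # step function; rule = 2 returns the nearest tabulated percentile outside the table's range
  approxfun(df$x, df$y, method = "constant", rule = 2)
})
names(lookup_funs) <- vars

lookup_funs$ACB(c(12, 29))  # 25 99
lookup_funs$Total(169)      # 99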
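
To connect this back to the layout asked for in the question, here is a minimal sketch that applies the per-column functions to one row of raw scores and stacks the percentiles underneath. It assumes an abbreviated copy of the lookup table (only ACB, EIH and Total); `make_lookup_fun`, `raw`, `pct` and `result` are names made up for this illustration, and the raw scores are taken from rawS_1.

```r
# Abbreviated copy of the question's lookup table (three score columns only).
lookup <- data.frame(
  Percentile = c(99, 95, 90, 85, 80, 75, 70, 65, 60, 55, 50, 45, 40, 35, 30, 25, 20, 15, 10, 5, 1),
  ACB   = c(24, 19, 18, 17, 16, NA, 15, NA, 14, NA, NA, 13, NA, NA, NA, 12, NA, 11, 10, 9, 7),
  EIH   = c(35, 30, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, NA, 14, NA, 13, 12, NA),
  Total = c(112, 99, 91, 86, 82, 79, 76, 75, 73, 71, 69, 67, 66, 65, 63, 61, 59, 55, 51, 46, 39)
)

# Hypothetical helper: build a step function raw score -> percentile for one column.
make_lookup_fun <- function(raw_col, pct_col) {
  keep <- !is.na(raw_col)  # ignore percentiles without a tabulated raw score
  approxfun(raw_col[keep], pct_col[keep], method = "constant", rule = 2)
}

vars <- setdiff(names(lookup), "Percentile")
lookup_funs <- lapply(lookup[vars], make_lookup_fun, pct_col = lookup$Percentile)

# One row of raw scores (values from rawS_1; Total is missing there).
raw <- data.frame(ACB = 28, EIH = 31, Total = NA_real_,
                  row.names = "RawScore for ID 1")

# Call the matching function for each column; a missing score stays NA.
pct <- sapply(names(raw), function(v) lookup_funs[[v]](raw[[v]]))

result <- rbind(raw, pct)
rownames(result) <- c("RawScore for ID 1", "Percentile, ID 1")
print(result)
```

Note that rule = 2 clamps at both ends, so a score below the smallest tabulated value would receive the lowest percentile in that column. Whether that is the desired behaviour at the low end is a judgment call the question does not settle.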

score 1 · Accepted Answer

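The basic strategy is to use !is.na(vec) to index both the score values and the percentile vector. Let's look at one case. Which result would you prefer for an input of 11 for ACB?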

> rev(lookup$Percentile[!is.na(lookup$ACB)])[
                findInterval( 11, c(-Inf,rev(lookup$ACB[!is.na(lookup$ACB)]), Inf))]
[1] 25
> rev(lookup$Percentile[!is.na(lookup$ACB)])[
                findInterval( 11, c(-Inf,rev(lookup$ACB[!is.na(lookup$ACB)]), Inf))-1]
[1] 15

For a single row of data, this gets you most of the way there:

> for (i in names(rawS_1)) {
+   keep <- !is.na(lookup[[i]])  # mask first, then reverse both vectors together
+   print(rawS_1[i])
+   print(rev(lookup$Percentile[keep])[findInterval(rawS_1[[i]], rev(lookup[[i]][keep]))])
+ }
                  ACB
RawScore for ID 1  28
[1] 99
                  DFG
RawScore for ID 1  39
[1] 95
                  EIH
RawScore for ID 1  31
[1] 95
                  GKJ
RawScore for ID 1  NA
[1] NA
                  Total
RawScore for ID 1    NA
[1] NA

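You do run into index overflow when subtracting 1 from the index at the high end of the scale, so you should probably pad the lookup vector with an extra element once you have decided which result you want.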
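
To make the edge behaviour concrete, here is a small sketch using only the five lowest ACB entries of the lookup table. The scores 3 and 40 are made up to probe below and above the tabulated range, and clamping the index is just one possible remedy, not something the answer prescribes.

```r
# Five lowest ACB rows of the question's lookup table, in ascending order.
breaks <- c(7, 9, 10, 11, 12)
pct    <- c(1, 5, 10, 15, 25)

# Below, inside and above the tabulated range (3 and 40 are made-up probes).
scores <- c(3, 11, 40)

# Padding with -Inf/Inf shifts every index up by one.
idx <- findInterval(scores, c(-Inf, breaks, Inf))
print(idx)           # 1 5 6
print(pct[idx])      # 1 25 NA: index 6 runs past the end of pct
print(pct[idx - 1])  # 15 25: index 0 is silently dropped, so one result goes missing

# One possible fix (an assumption about the desired behaviour): skip the
# padding and clamp the index into 1..length(pct).
clamped <- pmin(pmax(findInterval(scores, breaks), 1), length(pct))
print(pct[clamped])  # 1 15 25
```

Clamping gives out-of-range scores the nearest tabulated percentile at both ends. If a score below the table should instead count as missing, replacing the zero indices with NA would be the alternative.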

for (i in names(rawS_2)) {
  keep <- !is.na(lookup[[i]])  # mask first, then reverse both vectors together
  print(rawS_2[i])
  print(rev(lookup$Percentile[keep])[findInterval(rawS_2[[i]], rev(lookup[[i]][keep]))])
}
                  ACB
RawScore for ID 2  29
[1] 99
                  DFG
RawScore for ID 2  51
[1] 99
                  EIH
RawScore for ID 2  56
[1] 99
                  GKJ
RawScore for ID 2  60
[1] 99
                  Total
RawScore for ID 2   169
[1] 99
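
The loops above print one value at a time. As a rough sketch of how the same findInterval() idea could be wrapped up for several rows at once, the helper below handles one column and is then applied across all columns. It assumes an abbreviated lookup table (ACB and EIH only); `to_percentile` and `raw` are hypothetical names, and the helper sorts with order() rather than rev(), so it does not depend on the table being stored in descending order.

```r
# Abbreviated copy of the question's lookup table (two score columns only).
lookup <- data.frame(
  Percentile = c(99, 95, 90, 85, 80, 75, 70, 65, 60, 55, 50, 45, 40, 35, 30, 25, 20, 15, 10, 5, 1),
  ACB = c(24, 19, 18, 17, 16, NA, 15, NA, 14, NA, NA, 13, NA, NA, NA, 12, NA, 11, 10, 9, 7),
  EIH = c(35, 30, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, NA, 14, NA, 13, 12, NA)
)

# Hypothetical helper: map a vector of raw scores to percentiles for one column.
to_percentile <- function(score, raw_col, pct_col) {
  keep <- !is.na(raw_col)           # mask scores and percentiles with the same index
  ord  <- order(raw_col[keep])      # findInterval() needs ascending breakpoints
  idx  <- findInterval(score, raw_col[keep][ord])
  idx[which(idx == 0)] <- NA        # below the table: left undefined in this sketch
  pct_col[keep][ord][idx]           # scores above the table get the top percentile
}

# Raw scores for both IDs from the question, one row per ID.
raw <- data.frame(ACB = c(28, 29), EIH = c(31, 56),
                  row.names = c("ID 1", "ID 2"))

pct <- sapply(names(raw), function(v) {
  to_percentile(raw[[v]], lookup[[v]], lookup$Percentile)
})
rownames(pct) <- rownames(raw)
print(pct)
```

Because findInterval() returns the position of the last breakpoint at or below the score, any score at or above the largest tabulated value lands on the final position and therefore receives the highest percentile, which is the behaviour the question asks for.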
