r - How to fill missing values in a vector with the mean of the values before and after them

Question

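I am currently trying to impute values in a vector in R. The conditions for imputation are:

Find all NA values
Then check whether each one has an existing value directly before and after it
Also check that the value after the NA is greater than the value before the NA
If the conditions are met, compute the mean of the values before and after
Replace the NA with the imputed value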

# example one
input_one = c(1,NA,3,4,NA,6,NA,NA)

# example two
input_two = c(NA,NA,3,4,5,6,NA,NA)

# example three
input_three = c(NA,NA,3,4,NA,6,NA,NA)

I started writing code to detect the values that can be imputed, but I got stuck on the following.

# incomplete function to detect the values
sapply(split(!is.na(input[c(rbind(which(is.na(c(input)))-1, which(is.na(c(input)))+1))]), 
             rep(1:(length(!is.na(input[c(which(is.na(c(input)))-1, which(is.na(c(input)))+1)]))/2), each = 2)), all)

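However, this only detects the NAs that might be imputable, and it only works for example one. It is incomplete and, unfortunately, very hard to read and understand.

Any help with this would be much appreciated.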

score 1 · Accepted Answer

We can use dplyr's lag and lead functions for this:

input_three = c(NA,NA,3,4,NA,6,NA,NA)

library(dplyr)
ifelse(is.na(input_three) & lead(input_three) > lag(input_three),
       (lag(input_three)  + lead(input_three))/ 2,
       input_three)

Returns:

[1] NA NA  3  4  5  6 NA NA

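Edit

Explanation:

We use ifelse, which is the vectorized version of if, i.e. everything inside ifelse is applied to each element of the vector. First we test whether the element is NA and whether the following element is greater than the previous one. To get the previous and following elements we can use dplyr's lead and lag functions:

lag shifts a vector to the right (by 1 step by default):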

lag(1:5)

Returns:

[1] NA  1  2  3  4

lead shifts a vector to the left:

lead(1:5)

Returns:

[1]  2  3  4  5 NA

Now for the "test" clause of ifelse:

is.na(input_three) & lead(input_three) > lag(input_three)

Returns:

[1]    NA    NA FALSE FALSE  TRUE FALSE    NA    NA
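The NA entries in this result come from how R combines logical values with NA, not only from the data. A minimal base R sketch of the rules involved (the variable names are only illustrative):

```r
# Comparing with NA gives NA, so lead > lag is NA where a neighbour is missing
cmp <- 3 > NA

# TRUE & NA stays NA, but FALSE & NA is FALSE, because the result
# cannot become TRUE whatever the unknown value turns out to be
na_and_unknown    <- TRUE & cmp    # an NA element with an unknown comparison
value_and_unknown <- FALSE & cmp   # a non-NA element is never a candidate

# ifelse() returns NA wherever its test is NA
picked <- ifelse(c(TRUE, FALSE, NA), "mean", "original")
```

This is why positions 1, 2, 7 and 8 remain NA in the final result: the test is NA there, and ifelse passes that NA through.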

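Then, if the test clause of ifelse evaluates to TRUE, we return the sum of the previous and following elements divided by 2; otherwise we return the original element.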
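For reuse across all three example vectors, the same test can be wrapped in a function. The sketch below uses base R only; fill_single_na is a hypothetical name, and the two index expressions are stand-ins for dplyr's lag and lead, not the answer's own code.

```r
# fill_single_na is a hypothetical helper name, not part of dplyr or base R
fill_single_na <- function(x) {
  idx  <- seq_along(x)
  prev <- c(NA, x)[idx]        # element before each position (like dplyr::lag)
  nxt  <- c(x, NA)[idx + 1]    # element after each position (like dplyr::lead)
  # Average the neighbours only where x is NA and the next value is larger
  ifelse(is.na(x) & nxt > prev, (prev + nxt) / 2, x)
}

input_one   <- c(1, NA, 3, 4, NA, 6, NA, NA)
input_two   <- c(NA, NA, 3, 4, 5, 6, NA, NA)
input_three <- c(NA, NA, 3, 4, NA, 6, NA, NA)

res_one   <- fill_single_na(input_one)    #  1  2  3  4  5  6 NA NA
res_two   <- fill_single_na(input_two)    # NA NA  3  4  5  6 NA NA
res_three <- fill_single_na(input_three)  # NA NA  3  4  5  6 NA NA
```

Like the dplyr version, this only fills an isolated NA; in a run of two or more NAs each NA has an NA neighbour, so the run is left unchanged.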

score 1 · Accepted Answer

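Here is an example using the imputeTS library. It handles runs of multiple NAs in the series, makes sure the mean is only calculated when the next valid observation is greater than the last valid observation, and also ignores NAs at the beginning and end.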

library(imputeTS)
library(dplyr)  # lag() and lead() used below come from dplyr
myimpute <- function(series) {
    # Find where each NA is
    nalocations <- is.na(series)
    # Find the previous and the next observation for each row
    last1 <- lag(series)
    next1 <- lead(series)
    # Carry forward the last and next observations over sequences of NA
    # Each row will then get a last and next that can be averaged
    cflast <- na_locf(last1, na_remaining = 'keep')
    cfnext <- na_locf(next1, option = 'nocb', na_remaining = 'keep')
    # Make a data frame 
    df <- data.frame(series, nalocations, last1, cflast, next1, cfnext)
    # Calculate the mean where there is currently a NA
    # making sure that the next is greater than the last
    df$mean <- ifelse(df$nalocations, ifelse(df$cflast < df$cfnext, (df$cflast+df$cfnext)/2, NA), NA)
    imputedseries <- ifelse(df$nalocations, ifelse(!is.na(df$mean), df$mean, NA), series)
    #list(df,  imputedseries) # comment this in and return it to see the intermediate data frame for debugging
    imputedseries
}
myimpute(c(NA,NA,3,4,NA,NA,6,NA,NA,8,NA,7,NA,NA,9,NA,11,NA,NA))

# [1] NA NA  3  4  5  5  6  7  7  8 NA  7  8  8  9 10 11 NA NA
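The carry-forward idea above can also be sketched without any packages. This is a base R sketch under the same rules (fill a run of NAs with the mean of the enclosing values, only when the later value is larger); fill_na_runs is a hypothetical name and this is not the answer's own implementation.

```r
# fill_na_runs is a hypothetical name; base R only, no imputeTS or dplyr
fill_na_runs <- function(x) {
  n   <- length(x)
  pos <- seq_len(n)
  ok  <- !is.na(x)

  # Index of the nearest non-NA at or before each position (0 if none)
  last_idx <- cummax(ifelse(ok, pos, 0L))
  # Index of the nearest non-NA at or after each position (0 if none),
  # found by running the same scan on the reversed vector
  rev_idx  <- cummax(ifelse(rev(ok), pos, 0L))
  next_idx <- rev(ifelse(rev_idx > 0L, n + 1L - rev_idx, 0L))

  # Look up the values; index 0 maps to the NA placed in front
  last_val <- c(NA, x)[last_idx + 1L]
  next_val <- c(NA, x)[next_idx + 1L]

  # Fill only NAs enclosed by two values where the later one is larger;
  # which() drops positions where the comparison is NA
  fill <- which(!ok & next_val > last_val)
  x[fill] <- (last_val[fill] + next_val[fill]) / 2
  x
}

res_runs <- fill_na_runs(c(NA,NA,3,4,NA,NA,6,NA,NA,8,NA,7,NA,NA,9,NA,11,NA,NA))
```

On the series used above, this should reproduce the same result as myimpute, including the NA left at position 11 where the next value (7) is smaller than the last one (8).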

score 0 · Accepted Answer

Here is an alternative approach using zoo::rollapply():

library(zoo)

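# Caveat: a non-NA value with two non-NA neighbours is also replaced by their
# mean, and there is no check that the following value is larger.
fill_sandwiched_na <- function(f) rollapply(f, 3, FUN = function(x) {
  # mean of the window without its second element; fall back to that element if the mean is NA
  y <- mean(x[-2]); if(is.na(y)) x[2] else y
}, fill = NA, partial = TRUE)

fill_sandwiched_na(input_one)
# [1]  1  2  3  4  5  6 NA NA

fill_sandwiched_na(input_two)
# [1] NA NA  3  4  5  6 NA NA

fill_sandwiched_na(input_three)
# [1] NA NA  3  4  5  6 NA NA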

score 0 · Accepted Answer

The imputeTS package also has the na_ma function, which imputes by moving average.

For your case, it would be used with the following settings:

na_ma(x, k = 1, weighting = "simple")

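k = 1 (means the 1 value before and the 1 value after the NA are considered)
weighting = "simple" (computes the mean of these two values)

This is very easy to apply, basically just 1 line of code: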

library(imputeTS)
na_ma(yourData, k = 1, weighting = "simple")

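You can also choose to take more values before and after the NA into account, e.g. k = 3. If you consider more than 1 value on each side, an interesting feature is the choice of different weightings: for example, with weighting = "linear" the weights decrease in arithmetic progression (a Linear Weighted Moving Average), meaning that the further values are from the NA, the less influence they have.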
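To make the difference between equal and decreasing weights concrete, here is a small base R sketch on made-up data. The weights 3, 2, 1 are an assumption chosen purely for illustration; they are not necessarily the weights imputeTS applies, so check the package documentation for the exact scheme.

```r
# Window around a single NA with k = 3 values on each side (made-up data)
window   <- c(1, 2, 4, NA, 8, 16, 32)
center   <- 4
distance <- abs(seq_along(window) - center)   # 3 2 1 0 1 2 3
keep     <- !is.na(window)

# Simple weighting: every neighbour counts equally
simple_fill <- mean(window[keep])

# Illustrative linearly decreasing weights: 3, 2, 1 for distance 1, 2, 3.
# ASSUMPTION: chosen only to show the idea, not taken from imputeTS
w <- 4 - distance
linear_fill <- weighted.mean(window[keep], w[keep])
```

With these numbers the simple mean is 10.5 and the weighted mean is 8.75: the distant value 32 pulls the estimate up less once nearer neighbours carry more weight.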
