r - How to average outliers in an R dataset with the preceding and following data points?

Question

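I have a large dataset and define outliers as values above the 99th or below the 1st percentile.

I want to average each outlier with the data points immediately before and after it, and then replace all 3 values with that average in a new dataset.

If anyone knows how to do this, I would be very grateful for a reply.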

score 4 · Accepted Answer

Suppose you have a list of indices giving the positions of the outliers in the vector, obtained for example with:

out_idx = which(df$value > quan0.99)

You can then do the following:

for(idx in out_idx) {
  vec[(idx-1):(idx+1)] = mean(vec[(idx-1):(idx+1)])
}
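The two snippets above assume that vec and quan0.99 already exist. As a minimal end-to-end sketch, assuming the threshold comes from base R's quantile() and using a small made-up vector (the data are purely illustrative), the steps could look like this:

```r
# Hypothetical example data, chosen so that exactly one value is extreme.
vec <- c(1, 2, 3, 100, 5, 6, 7, 8, 9, 10)

# 99th percentile of the data (assumption: this is what quan0.99 stands for).
quan0.99 <- quantile(vec, 0.99)

# Positions of values above the threshold; here only the 4th element.
out_idx <- which(vec > quan0.99)

# Replace each outlier and its two neighbours with their common mean.
for (idx in out_idx) {
  vec[(idx - 1):(idx + 1)] <- mean(vec[(idx - 1):(idx + 1)])
}

vec
# vec is now 1 2 36 36 36 6 7 8 9 10
```

Here the outlier 100 and its neighbours 3 and 5 are all replaced by their mean, 36.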

You can wrap this in a function, making the bandwidth and the function optional arguments:

average_outliers = function(vec, outlier_idx, bandwith = 1, func = "mean") {
  # Iterate over the outlier positions passed in as outlier_idx
  for(idx in outlier_idx) {
    # Slicing can be used to extract values or, as in this case, to assign
    # values to that slice. do.call is used to call e.g. the mean function
    # with the values in the window as input.
    window = (idx-bandwith):(idx+bandwith)
    vec[window] = do.call(func, list(vec[window]))
  }
  return(vec)
}
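One caveat with a window of the form (idx-bandwith):(idx+bandwith): in R, reading past the end of a vector yields NA and assigning past the end lengthens the vector, so an outlier at the very end would give an NA average and a longer result. The following is a sketch of a boundary-safe variant; the name average_outliers_safe and the clamping strategy are assumptions for illustration, not part of the original answer. The argument spelling bandwith is kept to match the code above.

```r
# Hypothetical boundary-safe variant (name and clamping are assumptions).
average_outliers_safe <- function(vec, outlier_idx, bandwith = 1, func = "mean") {
  f <- match.fun(func)  # accepts a function or its name as a string
  n <- length(vec)
  for (idx in outlier_idx) {
    # Clamp the window to the valid positions 1..n
    window <- max(1, idx - bandwith):min(n, idx + bandwith)
    vec[window] <- f(vec[window])
  }
  vec
}

x <- c(1, 2, 3, 4, 100)
average_outliers_safe(x, 5)
# the last two values become mean(c(4, 100)) = 52; the length stays 5
```

At the edges the average is taken over fewer points, which is one possible choice; skipping edge outliers entirely would be another.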

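Wrapping it in average_outliers this way also lets you use, for example, the median or a bandwidth of 2. Use the function like this: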

# Compute both thresholds up front, then call average_outliers twice,
# first for the 0.99 quantile, then for the 0.01 quantile.
quan0.99 = quantile(vec, 0.99)
quan0.01 = quantile(vec, 0.01)
vec = average_outliers(vec, which(vec > quan0.99))
vec = average_outliers(vec, which(vec < quan0.01))

Or:

vec = average_outliers(vec, which(vec > quan0.99), bandwith = 2, func = "median")
vec = average_outliers(vec, which(vec < quan0.01), bandwith = 2, func = "median")

This uses a bandwidth of 2 and replaces the values with the median.
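To see how the two replacement functions differ, here is a small sketch on a single made-up window of five values, as a bandwidth of 2 would produce; the numbers are illustrative only:

```r
# Hypothetical window of five values around an outlier (bandwith = 2).
window <- c(3, 4, 100, 6, 7)

mean(window)    # 24: pulled strongly towards the outlier
median(window)  # 6: stays close to the neighbouring values
```

With the mean, the outlier still influences the replacement value; with the median it has much less effect, at the cost of flattening the whole window to one neighbouring value.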
