r - How to identify and remove outliers from a data.frame using R?

Question

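I have a data frame with several outliers. I suspect these outliers are producing results that differ from what I expect.

I tried this tip, but it did not work because I still have very different values: https://www.r-bloggers.com/2020/01/how-to-remove-outliers-in-r/

I tried a solution using the rstatix package, but I could not remove the outliers from my data.frame: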

library(rstatix)
library(dplyr)

df <- data.frame(
  sample = 1:20,
  score = c(rnorm(19, mean = 5, sd = 2), 50))

View(df)

out_df <- identify_outliers(df$score)  # identify outliers

df2 <- df  # copy df

df2 <- df2[-which(df2$score %in% out_df), ]  # remove outliers from df2

View(df2)

score 1 · Accepted Answer

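identify_outliers expects a data.frame as input, i.e. the usage is

identify_outliers(data, ..., variable = NULL)

where

... - One unquoted expression (or variable name). Used to select the variable of interest. Alternative to the argument variable.

Applied to the data in the question: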

df2 <- subset(df, !score %in% identify_outliers(df, "score")$score)
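
As a possible alternative, here is a minimal sketch that filters by row instead of by value. It relies on is_outlier(), the vector helper that rstatix exports alongside identify_outliers(), which returns TRUE/FALSE per element (by default with the 1.5 x IQR boxplot rule). The data frame `toy` and its values are made up for illustration and are not the data from the question.

```r
library(rstatix)

# Hypothetical toy data: nine ordinary scores and one obvious outlier
toy <- data.frame(
  sample = 1:10,
  score  = c(4, 5, 6, 5, 4, 6, 5, 5, 4, 50)
)

# identify_outliers() takes the data frame plus the column to inspect and
# returns only the flagged rows, with is.outlier and is.extreme columns added
flagged <- identify_outliers(toy, score)

# is_outlier() works on a plain numeric vector and gives one logical per row,
# so negating it keeps the rows that are not outliers
keep <- !is_outlier(toy$score)
toy_clean <- toy[keep, ]
```

Because the mask is logical, this form also behaves sensibly when nothing is flagged: every row is kept.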

score 1 · Accepted Answer

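As a rule of thumb, data points above Q3 + 1.5 x IQR or below Q1 - 1.5 x IQR are considered outliers, where Q1 and Q3 are the first and third quartiles and IQR = Q3 - Q1. So you just need to identify them and remove them. I don't know how to do this with rstatix, but it can be done in base R as in the following example: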

# Generate demo data
set.seed(123)
demo.data <- data.frame(
  sample = 1:20,
  score = c(rnorm(19, mean = 5, sd = 2), 50),
  gender = rep(c("Male", "Female"), each = 10)
)
# identify outliers: quantile(x)[2] is Q1 and quantile(x)[4] is Q3
outliers <- which(demo.data$score > quantile(demo.data$score)[4] + 1.5*IQR(demo.data$score) |
                  demo.data$score < quantile(demo.data$score)[2] - 1.5*IQR(demo.data$score))

# remove them from your dataframe
# (guard against the empty case: demo.data[-integer(0), ] would return zero rows)
df2 = if (length(outliers) > 0) demo.data[-outliers, ] else demo.data

Or, wrapped in a neater function that returns the indices of the outliers:

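get_outliers = function(x){
   # indices of values above Q3 + 1.5*IQR or below Q1 - 1.5*IQR
   which(x > quantile(x)[4] + 1.5*IQR(x) | x < quantile(x)[2] - 1.5*IQR(x))
}

outliers <- get_outliers(demo.data$score)

# drop the flagged rows; keep everything when no outliers were found
df2 = if (length(outliers) > 0) demo.data[-outliers, ] else demo.data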
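
The same rule can also be sketched with a logical mask instead of indices. Negative indexing with an empty index vector (for example `demo.data[-integer(0), ]`) selects zero rows, so a mask is a little safer when a column may contain no outliers. The function name `is_outlier_iqr`, the multiplier argument `k`, and the toy data below are all hypothetical and chosen only for illustration.

```r
# TRUE for values outside [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is the usual rule of thumb
is_outlier_iqr <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)  # Q1 and Q3
  fence <- k * (q[2] - q[1])                              # k times the IQR
  x < q[1] - fence | x > q[2] + fence
}

# Hypothetical data with one obvious outlier (50)
toy <- data.frame(sample = 1:10, score = c(4, 5, 6, 5, 4, 6, 5, 5, 4, 50))
toy_clean <- toy[!is_outlier_iqr(toy$score), ]

# Hypothetical data without outliers: the mask is all FALSE, so every row is kept
no_out <- data.frame(sample = 1:3, score = c(4, 5, 6))
no_out_clean <- no_out[!is_outlier_iqr(no_out$score), ]

# Base R's boxplot.stats() reports the outlying values directly; it uses
# Tukey's hinges rather than quantile(), so borderline points may differ
flagged_values <- boxplot.stats(toy$score)$out
```

Raising `k` (3 is a common choice for "extreme" points) makes the filter more conservative.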
