r - calculating the outliers in R

Question

I have a data frame like this:

x

Team 01/01/2012  01/02/2012  01/03/2012  01/01/2012 01/04/2012 SD Mean
A     100         50           40        NA         30       60  80

I like to perform calculation on each cell to the mean and sd to calculate the outliers. For example,

abs(x-Mean) > 3*SD

x$count<-c(1) (increment this value if the above condition is met).

I am doing this to check the anomaly in my data set. If I know the column names, it would be easier to do the calculations, but number of columns will vary. Some cells may have NA in them.

I like to subtrack mean from each cell, and I tried this

x$diff<-sweep(x, 1, x$Mean, FUN='-')

does not seem to be working, any ideas?

score 47 · Accepted Answer

使用以下方法获取您的 IQR（四分位数范围）和下/上四分位数：

lowerq = quantile(data)[2]
upperq = quantile(data)[4]
iqr = upperq - lowerq #Or use IQR(data)

计算温和异常值的界限：

mild.threshold.upper = (iqr * 1.5) + upperq
mild.threshold.lower = lowerq - (iqr * 1.5)

在这些值之外的任何数据点（>mild.threshold.upper 或 <mild.threshold.lower）都是温和异常值

要检测极端异常值，请执行相同的操作，但要乘以 3：

extreme.threshold.upper = (iqr * 3) + upperq
extreme.threshold.lower = lowerq - (iqr * 3)

在这些值之外的任何数据点（>extreme.threshold.upper 或 <extreme.threshold.lower）都是极端异常值

希望这可以帮助

编辑：正在访问 50%，而不是 75%

score 9 · Accepted Answer

我已经使用上面@by0 的答案创建了一个自动删除异常值的函数。这是函数和一些示例代码：

# generate 10 random numbers and 2 'outlier' numbers
testData <- c(-42,rnorm(10),42)

# show the numbers
testData

# define a function to remove outliers
FindOutliers <- function(data) {
  lowerq = quantile(data)[2]
  upperq = quantile(data)[4]
  iqr = upperq - lowerq #Or use IQR(data)
  # we identify extreme outliers
  extreme.threshold.upper = (iqr * 3) + upperq
  extreme.threshold.lower = lowerq - (iqr * 3)
  result <- which(data > extreme.threshold.upper | data < extreme.threshold.lower)
}

# use the function to identify outliers
temp <- FindOutliers(testData)

# remove the outliers
testData <- testData[-temp]

# show the data with the outliers removed
testData

score 4 · Accepted Answer

我看到你问了一些关于按行做事的问题。你应该避免这种情况。R 遵循列代表变量和行代表观察的概念。许多功能都根据这个概念进行了优化。如果您需要将宽输出或转置输出到文件，您可以在写入文件之前重新排列数据。

我假设您的数据实际上看起来如问题所示，但您有不止一行。

df <- read.table(text="Team 01/01/2012  01/02/2012  01/03/2012  01/01/2012 01/04/2012 SD 

Mean
A     100         50           40        NA         30       60  80
B     200         40           5         8          NA       NA  NA",check.names = FALSE,header=TRUE)

#needed because one date appears twice
df <- df[,]

#reshape the data
library(reshape2)
df <- melt(df,id="Team")
names(df)[2] <- "Date"

#remove the SD and Mean
df <- df[!df$Date %in% c("SD","Mean"),]

#function to detect outliers
outfun <- function(x) {
  abs(x-mean(x,na.rm=TRUE)) > 3*sd(x,na.rm=TRUE)
}

#test if function works
outfun(c(200,rnorm(10)))

#use function over all data
df3$outlier.all <- outfun(df3$value)

#apply function for each team 
library(plyr)
df3 <- ddply(df3,.(Team),transform,outlier.team=outfun(value))

结果：

           Date Team value outlier.all outlier.team
1    01/01/2012    A   100       FALSE        FALSE
2    01/02/2012    A    50       FALSE        FALSE
3    01/03/2012    A    40       FALSE        FALSE
4  01/01/2012.1    A    NA          NA           NA
5    01/04/2012    A    30       FALSE        FALSE
6    01/01/2012    B   200       FALSE        FALSE
7    01/02/2012    B    40       FALSE        FALSE
8    01/03/2012    B     5       FALSE        FALSE
9  01/01/2012.1    B     8       FALSE        FALSE
10   01/04/2012    B    NA          NA           NA

score 2 · Accepted Answer

以下公式可用于确定哪些值是异常值：

upper.outlier.calc <- function(x.var, df){
  with(df, quantile(x.var, 0.75) + (1.5 * (quantile(x.var, 0.75) - quantile(x.var, 0.25))))
}

lower.outlier.calc <- function(x.var, df){
  with(df, quantile(x.var, 0.25) - (1.5 * (quantile(x.var, 0.75) - quantile(x.var, 0.25))))
}

score 1 · Accepted Answer

查看我最复杂的功能。它具有三种方法（z mad，iqr），以及对异常值的不同处理（删除，或替换）。可以使用绘图并且可以进行黑客攻击（尝试不同的方法或阈值）

看例子：

set.seed(1234)
x = rnorm(10)
ez.outlier(iris,'Sepal.Length',fill='null',hack=T,cutoff=c(1,2,3),plot=T)

#' univariate outlier cleanup
#' @description univariate outlier cleanup
#' @param x a data frame or a vector
#' @param col colwise processing
#' \cr        col name
#' \cr        if x is not a data frame, col is ignored
#' \cr        could be multiple cols
#' @param method z score, mad, or IQR (John Tukey)
#' @param cutoff abs() > cutoff will be treated as outliers. Default/auto values (i.e. if NA):
#' \cr z 95% of values fall within 1.96, qnorm(0.025,lower.tail=F), or 3
#' \cr mad 2.5, which is the standard recommendation, or 5.2
#' \cr iqr 1.5
#' \cr if multiple values specified, use the first one (an exception is hack=T, during which method and cutoff same length or scalar)
#' @param hack call mapply to try all method and cutoff (same length or scalar, ie, different methods with 
#' corresponding cutoff, or same method with different cutoff).
#' @param plot boxplot and hist before and after outlier processing.
#' @param fillout how to process outlier, fill with na, mean, median (columnwise for data frame), or 
#' null --> remove outlier (only for vector or df with single col specified)
#' @return returns a new data frame or vector. If hack=T, returns nothings
#' @note univariate outlier approach
#' The Z-score method relies on the mean and standard deviation of a group of data to measure central
#' tendency and dispersion. This is troublesome, because the mean and standard deviation are highly
#' affected by outliers – they are not robust. In fact, the skewing that outliers bring is one of the
#' biggest reasons for finding and removing outliers from a dataset!
#' Another drawback of the Z-score method is that it behaves strangely in small datasets – in fact,
#' the Z-score method will never detect an outlier if the dataset has fewer than 12 items in it.
#' \cr
#' \cr
#' Median absolute deviation, modified z-score. The median and MAD are robust measures of central tendency and dispersion, respectively.
#' \cr
#' \cr
#' Interquartile range method is that, like the modified Z-score method, it uses a robust measure of dispersion.
#' \cr
#' @examples
#' set.seed(1234)
#' x = rnorm(10)
#' iris %>% ez.outlier('Sepal.Length',fill='null',hack=T,plot=T)
#' @export
ez.outlier = function(x, col=NULL, method=c('z','mad','iqr'), cutoff=NA, fillout=c('na','null','mean','median'), hack=FALSE, plot=FALSE, na.rm=TRUE, print2scr=TRUE) {
    # https://datascienceplus.com/rscript/outlier.R
    # https://cran.r-project.org/web/packages/outliers/index.html
    # https://rpubs.com/hauselin/outliersDetect

    if (hack==T){
            # here for programming reason, for mapply,
            # cutoff could not be NULL, use NA, because length(NULL)=0, but length(NA)=1
            mapply(ez.outlier,method=method,cutoff=cutoff,MoreArgs=list(x=x,col=col,hack=F,plot=plot,fillout=fillout,na.rm=na.rm,print2scr=print2scr),SIMPLIFY=F,USE.NAMES=F)
            cat('Hack done! No actual data returned.\n')
            return(invisible(NULL))
    }

 method = match.arg(method); fillout =fillout[1]; cutoff=cutoff[1]

    if (!is.data.frame(x)) {
        # todropna is a workaround for data frame with single col passed in

        x.bak.plot = x; x.replace.na = x; oldNAs = sum(is.na(x.replace.na))
        if (fillout=='na' | fillout=='todropna') {
            replacement = NA
        } else if (fillout=='mean') {
            replacement = mean(x, na.rm=na.rm)
        } else if (fillout=='median') {
            replacement = median(x, na.rm=na.rm)
        } else if (fillout=='null') {
            replacement = NULL
        }

        if (method=='z'){
            if(is.na(cutoff)) cutoff = qnorm(0.025,lower.tail=F)
            absz = abs((x - mean(x, na.rm=na.rm))/sd(x, na.rm=na.rm))
            if (!is.null(replacement)) {
                x[absz > cutoff] <- replacement
                } else {
                    # if nothing above cutoff, x is untouched
                    if (length(which(absz > cutoff)) > 0) {
                        x = x[-which(absz > cutoff)]
                    }
                }
            x.replace.na[absz > cutoff] <- NA
        } else if (method=='mad'){
            if(is.na(cutoff)) cutoff = 2.5
            absmad <- abs((x - median(x, na.rm=na.rm))/mad(x, na.rm=na.rm))
            if (!is.null(replacement)) {
                x[absmad > cutoff] <- replacement
                } else {
                    if (length(which(absmad > cutoff)) > 0) {
                        x = x[-which(absmad > cutoff)]
                    }
                }
            x.replace.na[absmad > cutoff] <- NA
        } else if (method=='iqr'){
            # https://stackoverflow.com/a/4788102/2292993
            if(is.na(cutoff)) cutoff = 1.5
            q1 <- quantile(x, 0.25, na.rm=na.rm)
            q3 <- quantile(x, 0.75, na.rm=na.rm)
            # alternatively iqr = q3-q1
            iqr = IQR(x, na.rm = na.rm)
            lower_bound = q1 - (iqr * cutoff)
            upper_bound = q3 + (iqr * cutoff)
            if (!is.null(replacement)) {
                x[(x > upper_bound) | (x < lower_bound)] <- replacement
                } else {
                    if (length(which((x > upper_bound) | (x < lower_bound))) > 0) {
                        x = x[-which((x > upper_bound) | (x < lower_bound))]
                    }
                }
            x.replace.na[(x.replace.na > upper_bound) | (x.replace.na < lower_bound)] <- NA
        }

        newNAs = sum(is.na(x.replace.na)) - oldNAs
        if (print2scr) {
            if (!is.null(col)) {
                cat(sprintf('%-15s %5s(%.2f): %3d outliers found and %s.\n', toString(col), toupper(method), cutoff, newNAs, ifelse((is.null(replacement)|fillout=='todropna'),'REMOVED','REPLACED')))
            } else {
                cat(sprintf('%5s(%.2f): %3d outliers found and %s.\n', toupper(method), cutoff, newNAs, ifelse((is.null(replacement)|fillout=='todropna'),'REMOVED','REPLACED')))
            }
        }

        if (plot){
            # mar controls margin size for individual plot it goes c(bottom, left, top, right)
            # oma is margin for the whole?
            opar = par(mfrow=c(2, 2), oma=c(0,0,1.5,0), mar = c(2,2,1.5,0.5))
            on.exit(par(opar))
            boxplot(x.bak.plot, main=sprintf("With outliers (n=%d)",length(x.bak.plot)))
            hist(x.bak.plot, main=sprintf("With outliers (n=%d)",length(x.bak.plot)), xlab=NULL, ylab=NULL)

            boxplot(x, main=sprintf("With outliers (n=%d)",length(x.bak.plot)-newNAs))
            hist(x, main=sprintf("With outliers (n=%d)",length(x.bak.plot)-newNAs), xlab=NULL, ylab=NULL)
            title(sprintf("%s Outlier Check: %s(%.2f)",toString(col), toupper(method), cutoff), outer=TRUE)
        }
    } else if (is.data.frame(x)) {
        if (length(col)>1 & fillout=='null') {
            cat('I do not know how to remove univariate outliers in multiple cols. fillout: null --> na ...\n')
            fillout='na'
        } else if (fillout=='null') {
            fillout='todropna'
        }
        # trick to pass actual col name
        x[col] = lapply(1:length(col), function(j) {ez.outlier(x=x[col][[j]],col=col[j],method=method,cutoff=cutoff,plot=plot,hack=hack,fillout=fillout,na.rm=na.rm,print2scr=print2scr)})
        if (fillout=='todropna') x=x[complete.cases(x[,col,drop=FALSE]),,drop=FALSE]
    } # end if
    return(invisible(x))
}

r - calculating the outliers in R

5 回答 5

Related

Reference