dput(df)
structure(list(Process = c("PROC050D", "PROC051D", "PROC100D", 
"PROC103D", "PROC104D", "PROC106D", "PROC106D", "PROC110D", "PROC111D", 
"PROC112D", "PROC113D", "PROC114D", "PROC130D", "PROC131D", "PROC132D", 
"PROC154D", "PROC155D", "PROC156D", "PROC157D", "PROC158D", "PROC159D", 
"PROC160D", "PROC161D", "PROC162D", "PROC163D", "PROC164D", "PROC165D", 
"PROC166D", "PROC170D", "PROC171D", "PROC173D", "PROC174D", "PROC177D", 
"PROC180D", "PROC181D", "PROC182D", "PROC185D", "PROC186D", "PROC187D", 
"PROC190D", "PROC191D", "PROC192D", "PROC196D", "PROC197D", "PROC201D", 
"PROC202D", "PROC203D", "PROC204D", "PROC205D", "PROC206D"), 
    Date = structure(c(15393, 15393, 15393, 15393, 15393, 15393, 
    15393, 15393, 15393, 15393, 15393, 15393, 15393, 15393, 15393, 
    15393, 15393, 15393, 15393, 15393, 15393, 15393, 15393, 15393, 
    15393, 15393, 15393, 15393, 15393, 15393, 15393, 15393, 15393, 
    15393, 15393, 15393, 15393, 15393, 15393, 15393, 15393, 15393, 
    15393, 15393, 15393, 15393, 15393, 15393, 15393, 15393), class = "Date"), 
    Duration = c(30L, 78L, 20L, 15L, 129L, 56L, 156L, 10L, 1656L, 
    1530L, 52L, 9L, 10L, 38L, 48L, 9L, 26L, 90L, 15L, 23L, 13L, 
    9L, 34L, 12L, 11L, 16L, 24L, 11L, 236L, 104L, 9L, 139L, 11L, 
    10L, 22L, 11L, 55L, 35L, 12L, 635L, 44L, 337L, 44L, 9L, 231L, 
    32L, 19L, 170L, 22L, 19L)), .Names = c("Process", "Date", 
"Duration"), row.names = c(NA, 50L), class = "data.frame")

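I am trying to catch outliers in my data using the IQR method. But when I apply it to this data, I also catch values that may be normal. I would like to remove the seasonality from my data points and then apply the outlier rule.

The Process column contains thousands of different processes. I only need to catch process durations that are abnormal. Any ideas on how to remove seasonality from my data set? The code below computes outliers, but because of seasonality, the flagged values may actually be normal. I want to remove the seasonality from my data frame before computing outliers.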

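library(data.table)

setDT(df)  # convert the data.frame to a data.table in place so := works

# per-process quartiles and interquartile range
df[, seventyFifth := quantile(Duration, .75), by = Process]
df[, twentyFifth := quantile(Duration, .25), by = Process]
df[, IQR := seventyFifth - twentyFifth]

# distance above the third quartile
df[, diff := Duration - seventyFifth]

# flag durations more than 3 IQRs above the third quartile
df[, outlier := diff > 3 * IQR]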

2 Answers


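To address possible seasonal patterns, I would first use acf(df$Duration) to look for autocorrelation at different lags. If I saw nothing, I probably wouldn't worry about it unless I had an a priori reason to model it. Your sample data show no evidence of seasonality: apart from the autocorrelation at lag 0, which is always 1, the only correlation is at lag 1, and it is modest.

[image: ACF plot of df$Duration]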
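The same check can be done numerically instead of by eye. This is only a sketch: it treats row order as the time index (as acf(df$Duration) does), copies the Duration values from the question's dput() output, and uses the usual approximate 95% white-noise bound; the choice of 15 lags is arbitrary.

```r
# Duration values copied from the question's dput() output
duration <- c(30, 78, 20, 15, 129, 56, 156, 10, 1656, 1530, 52, 9, 10,
              38, 48, 9, 26, 90, 15, 23, 13, 9, 34, 12, 11, 16, 24, 11,
              236, 104, 9, 139, 11, 10, 22, 11, 55, 35, 12, 635, 44, 337,
              44, 9, 231, 32, 19, 170, 22, 19)

# compute the autocorrelation function without drawing the plot
ac <- acf(duration, lag.max = 15, plot = FALSE)

# approximate 95% bound for a white-noise series (same idea as the
# dashed lines in the default acf() plot)
bound <- qnorm(0.975) / sqrt(length(duration))

lags <- ac$lag[-1]   # drop lag 0, which is always 1
vals <- ac$acf[-1]

# lags whose autocorrelation exceeds the bound in absolute value
significant <- lags[abs(vals) > bound]
print(significant)
```

A seasonal pattern would typically show up as significant spikes at multiples of the period (for example lags 7, 14, ... for a weekly cycle in daily data).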

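An approach that admirably handles not only seasonal components (events that recur cyclically) but also trends (slow shifts in the norm) is stl(), specifically as implemented in this post by Rob J Hyndman.

The decomp function given by Hyndman (reproduced below) is very helpful: it checks for seasonality and then decomposes the time series into seasonal (if present), trend, and remainder components.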

decomp <- function(x,transform=TRUE)
{
  #decomposes time series into seasonal and trend components
  #from http://robjhyndman.com/researchtips/tscharacteristics/
  require(forecast)
  # Transform series
  if(transform & min(x,na.rm=TRUE) >= 0)
  {
    lambda <- BoxCox.lambda(na.contiguous(x))
    x <- BoxCox(x,lambda)
  }
  else
  {
    lambda <- NULL
    transform <- FALSE
  }
  # Seasonal data
  if(frequency(x)>1)
  {
    x.stl <- stl(x,s.window="periodic",na.action=na.contiguous)
    trend <- x.stl$time.series[,2]
    season <- x.stl$time.series[,1]
    remainder <- x - trend - season
  }
  else #Nonseasonal data
  {
    require(mgcv)
    tt <- 1:length(x)
    trend <- rep(NA,length(x))
    trend[!is.na(x)] <- fitted(gam(x ~ s(tt)))
    season <- NULL
    remainder <- x - trend
  }
  return(list(x=x,trend=trend,season=season,remainder=remainder,
    transform=transform,lambda=lambda))
}

As you can see, it uses stl() (which is based on loess) if there is seasonality, and a penalized regression spline if there is none.
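One caveat about the seasonal branch: decomp() only calls stl() when frequency(x) > 1, and a plain numeric column such as df$Duration has frequency 1, so the period has to be declared with ts() first. A minimal base-R sketch of that idea follows. The daily series, the weekly period of 7, and the injected spike are all made up for illustration, and the outlier rule is the "more than 3 IQRs above the third quartile" rule from the question, applied to the remainder.

```r
# Hypothetical daily durations with a weekly pattern (not the asker's data)
set.seed(1)
n <- 7 * 12
weekly <- rep(c(0, 0, 0, 0, 0, 20, 25), length.out = n)

# declare the period so that frequency(x) > 1
x <- ts(50 + weekly + rnorm(n, sd = 3), frequency = 7)
x[40] <- x[40] + 60          # inject one abnormal duration

# seasonal-trend decomposition by loess, with a fixed seasonal pattern
fit <- stl(x, s.window = "periodic")
rem <- as.numeric(fit$time.series[, "remainder"])

# IQR rule on the remainder instead of on the raw durations
q <- unname(quantile(rem, c(0.25, 0.75)))
iqr <- q[2] - q[1]
outliers <- which(rem > q[2] + 3 * iqr)
print(outliers)
```

Because the weekend peaks are absorbed by the seasonal component, they no longer inflate the spread that the IQR rule is measured against. With thousands of processes, the period would need to be chosen (or checked) per process rather than assumed.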

In your case, you could use the function like this:

# fit the model
df.decomp <- decomp(df$Duration)

# add results into df
if (!is.null(df.decomp$season)) {
    df$season <- df.decomp$season
} else {
    df$season <- 0
}
df$trend <- df.decomp$trend
df$Durationsmoothed <- df.decomp$remainder

# if you don't want to detrend
df$Durationsmoothed <- df$Durationsmoothed+df$trend

You should consult the referenced blog post, as it takes this analysis further.

answered 2012-11-05T20:14:33.287

It depends on how predictable or smooth the seasonality is. Is it something where you can make a loose model of it? For example,

LM <- lm(Duration ~ sin(as.numeric(Date)) + cos(as.numeric(Date)), data = df)

Or some variation. Then you can analyze data only as far as they differ from the predicted seasonality:

P <- predict(LM)
DIF <- P - df$Duration

Then you could apply the IQR rule to DIF. And speaking of differences, you may get some helpful information by sorting the data by Date and using diff().

df <- df[order(df$Date),]
DIF2 <- diff(df$Duration)  # first differences of the durations in date order
plot(DIF2)

Theoretically, DIF2 should be the derivative of the function produced in LM.
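To make the harmonic-regression idea above concrete, here is a rough sketch with an explicit period. Everything in it is an assumption for illustration: the simulated data, the weekly period of 7 days, and the injected spike. Note that it uses residuals (actual minus predicted), which is the opposite sign of DIF above, so unusually long durations appear as large positive values.

```r
# Hypothetical daily durations with a smooth weekly cycle (not the asker's data)
set.seed(42)
period <- 7                                   # assumed cycle length in days
dates <- as.Date("2012-01-01") + 0:83
t <- as.numeric(dates)
duration <- 50 + 10 * sin(2 * pi * t / period) + rnorm(length(t), sd = 2)
duration[30] <- duration[30] + 40             # inject one abnormal duration
sim <- data.frame(Date = dates, Duration = duration)

# loose seasonal model: one sine/cosine pair at the assumed period
fit <- lm(Duration ~ sin(2 * pi * as.numeric(Date) / period) +
                     cos(2 * pi * as.numeric(Date) / period),
          data = sim)

# what the seasonal model cannot explain
res <- residuals(fit)

# IQR rule on the residuals: more than 3 IQRs above the third quartile
q <- unname(quantile(res, c(0.25, 0.75)))
flagged <- unname(which(res > q[2] + 3 * (q[2] - q[1])))
print(flagged)
```

If the cycle is not close to sinusoidal, additional harmonics (sine/cosine pairs at period/2, period/3, ...) could be added, at the cost of a less "loose" model.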

As a side note, I would not recommend taking a very systematic approach (i.e., loading a package and calling BlindlyGetRidOfOutliersAdjustingForSeasonality(df)) if the seasonality is indeed complex.

answered 2012-11-05T20:00:46.737