
With sebastian-c's help, I solved my problem with daily data. See: R ifelse condition: frequency of continuous NA

Now I have a data set with hourly data:

set.seed(1234)
day <- rep(1:2, each=24)
hr <- rep(0:23, 2)
v <- rep(NA, 48)
A <- data.frame(cbind(day, hr, v))
A$v <- sample(c(NA, rnorm(100)), nrow(A), prob=c(0.5, rep(0.5/100, 100)), replace=TRUE)

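What I need to do is: if a day has >= 4 consecutive missing values during the daytime (7AM-7PM) or >= 3 consecutive missing values during the nighttime (7PM-7AM), delete that entire day from the data frame; otherwise just run linear interpolation. So the second day should be removed from the data frame completely, because it has 4 consecutive NAs during the daytime (7AM to 10AM). The result should preferably remain a data frame. Please help, thanks!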


1 Answer


First, I modify the NA_run function from the question you linked so that it reads a column named v instead of value and returns a boolean rather than the data.frame:

NA_run <- function(x, maxlen){
  # Run-length encode the NA indicator of column v
  runs <- rle(is.na(x$v))
  # TRUE if any run of NAs is at least maxlen long
  any(runs$lengths[runs$values] >= maxlen)
}
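
To see why indexing runs$lengths by runs$values isolates the NA runs, here is a small illustration on a made-up vector. The vector x and the variable names below are assumptions for demonstration only, not part of the question's data:

```r
# Toy vector (hypothetical values): one NA run of length 2, one of length 4
x <- c(1, NA, NA, 3, NA, NA, NA, NA, 5)

runs <- rle(is.na(x))
runs$lengths   # 1 2 1 4 1
runs$values    # FALSE TRUE FALSE TRUE FALSE

# Keep only the lengths of runs where is.na() was TRUE
na_run_lengths <- runs$lengths[runs$values]   # 2 4

# Same test that NA_run performs, with maxlen = 4
has_long_run <- any(na_run_lengths >= 4)      # TRUE
```

If a vector has no NA at all, the subset is empty and any() on it returns FALSE, so such a day is never flagged.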

I can then write a wrapper function to call it twice for daytime and nighttime:

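dropfun <- function(x) {
  # Daytime mask covers hr 8..18. Note that hr > 7 leaves the 07:00 row
  # in the night group; use hr >= 7 if 7AM should count as daytime.
  dt <- x$hr > 7 & x$hr < 19
  daytime <- NA_run(x[dt,], 4)
  nighttime <- NA_run(x[!dt,], 3)

  # TRUE if either period has too long a run of NAs
  any(daytime, nighttime)
}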
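
One thing that may be worth checking: because x[!dt,] stacks the early-morning rows directly above the evening rows of the same day, a run of NAs ending at hr 7 and another starting at hr 19 are counted as a single run. Whether that matters depends on how you define a night. The sketch below uses a made-up day (toy is hypothetical, as are the early and late checks) to show the effect and one possible way to test the two blocks separately:

```r
NA_run <- function(x, maxlen) {
  runs <- rle(is.na(x$v))
  any(runs$lengths[runs$values] >= maxlen)
}

# Toy day (made-up values): NAs only at hr 6, 7 and 19
toy <- data.frame(hr = 0:23, v = 1)
toy$v[toy$hr %in% c(6, 7, 19)] <- NA

dt <- toy$hr > 7 & toy$hr < 19   # same mask as in dropfun
night <- toy[!dt, ]              # rows hr 0..7 followed by hr 19..23

# hr 6, 7 and 19 are adjacent after subsetting, so this is a run of 3
joined <- NA_run(night, 3)                 # TRUE

# Checking each block on its own stops runs from bridging the daytime gap
early <- NA_run(toy[toy$hr <= 7, ], 3)     # FALSE (run of 2)
late  <- NA_run(toy[toy$hr >= 19, ], 3)    # FALSE (run of 1)
```

If a night is meant to run from 7PM of one day to 7AM of the next, grouping by day alone would not capture it either way, and the grouping variable would need to change.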

Applying it to each day with ddply (from the plyr package, so load it with library(plyr) first) gives a data.frame that flags the days to drop:

> ddply(A, .(day), dropfun)
  day    V1
1   1  TRUE
2   2 FALSE
> 

Since you want to end up with a data frame, dropfun can instead return NULL for a day that should be dropped and the day's rows unchanged otherwise:

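dropfun <- function(x) {
  # Daytime mask covers hr 8..18. Note that hr > 7 leaves the 07:00 row
  # in the night group; use hr >= 7 if 7AM should count as daytime.
  dt <- x$hr > 7 & x$hr < 19
  daytime <- NA_run(x[dt,], 4)
  nighttime <- NA_run(x[!dt,], 3)

  # NULL removes the day from ddply's result; otherwise keep its rows
  if(any(daytime, nighttime)) NULL else x
}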

> ddply(A, .(day), dropfun)
   day hr           v
1    2  0          NA
2    2  1          NA
3    2  2  2.54899107
4    2  3          NA
5    2  4 -0.03476039
6    2  5          NA
7    2  6  0.65658846
8    2  7  0.95949406
9    2  8          NA
10   2  9  1.08444118
11   2 10  0.95949406
12   2 11          NA
13   2 12 -1.80603126
14   2 13          NA
15   2 14          NA
16   2 15  0.97291675
17   2 16          NA
18   2 17          NA
19   2 18          NA
20   2 19 -0.29429386
21   2 20  0.87820363
22   2 21          NA
23   2 22  0.56305582
24   2 23 -0.11028549
> 
Answered 2012-08-17T19:23:26.697
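
The question also asks for linear interpolation on the days that are kept, which the code above does not do. Here is a minimal sketch using base R's approx(), assuming interpolation is done within each day over hr and that leading or trailing NAs (which have no neighbour on one side) are left as NA. The names interp_day and toy are hypothetical:

```r
# Hypothetical helper: fill interior NAs in v by linear interpolation over hr
interp_day <- function(x) {
  ok <- !is.na(x$v)
  # approx() needs at least two observed points to interpolate
  if (sum(ok) >= 2) {
    # rule = 1 returns NA outside the range of the observed hours
    x$v <- approx(x$hr[ok], x$v[ok], xout = x$hr, rule = 1)$y
  }
  x
}

# Toy day (made-up values)
toy <- data.frame(day = 1, hr = 0:5, v = c(NA, 1, NA, NA, 4, NA))
filled <- interp_day(toy)
filled$v   # NA 1 2 3 4 NA
```

With plyr loaded, something like ddply(ddply(A, .(day), dropfun), .(day), interp_day) should chain the two steps. The zoo package's na.approx is a common alternative if you prefer not to write the helper yourself.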