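The first point is that your data has a lot of NAs, in case you want to look into that. If I understand correctly, you want the count of consecutive 0's, followed by the count of consecutive non-zeros, then zeros, non-zeros and so on, for each date.
This can of course be done with rle, as @mnel mentioned in the comments. But there are quite a few pitfalls.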
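As a quick illustration of what rle returns, here is a minimal sketch on a made-up vector (the values are hypothetical, not taken from your file). Comparing against 0 first turns the readings into TRUE/FALSE, so rle reports the length of each zero or non-zero stretch:

```r
# Toy discharge readings (hypothetical values for illustration only)
discharge <- c(0, 0, 0, 1.2, 0.8, 0, 0, 2.5)

# TRUE where the reading is zero, then run-length encode that indicator
runs <- rle(discharge == 0)

runs$lengths  # 3 2 2 1
runs$values   # TRUE FALSE TRUE FALSE
```

So this vector has 3 zeros, then 2 non-zeros, then 2 zeros, then 1 non-zero.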
First, I'll set up the data, keeping only the non-NA entries:
flow <- read.csv("~/Downloads/sampledataflow.csv")
names(flow) <- c("Date","discharge")
flow <- flow[1:33119, ] # remove NA entries
# format Date to POSIXct to play nice with data.table
flow$Date <- as.POSIXct(flow$Date, format="%m/%d/%Y %H:%M")
Next, I'll create a Date column (g1) that holds just the calendar day:
flow$g1 <- as.Date(flow$Date)
Finally, I prefer using data.table, so here's a solution using it.
# load package, get data as data.table and set key
require(data.table)
flow.dt <- data.table(flow)
# set key to both "Date" and "g1" (even though we'll only group by g1)
# to make sure that the order of rows is not changed (during sort)
setkey(flow.dt, "Date", "g1")
# group by g1, turn discharge into TRUE/FALSE by comparing to 0 and get rle lengths;
# val ends up as 0 for a run of zeros and 1 for a run of non-zeros
out <- flow.dt[, list(duration = rle(discharge == 0)$lengths,
                      val = rle(discharge == 0)$values + 1), by=g1][val == 2, val := 0]
> out # just to show a few first and last entries
# g1 duration val
# 1: 2010-05-31 120 0
# 2: 2010-06-01 722 0
# 3: 2010-06-01 138 1
# 4: 2010-06-01 32 0
# 5: 2010-06-01 79 1
# ---
# 98: 2010-06-22 291 1
# 99: 2010-06-22 423 0
# 100: 2010-06-23 664 0
# 101: 2010-06-23 278 1
# 102: 2010-06-23 379 0
So, for example, for 2010-06-01 there are 722 0's followed by 138 non-zeros, then 32 0's followed by 79 non-zeros, and so on...
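If you want to sanity-check the logic without the CSV or data.table, the same per-day run-length idea can be sketched in base R on a small made-up data frame. The toy data frame and the names toy, runs_by_day and toy_out are assumptions for illustration only; the column names g1, duration and val mirror the ones used above:

```r
# Hypothetical toy data: two days with a few readings each
toy <- data.frame(
  g1 = as.Date(c(rep("2010-06-01", 5), rep("2010-06-02", 4))),
  discharge = c(0, 0, 1.5, 2.0, 0, 0, 3.1, 0, 0)
)

# Run-length encode the zero / non-zero indicator separately for each day
runs_by_day <- lapply(split(toy, toy$g1), function(d) {
  r <- rle(d$discharge == 0)
  data.frame(g1 = d$g1[1],
             duration = r$lengths,
             val = as.integer(!r$values))  # 0 = run of zeros, 1 = run of non-zeros
})

# Stack the per-day results into one data frame
toy_out <- do.call(rbind, runs_by_day)
rownames(toy_out) <- NULL
toy_out
```

Because the data is split by day before rle is applied, a run of zeros that spans midnight is counted as two separate runs, one per date, which is also what grouping by g1 does in the data.table version.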