I have a large data set that looks like this:

set.seed(1234)
id <- c(3, 3, 3, 5, 5, 7)
amount <- c(24, 48, 60, 84, 96, 175)
start <- as.Date(c("2006-01-01", "2009-12-09", "2010-01-01", "2006-04-24", "2009-12-09", "2009-05-01"))
end <- as.Date(c("2010-01-01", "2010-01-01", "2010-01-01", "2009-12-09", "2009-12-09", "2009-05-01"))
noise <- rnorm(6)
test <- data.frame(id, amount, start, end, noise)

  id amount      start        end      noise
   3     24 2006-01-01 2010-01-01  0.4978505
   3     48 2009-12-09 2010-01-01 -1.9666172
   3     60 2010-01-01 2010-01-01  0.7013559
   5     84 2006-04-24 2009-12-09 -0.4727914
   5     96 2009-12-09 2009-12-09 -1.0678237
   7    175 2009-05-01 2009-05-01 -0.2179749

But it needs to look like this:

  id amount      start        end      noise   switch
   3     24 2006-01-01 2009-12-09  0.4978505        0
   3     48 2009-12-09 2010-01-01 -1.9666172        1
   3     60 2010-01-01 2010-01-01  0.7013559        2
   5     84 2006-04-24 2009-12-09 -0.4727914        0 
   5     96 2009-12-09 2009-12-09 -1.0678237        1
   7    175 2009-05-01 2009-05-01 -0.2179749        0

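That is, within each id I want to lag the values of start and use them to replace the values of end: each end should become the next row's start, and the last row of each id keeps its own end. Secondly, I would like to create a new variable called "switch" that counts the number of times amount changes within an id, with the first observation as the initial condition, == 0. I tried using ts() to create the lag, which in principle does what I want, although it yields a ts object rather than dates: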

       out <- cbind(as.ts(test$start),lag(test$start))
       colnames(out) <- c("start","end")
       cbind(as.ts(test$start),lag(test$start))

         as.ts(test$start) lag(test$start)
            NA           13149
          13149           14587
          14587           14610
          14610           13262
          13262           14587
          14587           14365
          14365              NA

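So the lag(test$start) column is the end result I am after, but it has to be computed separately for each id. So I tried to wrap it in a function and apply it by id: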

        #make it a function 
        lagfun <- function(x){
          cbind(as.ts(x),lag(x))
        }

        y <- unlist(tapply(start,id,lagfun))     

This is where things get really ugly. Is there a better way to approach this?

1 Answer

If you put your time series in a data.table, you can do this in a single statement:

testDT[ , c("end", "switch") := 
          list( c(tail(start, -1), tail(end, 1)), cumsum(c(0, diff(amount) != 0)))
      , by=id]
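
As a quick check of what the two expressions inside list() compute, here is a minimal sketch on plain vectors, using only the three rows of id 3 from the question. The names s, e, a, new_end and sw are made up for this illustration and are not part of the original data.

```r
# Rows of id 3 from the question; s, e and a are illustrative names only
s <- as.Date(c("2006-01-01", "2009-12-09", "2010-01-01"))  # start
e <- as.Date(c("2010-01-01", "2010-01-01", "2010-01-01"))  # end
a <- c(24, 48, 60)                                         # amount

# Drop the first start and append the last end:
# each row gets the next row's start, the last row keeps its own end
new_end <- c(tail(s, -1), tail(e, 1))

# diff(a) != 0 flags every change in amount; the leading 0 is the
# initial condition, and cumsum() turns the flags into a running count
sw <- cumsum(c(0, diff(a) != 0))

print(new_end)
print(sw)
```

With by=id, data.table evaluates the same two expressions once per group, so the shift never crosses from one id into the next.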

Here it is broken down:

# create your data.table object 
library(data.table)
testDT <- data.table(test)


# Modify `end`: drop the first `start` (so each row gets the next row's
#   start) and append the final date from `end` for the last row.
#   Do this `by=id`
testDT[, end := c(tail(start, -1), tail(end, 1)), by=id]

# Count the number of times that each amount differs from the
#  previous amount value.
# Start this vector at 0, and take the cumulative sum.
#  Also do this by id
testDT[, switch := cumsum(c(0, diff(amount) != 0)), by=id]

# this is the final result. 
testDT
   id amount      start        end      noise switch
1:  3     24 2006-01-01 2009-12-09 -1.2070657      0
2:  3     48 2009-12-09 2010-01-01  0.2774292      1
3:  3     60 2010-01-01 2010-01-01  1.0844412      2
4:  5     84 2006-04-24 2009-12-09 -2.3456977      0
5:  5     96 2009-12-09 2009-12-09  0.4291247      1
6:  7    175 2009-05-01 2009-05-01  0.5060559      0
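
If adding data.table is not an option, the same idea can probably be done in base R. The following is only a sketch, and it assumes the rows are already sorted by id and then by start, as they are in the example. The names n, last_in_id and idx are introduced here for illustration, and the noise column is left out because it plays no part in the calculation.

```r
# Rebuild the example data from the question (noise column omitted)
test <- data.frame(
  id     = c(3, 3, 3, 5, 5, 7),
  amount = c(24, 48, 60, 84, 96, 175),
  start  = as.Date(c("2006-01-01", "2009-12-09", "2010-01-01",
                     "2006-04-24", "2009-12-09", "2009-05-01")),
  end    = as.Date(c("2010-01-01", "2010-01-01", "2010-01-01",
                     "2009-12-09", "2009-12-09", "2009-05-01"))
)

# Assumption: rows are sorted by id, then by start
n <- nrow(test)

# TRUE for the last row of each id (the next row belongs to another id)
last_in_id <- c(test$id[-1] != test$id[-n], TRUE)

# Rows that are followed by another row of the same id
idx <- which(!last_in_id)

# Those rows take the next row's start; last rows keep their own end
test$end[idx] <- test$start[idx + 1]

# ave() applies the function within each id and returns the results
# in the original row order
test$switch <- ave(test$amount, test$id,
                   FUN = function(a) cumsum(c(0, diff(a) != 0)))

print(test)
```

Indexing by row position instead of splitting the Date column keeps end as a Date throughout. The switch count uses the same cumsum(c(0, diff(amount) != 0)) expression as the data.table version.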
answered 2013-04-06T06:04:31.010