
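Suppose I have a dataset of units that can switch their status from active to inactive over time. I want to record the date of the switch each time a unit goes from active to inactive. A reproducible example: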

    UNIT <- c(100, 100, 200, 200, 200, 200, 200, 300, 300, 300, 300)
    STATUS <- c('ACTIVE', 'INACTIVE', 'ACTIVE', 'ACTIVE', 'INACTIVE', 'ACTIVE',
                'INACTIVE', 'ACTIVE', 'ACTIVE', 'ACTIVE', 'INACTIVE')
    TERMINATED <- c('1999-07-06', '2008-12-05', '2000-08-18', '2000-08-18',
                    '2000-08-18', '2008-08-18', '2008-08-18', '2006-09-19',
                    '2006-09-19', '2006-09-19', '1999-03-15')
    START <- c('2007-04-23', '2008-12-06', '2004-06-01', '2007-02-01',
               '2008-04-19', '2010-11-29', '2010-12-30', '2007-10-29',
               '2008-02-05', '2008-06-30', '2009-02-07')
    STOP <- c('2008-12-05', '4712-12-31', '2007-01-31', '2008-04-18',
              '2010-11-28', '2010-12-29', '4712-12-31', '2008-02-04',
              '2008-06-29', '2009-02-06', '4712-12-31')
    DAT <- data.frame(UNIT, STATUS, TERMINATED, START, STOP)
    DAT
       UNIT   STATUS TERMINATED      START       STOP
    1   100   ACTIVE 1999-07-06 2007-04-23 2008-12-05
    2   100 INACTIVE 2008-12-05 2008-12-06 4712-12-31
    3   200   ACTIVE 2000-08-18 2004-06-01 2007-01-31
    4   200   ACTIVE 2000-08-18 2007-02-01 2008-04-18
    5   200 INACTIVE 2000-08-18 2008-04-19 2010-11-28
    6   200   ACTIVE 2008-08-18 2010-11-29 2010-12-29
    7   200 INACTIVE 2008-08-18 2010-12-30 4712-12-31
    8   300   ACTIVE 2006-09-19 2007-10-29 2008-02-04
    9   300   ACTIVE 2006-09-19 2008-02-05 2008-06-29
    10  300   ACTIVE 2006-09-19 2008-06-30 2009-02-06
    11  300 INACTIVE 1999-03-15 2009-02-07 4712-12-31

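When a unit's status changes from ACTIVE to INACTIVE, the unit has been terminated. Unfortunately, the recorded termination date (TERMINATED) is not reliable. The valid termination date is the START date of the record that follows the switch from active to inactive (the one where STATUS == INACTIVE) minus 1 day; in other words, the STOP date of the preceding ACTIVE record. For example, for unit 100 the TERMINATED date in row 2 is correct. For unit 300, however, the termination date should be "2009-02-06". The solution should be robust enough to recognize that unit 200 has two inactive periods and fill in the dates accordingly.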

I don't even know where to start with something like this in R.

The final result should look like this:

       UNIT   STATUS TERMINATED      START       STOP
    1   100   ACTIVE 2008-12-05 2007-04-23 2008-12-05
    2   100 INACTIVE 2008-12-05 2008-12-06 4712-12-31
    3   200   ACTIVE 2008-04-18 2004-06-01 2007-01-31
    4   200   ACTIVE 2008-04-18 2007-02-01 2008-04-18
    5   200 INACTIVE 2008-04-18 2008-04-19 2010-11-28
    6   200   ACTIVE 2010-12-29 2010-11-29 2010-12-29
    7   200 INACTIVE 2010-12-29 2010-12-30 4712-12-31
    8   300   ACTIVE 2009-02-06 2007-10-29 2008-02-04
    9   300   ACTIVE 2009-02-06 2008-02-05 2008-06-29
    10  300   ACTIVE 2009-02-06 2008-06-30 2009-02-06
    11  300 INACTIVE 2009-02-06 2009-02-07 4712-12-31

1 Answer


I haven't spent much time on this, but I think you should be able to get what you need as follows.

  1. Convert your dates to an actual date format.

    ## Use a real date format
    DAT[-c(1, 2)] <- lapply(DAT[-c(1, 2)], as.Date)
    
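  2. Create "groups" from the combination of UNIT and STATUS, starting a new group whenever that combination changes from one row to the next.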

    ## Identify the "groups" of "ACTIVE" and "INACTIVE"
    ##    by a combination of the first two columns
    RLE <- rle(do.call(paste, DAT[1:2]))$lengths
    RLES <- rep(seq_along(RLE), RLE)
    RLES
    # [1] 1 2 3 3 4 5 6 7 7 7 8
    

    Here you can see that row 1 belongs to the first "group", row 2 to the second "group", rows 3 and 4 to the third, and so on.

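  3. Replace the current TERMINATED column.

    Using the grouping stored in RLES, we can use ave to create a vector with one entry per row, each holding the latest STOP date within that row's group.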

    ## Use that grouping to create a partially corrected
    ##   "TERMINATED" column
    DAT$TERMINATED <- ave(DAT$STOP, RLES, FUN = max)
    
  4. Fix the TERMINATED values where STATUS == "INACTIVE".

    Based on your description, the value here should equal the value in the START column minus 1 day.

    ## Identify the rows where STATUS == "INACTIVE"
    IRows <- which(DAT$STATUS == "INACTIVE")
    ## Since you have a real date format, you can
    ##    simply use "-1" to adjust the TERMINATED date
    ##    using the value from the "START" date
    DAT[IRows, "TERMINATED"] <- DAT[IRows, "START"] - 1
    
  5. Check the result.

    DAT
    #    UNIT   STATUS TERMINATED      START       STOP
    # 1   100   ACTIVE 2008-12-05 2007-04-23 2008-12-05
    # 2   100 INACTIVE 2008-12-05 2008-12-06 4712-12-31
    # 3   200   ACTIVE 2008-04-18 2004-06-01 2007-01-31
    # 4   200   ACTIVE 2008-04-18 2007-02-01 2008-04-18
    # 5   200 INACTIVE 2008-04-18 2008-04-19 2010-11-28
    # 6   200   ACTIVE 2010-12-29 2010-11-29 2010-12-29
    # 7   200 INACTIVE 2010-12-29 2010-12-30 4712-12-31
    # 8   300   ACTIVE 2009-02-06 2007-10-29 2008-02-04
    # 9   300   ACTIVE 2009-02-06 2008-02-05 2008-06-29
    # 10  300   ACTIVE 2009-02-06 2008-06-30 2009-02-06
    # 11  300 INACTIVE 2009-02-06 2009-02-07 4712-12-31
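
If this needs to run on more than one data frame, the steps above could be collected into a single function. The following is only a sketch: `fix_terminated` is a hypothetical name that does not come from the question, and it assumes the rows are already sorted by UNIT and then by START, because `rle` only detects runs of consecutive rows. The example data is rebuilt inside the block so that it runs on its own.

```r
# Rebuild the example data from the question so the sketch is self-contained.
DAT <- data.frame(
  UNIT = c(100, 100, 200, 200, 200, 200, 200, 300, 300, 300, 300),
  STATUS = c("ACTIVE", "INACTIVE", "ACTIVE", "ACTIVE", "INACTIVE", "ACTIVE",
             "INACTIVE", "ACTIVE", "ACTIVE", "ACTIVE", "INACTIVE"),
  TERMINATED = c("1999-07-06", "2008-12-05", "2000-08-18", "2000-08-18",
                 "2000-08-18", "2008-08-18", "2008-08-18", "2006-09-19",
                 "2006-09-19", "2006-09-19", "1999-03-15"),
  START = c("2007-04-23", "2008-12-06", "2004-06-01", "2007-02-01",
            "2008-04-19", "2010-11-29", "2010-12-30", "2007-10-29",
            "2008-02-05", "2008-06-30", "2009-02-07"),
  STOP = c("2008-12-05", "4712-12-31", "2007-01-31", "2008-04-18",
           "2010-11-28", "2010-12-29", "4712-12-31", "2008-02-04",
           "2008-06-29", "2009-02-06", "4712-12-31"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not from the original answer) wrapping steps 1 to 4.
# Assumes rows are ordered by UNIT and then by START.
fix_terminated <- function(dat) {
  # Step 1: convert the three date columns from text to Date
  date_cols <- c("TERMINATED", "START", "STOP")
  dat[date_cols] <- lapply(dat[date_cols], as.Date)

  # Step 2: one id per run of consecutive rows sharing UNIT and STATUS
  runs <- rle(paste(dat$UNIT, dat$STATUS))$lengths
  run_id <- rep(seq_along(runs), runs)

  # Step 3: latest STOP date within each run
  dat$TERMINATED <- ave(dat$STOP, run_id, FUN = max)

  # Step 4: INACTIVE rows take their own START date minus one day
  inactive <- dat$STATUS == "INACTIVE"
  dat$TERMINATED[inactive] <- dat$START[inactive] - 1

  dat
}

fixed <- fix_terminated(DAT)
fixed
```

Printing `fixed` should reproduce the table from step 5. If the input might arrive unsorted, sorting it first (for example with `DAT[order(DAT$UNIT, DAT$START), ]`) would be a reasonable precaution before calling the helper.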
    
answered 2013-03-23T07:38:14.413