0

我正在使用一个由不同类别的事件日期组成的日期框架。我的目标是排除与之前的观察相比出现在 80 天内的同一类别的日期。例如:

observation Date       category

1           2015-05-05 green

2           2015-06-08 green #(this should be excluded)

3           2015-09-30 green

4           2014-06-30 red

5           2014-07-30 red    #(this should be excluded)

6           2014-09-30 red    #(this should not be excluded, since it is +80 days from the first obs.)

7           2013-01-01 blue

8           2013-02-01 blue   #(this should be excluded)

9           2013-06-01 blue
                     
10          2013-07-01 blue   #(this should be excluded)

与之前的观察相比,我试图排除在 80 天内出现的同一类别的日期(因为在我的研究中,这被认为与之前的观察相同)。然而,我希望排除可能在观察后 80 天内出现的观察结果。例如,观察 6 将出现在观察 5 的 80 天内,应该排除,因为它出现在 obs 的 80 天内。4这是第一个obs。在那个类别中。你会明白我的目标是什么:D

我在想我可以通过 group_by 函数来做到这一点,然后计算所有门中每个观察之间的天数差异。然而问题在于,在我的示例中,它还会排除像观察 6 这样的日期。

我将非常感谢有关如何以最聪明的方式做到这一点的提示。我尝试搜索以前的主题,但找不到任何有用的东西。

问候阿列克西

编辑:使用 Merijn van Tilborgs 代码的结果示例:

Date       Diff_days remove1 remove2
2015-06-29 119       FALSE FALSE
2015-07-09 7         FALSE TRUE
2015-07-15 6         FALSE TRUE
2015-08-18 34        FALSE TRUE
2015-10-03 46        FALSE TRUE

在此示例中,应保存最后一次观察,因为它距离实际保存的最后一次观察有 +80 天 (2015-06-29)。

Edit2:benimwolfspelz 提出的迭代策略:我用来计算同一类别中每个观察之间的天数的代码:df2 <- df %>%

  • arrange(Date) %>%  
    
  •           group_by(category) %>% 
    
  •           mutate(diff_date = c(0,diff(Date)))
    
4

1 回答 1

0

我稍微扩展了测试集,但是如果我错过了某些情况,您必须对其进行测试。也许可以一步完成,但我只设法在两个临时删除列中完成。

dt <- structure(list(Date = structure(c(16560, 16594, 16708, 16251, 
16281, 16343, 15706, 15737, 15857, 15887, 15888, 16252, 16617, 
16648), class = "Date"), category = c("green", "green", "green", 
"red", "red", "red", "blue", "blue", "blue", "blue", "blue", 
"blue", "blue", "blue"), message = c(NA, "this should be excluded", 
NA, NA, "this should be excluded", "this should not be excluded", 
NA, "this should be excluded", NA, "this should be excluded", 
"this should be excluded", "this should not be excluded", "this should not be excluded", 
NA)), row.names = c(NA, -14L), spec = structure(list(cols = list(
    Date = structure(list(format = ""), class = c("collector_date", 
    "collector")), category = structure(list(), class = c("collector_character", 
    "collector")), message = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), skip = 1L), class = "col_spec"), class = c( 
"data.frame"))

library(lubridate)
library(data.table)
setDT(dt)

dt[, remove1 := lead(Date) - lag(Date) > days(80) & Date - lag(Date) < days(80), by = category][is.na(remove1), remove1 := F]
dt[, remove2 := lag(remove1) == F & Date - lag(Date) < days(80), by = category][is.na(remove2), remove2 := F]

dt

#           Date category                     message remove1 remove2
#  1: 2015-05-05    green                        <NA>   FALSE   FALSE
#  2: 2015-06-08    green     this should be excluded    TRUE    TRUE
#  3: 2015-09-30    green                        <NA>   FALSE   FALSE
#  4: 2014-06-30      red                        <NA>   FALSE   FALSE
#  5: 2014-07-30      red     this should be excluded    TRUE    TRUE
#  6: 2014-09-30      red this should not be excluded   FALSE   FALSE
#  7: 2013-01-01     blue                        <NA>   FALSE   FALSE
#  8: 2013-02-01     blue     this should be excluded    TRUE    TRUE
#  9: 2013-06-01     blue                        <NA>   FALSE   FALSE
# 10: 2013-07-01     blue     this should be excluded   FALSE    TRUE
# 11: 2013-07-02     blue     this should be excluded    TRUE    TRUE
# 12: 2014-07-01     blue this should not be excluded   FALSE   FALSE
# 13: 2015-07-01     blue this should not be excluded   FALSE   FALSE
# 14: 2015-08-01     blue                        <NA>   FALSE    TRUE

dt[!(remove1 == T | remove2 == T)]

#          Date category                     message remove1 remove2
# 1: 2015-05-05    green                        <NA>   FALSE   FALSE
# 2: 2015-09-30    green                        <NA>   FALSE   FALSE
# 3: 2014-06-30      red                        <NA>   FALSE   FALSE
# 4: 2014-09-30      red this should not be excluded   FALSE   FALSE
# 5: 2013-01-01     blue                        <NA>   FALSE   FALSE
# 6: 2013-06-01     blue                        <NA>   FALSE   FALSE
# 7: 2014-07-01     blue this should not be excluded   FALSE   FALSE
# 8: 2015-07-01     blue this should not be excluded   FALSE   FALSE
于 2021-11-22T13:33:00.420 回答