
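I want to identify the IDs in a dataset that have newly developed a disease. The dataset takes the form of a diary in which people answer a yes/no question each day about whether they have the disease.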

ID <- c(1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3)
Date <- c("2020-03-10","2020-03-11","2020-03-12","2020-03-13","2020-03-14","2020-03-12","2020-03-13","2020-03-14","2020-03-15","2020-03-16","2020-03-17","2020-03-18", "2020-03-12","2020-03-13","2020-03-14","2020-03-15","2020-03-16","2020-03-17","2020-03-18","2020-03-19","2020-03-20")
Disease <- c("No","No","Yes","Yes","Yes","No","No","No", "Yes","Yes","Yes","No","Yes","Yes","No","No","No","Yes","Yes","Yes","Yes")

df <- data.frame(ID, Date, Disease)

df
ID   Date         Disease
1    2020-03-10   No
1    2020-03-11   No
1    2020-03-12   Yes
1    2020-03-13   Yes
1    2020-03-14   Yes
2    2020-03-12   No
2    2020-03-13   No
2    2020-03-14   No
2    2020-03-15   Yes
2    2020-03-16   Yes
2    2020-03-17   Yes
2    2020-03-18   No
3    2020-03-12   Yes
3    2020-03-13   Yes
3    2020-03-14   No
3    2020-03-15   No
3    2020-03-16   No
3    2020-03-17   Yes
3    2020-03-18   Yes
3    2020-03-19   Yes
3    2020-03-20   Yes

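However, to qualify as a "newly developed disease", a person must meet the following conditions:
1. The person must answer "Yes" on at least two consecutive days.
2. The person must have answered "No" on at least 3 consecutive days before the first "Yes".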

As output, I would like the number of people who meet these conditions. In the extract of the dataset above, that would be two (IDs 2 and 3).

Does anyone know how to achieve this? Thanks in advance for your time!


2 Answers


A slightly messy way to do this is with the dplyr::lag() function.

 library(tidyverse)
 library(lubridate)
 df %>% 
    mutate(Date = ymd(Date)) %>%
    group_by(ID) %>% 
    # Disease answers from the previous 1 to 4 days, per person
    mutate(day_1 = lag(Disease, 1, order_by = Date), 
           day_2 = lag(Disease, 2, order_by = Date), 
           day_3 = lag(Disease, 3, order_by = Date), 
           day_4 = lag(Disease, 4, order_by = Date)) %>% 
    # today and yesterday "Yes", the three days before that "No"
    filter(Disease == "Yes" & day_1 == "Yes" & day_2 == "No" & day_3 == "No" & day_4 == "No") %>%
    # drop the grouping so that n() counts people rather than rows per person
    ungroup() %>%
    distinct(ID) %>% 
    summarise("Number of patients matching the condition" = n())

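This groups the rows by ID, so every calculation is done separately for each person. It then pulls the Disease values of the previous 4 days into the columns day_1 through day_4 and checks whether each row in the dataset meets the condition. Finally, it takes the unique IDs and counts them. Note that this assumes one diary entry per person per day, with no gaps in the dates.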
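
To make the lagged columns concrete, here is a minimal sketch for a single person (the answers of ID 2 from the question, assumed to be in date order already). The condition shown is one way to express the question's pattern of three "No" followed by two "Yes"; the names `disease`, `lagged` and `hit` are only for illustration.

```r
library(dplyr)

# Diary answers of ID 2 from the question, in date order (illustrative copy)
disease <- c("No", "No", "No", "Yes", "Yes", "Yes", "No")

lagged <- data.frame(
  Disease = disease,
  day_1 = dplyr::lag(disease, 1),  # answer one day earlier
  day_2 = dplyr::lag(disease, 2),  # answer two days earlier
  day_3 = dplyr::lag(disease, 3),
  day_4 = dplyr::lag(disease, 4)
)

# TRUE on the second "Yes" of a run that follows three "No" answers
hit <- with(lagged, Disease == "Yes" & day_1 == "Yes" &
              day_2 == "No" & day_3 == "No" & day_4 == "No")

which(hit)  # row 5, the second "Yes" day of ID 2
```

The first four rows have NA in at least one lag column, so they can never match; this is why a person needs at least five diary days to be counted with this approach.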

answered 2020-05-07T10:16:55.527

This may be a compact way to detect a pattern in the Disease column. It is based on a similar answer given here:

https://stackoverflow.com/a/41131260/3460670

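Define the pattern you want (in this case, 3 "No" followed by 2 "Yes"). Then filter for the rows that match this pattern. This uses shift from data.table because it accepts a vector for n, which Map needs here, whereas lead from dplyr requires n to be of length 1.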

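library(tidyverse)
library(data.table)

# Pattern to look for: three "No" followed by two "Yes"
pattern <- c("No", "No", "No", "Yes", "Yes")

df %>%
  group_by(ID) %>%
  # shift() returns one lead-shifted copy of Disease per pattern position;
  # Map() compares each copy with its pattern element, Reduce() ANDs the results
  filter(Reduce("&", Map("==", shift(Disease, n = 0:(length(pattern) - 1), type = "lead"), pattern))) %>% 
  ungroup() %>%
  # count the people who have at least one matching row
  summarise(Total = n_distinct(ID))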
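
To see what the filter expression evaluates to, here is a small sketch of the same shift/Map/Reduce steps on a single vector (the answers of ID 2 from the question, assumed to be in date order). The names `disease`, `shifted` and `matches` are only for illustration.

```r
library(data.table)

pattern <- c("No", "No", "No", "Yes", "Yes")

# Diary answers of ID 2 from the question, in date order (illustrative copy)
disease <- c("No", "No", "No", "Yes", "Yes", "Yes", "No")

# A list with one copy of the vector per pattern position,
# shifted forward by 0, 1, 2, 3 and 4 days (padded with NA at the end)
shifted <- shift(disease, n = 0:(length(pattern) - 1), type = "lead")

# Compare each shifted copy with its pattern element, then AND the five results
matches <- Reduce(`&`, Map(`==`, shifted, pattern))

which(matches)  # row 1, where the five-day pattern starts for ID 2
```

Unlike the lag-based approach, the match is flagged on the first day of the pattern rather than the last, which makes no difference when only distinct IDs are counted.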
answered 2020-05-07T12:28:58.100