r - How to create a cumulative dropout rate table from raw data

Question

I want to create a cumulative dropout rate table from this data.

library(data.table)

DT <- data.table(
  id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
         11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
         21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35),
  year = c(2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014,
           2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,
           2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016),
  cohort = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
             2, 2, 2, 1, 1, 2, 1, 2, 1, 2,
             1, 1, 3, 3, 3, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3))

So far I have been able to get to this point:

library(tidyverse)

DT %>%
  group_by(year) %>%
  count(cohort) %>%
  ungroup() %>%
  spread(year, n) %>%
  mutate(y2014_2015_dropouts = `2014` - `2015`,
         y2015_2016_dropouts = `2015` - `2016`) %>%
  mutate(y2014_2015_cumulative = y2014_2015_dropouts / `2014`,
         y2015_2016_cumulative = y2015_2016_dropouts / `2014` + y2014_2015_cumulative) %>%
  replace_na(list(y2014_2015_dropouts = 0.0,
                  y2015_2016_dropouts = 0.0)) %>%
  select(cohort, y2014_2015_dropouts, y2015_2016_dropouts,
         y2014_2015_cumulative, y2015_2016_cumulative)

A cumulative dropout rate table shows the proportion of students in a class who have dropped out over the years.

# A tibble: 3 x 5
  cohort y2014_2015_dropouts y2015_2016_dropouts y2014_2015_cumulative y2015_2016_cumulative
   <dbl>               <dbl>               <dbl>                 <dbl>                 <dbl>
1      1                   6                   2                   0.6                   0.8
2      2                   0                   2                  NA                    NA
3      3                   0                   0                  NA                    NA

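The last two columns of the tibble show that by the end of 2014-2015, 60% of the students in cohort 1 had dropped out; by the end of 2015-2016, 80% of cohort 1 had dropped out.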
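As a quick check on those two figures, the arithmetic can be reproduced from cohort 1's head counts in the data above (10 students in 2014, 4 in 2015, 2 in 2016). This is only a sketch; the variable names are illustrative and not part of the pipeline.

```r
# Cohort 1 head counts per year, counted from the data above
n <- c(`2014` = 10, `2015` = 4, `2016` = 2)

# Students lost between consecutive years
dropouts <- -diff(n)

# Cumulative share of the 2014 starting class that has left
cumulative_rate <- cumsum(dropouts) / n[["2014"]]
cumulative_rate
```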

I want to calculate the same values for cohorts 2 and 3, but I don't know how to do it.

score 2 · Accepted Answer

Here is an alternative data.table solution that organizes your data in a way I find easier to work with. Using your DT input data:

Count and sort by cohort and year:

DT2 <- DT[, .N, list(cohort, year)][order(cohort, year)]

Assign the year ranges:

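# Convert first, because := by group cannot change a column's type.
DT2[, year := as.character(year)]
# shift() is data.table's lag; grouping by cohort keeps a cohort's first
# label from borrowing the last year of the previous cohort.
DT2[, year := paste(shift(year), year, sep = "_"), by = cohort]

Dropouts per year:

DT2[, dropouts := ifelse(!is.na(shift(N)), shift(N) - N, 0), by = cohort]

Get the cumulative sum of the proportion that dropped out each year, per cohort:

DT2[, cumul := cumsum(dropouts) / max(N), by = cohort]

Output:

> DT2
   cohort      year  N dropouts     cumul
1:      1   NA_2014 10        0 0.0000000
2:      1 2014_2015  4        6 0.6000000
3:      1 2015_2016  2        2 0.8000000
4:      2   NA_2015  6        0 0.0000000
5:      2 2015_2016  4        2 0.3333333
6:      3   NA_2016  9        0 0.0000000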
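
If the goal is a table with one row per cohort, as in the question, the long result can be reshaped. The following is a minimal, self-contained sketch that assumes data.table's dcast() is an acceptable way to do that. It rebuilds the question's data with rep(), keeps year numeric, and uses the illustrative names counts and wide, none of which come from the answer itself.

```r
library(data.table)

# Same data as in the question, written more compactly
DT <- data.table(
  id     = 1:35,
  year   = rep(c(2014, 2015, 2016), times = c(10, 10, 15)),
  cohort = c(rep(1, 10),
             2, 2, 2, 1, 1, 2, 1, 2, 1, 2,
             1, 1, 3, 3, 3, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3)
)

# Head count per cohort and year, sorted so that shift() sees the previous year
counts <- DT[, .N, by = .(cohort, year)][order(cohort, year)]

# Dropouts since the previous year (0 in a cohort's first year),
# then the running share of the cohort's largest head count
counts[, dropouts := shift(N, fill = N[1L]) - N, by = cohort]
counts[, cumul := cumsum(dropouts) / max(N), by = cohort]

# One row per cohort, one column per year; years before a cohort exists stay NA
wide <- dcast(counts, cohort ~ year, value.var = "cumul")
print(wide)
```

Each year column then holds the cumulative rate reached by that year, so the 2016 column is the figure to read for the end of 2015-2016.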

score 1 · Accepted Answer

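Because you spread the data by year in your pipeline, your 2014 column contains NA for everything related to cohort 2, so you need to coalesce the denominator when calculating y2015_2016_cumulative. If you change the definition of that variable from the current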

y2015_2016_cumulative =y2015_2016_dropouts/`2014`+y2014_2015_cumulative

to

y2015_2016_cumulative =y2015_2016_dropouts/coalesce(`2014`, `2015`) +
coalesce(y2014_2015_cumulative, 0)

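you should be good to go. The coalesce function tries its first argument and falls back to the second if the first is NA. That said, this approach does not scale well: you have to add another coalesce call for every year you add. If you keep your data in a tidy format, you can keep a running tally at the year-cohort level with the following approach: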

DT %>%
  count(year, cohort) %>%                     # head count per year and cohort
  group_by(cohort) %>%
  mutate(dropouts = lag(n) - n,               # change since the previous year
         dropout_rate = dropouts / max(n)) %>%
  replace_na(list(dropouts = 0, n = 0, dropout_rate = 0)) %>%
  mutate(cumulative_dropouts = cumsum(dropouts),
         cumulative_dropout_rate = cumulative_dropouts / max(n))
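
To see what the coalesce() fallback from the first part of this answer does, it can be tried on plain vectors. The sketch below assumes the per-year head counts as they look after spreading (NA where a cohort has no students yet); the vector names are made up for illustration. Passing every year to coalesce() picks each cohort's first non-missing count, which is one way to get a starting size without hard-coding a year.

```r
library(dplyr)

# Head counts for cohorts 1, 2, 3 (in that order) after spreading by year
n_2014 <- c(10, NA, NA)
n_2015 <- c(4, 6, NA)
n_2016 <- c(2, 4, 9)

# coalesce() returns, element by element, the first value that is not NA
coalesce(n_2014, n_2015)          # 10  6 NA

# With every year supplied, each cohort gets its first recorded head count
start_n <- coalesce(n_2014, n_2015, n_2016)

# Share of each cohort's starting class that has left by 2016
rate_2016 <- (start_n - n_2016) / start_n
rate_2016
```

This still needs one argument per year, which is the scaling limit mentioned above and the reason the tidy pipeline is easier to extend.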
