r - Fill count/sum based on previous row count over time series

Question

I have performed counts of events (in Group 1) over a time period for each group (in Group 2). I am looking to spread Group 1 events into separate columns, and using Group 2 and timestamp as rows. Each cell will contain the counts of events over a time period (Present date to the previous 4 days).

See the example below, for each of the Group 2 (I & II) I counted Events A and L in Group 1 happened within 4 days.

dates = as.Date(c("2011-10-09",
   "2011-10-15",
   "2011-10-16", 
   "2011-10-18", 
   "2011-10-21", 
   "2011-10-22", 
   "2011-10-24")) 
group1=c("A",
    "A",
    "A", 
    "A", 
    "L", 
    "L", 
    "A")
group2=c("I",
    "I",
    "I", 
    "I", 
    "I", 
    "I", 
    "II")

df1 <- data.frame(dates, group1, group2)

Using dplyr pipes I managed to produce the following table (also see Count event types over time series by multiple conditions)

df1 %>%
  group_by(group1, group2) %>%
  mutate(count = sapply(dates
                    , function(x){
                      sum(dates <= x & dates > (x-4))
                      }))


   dates group1 group2 count
  <date> <fctr> <fctr> <int>
1 2011-10-09      A      I     1
2 2011-10-15      A      I     1
3 2011-10-16      A      I     2
4 2011-10-18      A      I     3
5 2011-10-21      L      I     1
6 2011-10-22      L      I     2
7 2011-10-24      A     II     1

Eventually, I want to obtain a table similar to this, with Events A & L counts update according to dates (time period = current date - 4 days) in both I & II (Group 2).

         dates  group1 group2  count (A)   count (L)
     1 2011-10-09      A      I        1         0
     2 2011-10-15      A      I        1         0
     3 2011-10-16      A      I        2         0
     4 2011-10-18      A      I        3         0
     5 2011-10-21      L      I        0         1
     6 2011-10-22      L      I        0         2
     7 2011-10-24      A      II       1         0

In a larger dataset, not all events in Group 1 appears in every Group 2. How can I update these empty cells so that it will either 1) carry forward the count from the previous row or 2) update the count based on the updated timestamp/ time period?

Thanks!

score 0 · Accepted Answer

尽管您还不清楚您想要什么（请参阅有关问题的评论），但这里有两种可能的方法。

如果您只想将count列展开（出于某种原因）并用 0 填充（在前 4 天内是否发生过事件）并且仍然按group2细分计数（即使您只标记group1）并保留事件详细信息（例如您的问题中的示例），您只需创建一个带有所需标签的列，然后用于spread创建新列。这个

df1 %>%
  group_by(group1, group2) %>%
  mutate(count = sapply(dates
                        , function(x){
                          sum(dates <= x & dates > (x-4))
                        })) %>%
  ungroup() %>%
  mutate(toSpread = paste0("Count (", group1, ")")) %>%
  spread(toSpread, count, fill = 0)

返回这个：

       dates group1 group2 `Count (A)` `Count (L)`
*     <date> <fctr> <fctr>       <dbl>       <dbl>
1 2011-10-09      A      I           1           0
2 2011-10-15      A      I           1           0
3 2011-10-16      A      I           2           0
4 2011-10-18      A      I           3           0
5 2011-10-21      L      I           0           1
6 2011-10-22      L      I           0           2
7 2011-10-24      A     II           1           0

这与您在问题中显示的输出相匹配。但是，如果您想要在任何一天计算每个 group1 的事件发生了多少，您将需要退后一步。为此，您需要生成一个包含所需日期的新数据框——每个组都有一行。这很容易complete从tidyr. 然后，您可以检查其中的每一个，以了解该组在前四天发生的事件。

df1 %>%
  select(dates, group1) %>%
  complete(dates, group1) %>%
  mutate(count = sapply(1:n()
                        , function(idx){
                          sum(df1$dates <= dates[idx] &
                                df1$dates > (dates[idx]-4) &
                                df1$group1 == group1[idx])
                        })) %>%
  mutate(group1 = paste0("Count (", group1, ")")) %>%
  spread(group1, count, fill = 0)

返回：

# A tibble: 7 x 3
       dates `Count (A)` `Count (L)`
*     <date>       <dbl>       <dbl>
1 2011-10-09           1           0
2 2011-10-15           1           0
3 2011-10-16           2           0
4 2011-10-18           3           0
5 2011-10-21           1           1
6 2011-10-22           0           2
7 2011-10-24           1           2

请注意，如果您想包含没有事件的日期，您可以通过将要签入的日期传递给complete. 例如：

df1 %>%
  select(dates, group1) %>%
  complete(dates = full_seq(dates, 1), group1) %>%
  mutate(count = sapply(1:n()
                        , function(idx){
                          sum(df1$dates <= dates[idx] &
                                df1$dates > (dates[idx]-4) &
                                df1$group1 == group1[idx])
                        })) %>%
  mutate(group1 = paste0("Count (", group1, ")")) %>%
  spread(group1, count, fill = 0)

返回：

        dates `Count (A)` `Count (L)`
 *     <date>       <dbl>       <dbl>
 1 2011-10-09           1           0
 2 2011-10-10           1           0
 3 2011-10-11           1           0
 4 2011-10-12           1           0
 5 2011-10-13           0           0
 6 2011-10-14           0           0
 7 2011-10-15           1           0
 8 2011-10-16           2           0
 9 2011-10-17           2           0
10 2011-10-18           3           0
11 2011-10-19           2           0
12 2011-10-20           1           0
13 2011-10-21           1           1
14 2011-10-22           0           2
15 2011-10-23           0           2
16 2011-10-24           1           2

根据评论，我想我终于理解了目标。首先，如上所述，我将首先创建一个“长”数据框，其中包含每个日期的每个 group1/group2 对的计数：

fullDateCounts <-
  df1 %>%
  select(dates, group1, group2) %>%
  complete(dates = full_seq(dates, 1), group1, group2) %>%
  mutate(count = sapply(1:n()
                        , function(idx){
                          sum(df1$dates <= dates[idx] &
                                df1$dates > (dates[idx]-4) &
                                df1$group1 == group1[idx] &
                                df1$group2 == group2[idx]
                              )
                        }))

最重要的是：

        dates group1 group2 count
       <date> <fctr> <fctr> <int>
 1 2011-10-09      A      I     1
 2 2011-10-09      A     II     0
 3 2011-10-09      L      I     0
 4 2011-10-09      L     II     0
 5 2011-10-10      A      I     1
 6 2011-10-10      A     II     0
 7 2011-10-10      L      I     0
 8 2011-10-10      L     II     0
 9 2011-10-11      A      I     1
10 2011-10-11      A     II     0
# ... with 54 more rows

从那里开始，如果您真的需要转换为宽格式，您可以为每个 group2 （或 group1，如果您切换列名）使用一行：

fullDateCounts %>%
  mutate(group1 = paste0("Count (", group1, ")")) %>%
  spread(group1, count, fill = 0)

返回：

        dates group2 `Count (A)` `Count (L)`
 *     <date> <fctr>       <dbl>       <dbl>
 1 2011-10-09      I           1           0
 2 2011-10-09     II           0           0
 3 2011-10-10      I           1           0
 4 2011-10-10     II           0           0
 5 2011-10-11      I           1           0
 6 2011-10-11     II           0           0
 7 2011-10-12      I           1           0
 8 2011-10-12     II           0           0
 9 2011-10-13      I           0           0
10 2011-10-13     II           0           0
# ... with 22 more rows

或者，您可以为每个 group1/group2 对生成一个列：

fullDateCounts %>%
  mutate(toSpread = paste0("Count (", group1, "-", group2, ")")) %>%
  select(-group1, -group2) %>%
  spread(toSpread, count, fill = 0)

返回

        dates `Count (A-I)` `Count (A-II)` `Count (L-I)` `Count (L-II)`
 *     <date>         <dbl>          <dbl>         <dbl>          <dbl>
 1 2011-10-09             1              0             0              0
 2 2011-10-10             1              0             0              0
 3 2011-10-11             1              0             0              0
 4 2011-10-12             1              0             0              0
 5 2011-10-13             0              0             0              0
 6 2011-10-14             0              0             0              0
 7 2011-10-15             1              0             0              0
 8 2011-10-16             2              0             0              0
 9 2011-10-17             2              0             0              0
10 2011-10-18             3              0             0              0
11 2011-10-19             2              0             0              0
12 2011-10-20             1              0             0              0
13 2011-10-21             1              0             1              0
14 2011-10-22             0              0             2              0
15 2011-10-23             0              0             2              0
16 2011-10-24             0              1             2              0

r - Fill count/sum based on previous row count over time series

1 回答 1

Related