3

I have a raw data frame that looks like this:

test
   id class                time
1   1 start 2019-06-20 00:00:00
2   1   end 2019-06-20 00:05:00
3   1 start 2019-06-20 00:10:00
4   1   end 2019-06-20 00:15:00
5   2   end 2019-06-20 00:20:00
6   2 start 2019-06-20 00:25:00
7   2   end 2019-06-20 00:30:00
8   2 start 2019-06-20 00:35:00
9   3   end 2019-06-20 00:40:00
10  3 start 2019-06-20 00:45:00
11  3   end 2019-06-20 00:50:00
12  3 start 2019-06-20 00:55:00

My goal is to build an output table that, for each id, keeps only the pairs where a start is immediately followed by an end (ordered by time). The output would therefore look like:

output
  id               start                 end
1  1 2019-06-20 00:00:00 2019-06-20 00:05:00
2  1 2019-06-20 00:10:00 2019-06-20 00:15:00
3  2 2019-06-20 00:25:00 2019-06-20 00:30:00
4  3 2019-06-20 00:45:00 2019-06-20 00:50:00

I have tried the dplyr package, but the following attempt fails:

test %>% group_by(id) %>% arrange(time) %>% starts_with("start")
Error in starts_with(., "start") : is_string(match) is not TRUE

starts_with always throws an error. I would like to avoid writing a for loop because I am sure this can be handled by a few chain operations. Any ideas for a workaround in dplyr or data.table?

4

4 Answers

4

One possible approach:

test[, {
        # positions of each "start" whose next row (within the same id) is an "end"
        si <- which(class == "start" & shift(class, type = "lead") == "end")
        # id is already supplied by `by`, so it is not repeated in j
        .(start = time[si], end = time[si + 1L])
    }, by = .(id)]

output:

   id               start                 end
1:  1 2019-06-20 00:00:00 2019-06-20 00:05:00
2:  1 2019-06-20 00:10:00 2019-06-20 00:15:00
3:  2 2019-06-20 00:25:00 2019-06-20 00:30:00
4:  3 2019-06-20 00:45:00 2019-06-20 00:50:00

data:

library(data.table)
test <- fread("id,class,time
1,start,2019-06-20 00:00:00
1,end,2019-06-20 00:05:00
1,start,2019-06-20 00:10:00
1,end,2019-06-20 00:15:00
2,end,2019-06-20 00:20:00
2,start,2019-06-20 00:25:00
2,end,2019-06-20 00:30:00
2,start,2019-06-20 00:35:00
3,end,2019-06-20 00:40:00
3,start,2019-06-20 00:45:00
3,end,2019-06-20 00:50:00
3,start,2019-06-20 00:55:00")
Answered 2019-06-20T00:39:33.097
3

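I usually use cumsum() in these cases:

library(dplyr)
library(tidyr)

test %>% 
  group_by(id) %>%
  arrange(time, .by_group = TRUE) %>%   # sort by time within each id
  mutate(flag = cumsum(class == "start")) %>%   # counter increments at every "start"
  group_by(id, flag) %>%
  filter(n() == 2L) %>%   # keep only groups of exactly two rows
  ungroup() %>%
  spread(class, time) %>%
  select(id, start, end)   # drop flag and put start before end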
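
To see why the n() == 2L filter isolates start/end pairs, it can help to look at the flag column on its own. The following is only a minimal sketch on toy data that mirrors id 2 from the question; the names toy, flagged and sizes are made up for this illustration.

```r
library(dplyr)

# Toy data mirroring id 2 from the question: it opens with an unmatched "end".
toy <- tibble(
  id    = c(2L, 2L, 2L, 2L),
  class = c("end", "start", "end", "start"),
  time  = as.POSIXct(c("2019-06-20 00:20:00", "2019-06-20 00:25:00",
                       "2019-06-20 00:30:00", "2019-06-20 00:35:00"),
                     tz = "UTC")
)

flagged <- toy %>%
  group_by(id) %>%
  arrange(time, .by_group = TRUE) %>%
  mutate(flag = cumsum(class == "start")) %>%  # increments at every "start"
  ungroup()

flagged$flag
# 0 1 1 2

# Number of rows per (id, flag) group
sizes <- count(flagged, id, flag)
sizes$n
# 1 2 1
```

The leading end sits alone in flag 0 and the trailing start sits alone in flag 2, so only the flag 1 group (a start followed by an end) has two rows.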
Answered 2019-06-20T01:20:16.963
2

Using dplyr and tidyr, we can first filter the rows that follow the "start" then "end" pattern, create groups of 2 rows, and spread to wide format.

library(dplyr)
library(tidyr)

test %>%
  group_by(id) %>%
  filter(class == "start" & lead(class) == "end" | 
         class == "end" & lag(class) == "start") %>%
  group_by(group = gl(n()/2, 2)) %>%
  spread(class, time) %>%
  ungroup() %>%
  select(-group) %>%
  select(id, start, end)

#     id  start              end               
#   <int> <dttm>              <dttm>             
#1     1 2019-06-20 00:00:00 2019-06-20 00:05:00
#2     1 2019-06-20 00:10:00 2019-06-20 00:15:00
#3     2 2019-06-20 00:25:00 2019-06-20 00:30:00
#4     3 2019-06-20 00:45:00 2019-06-20 00:50:00
Answered 2019-06-20T00:45:11.207
2

You can keep each start row plus the end immediately after it (if any), then use dcast to switch from long to wide form:

test[, 
  if (.N >= 2) head(.SD, 2)
, by=.(g = rleid(id, cumsum(class=="start"))), .SDcols=names(test)][, 
  dcast(.SD, id + g ~ factor(class, levels=c("start", "end")), value.var="time")
]

   id g               start                 end
1:  1 1 2019-06-20 00:00:00 2019-06-20 00:05:00
2:  1 2 2019-06-20 00:10:00 2019-06-20 00:15:00
3:  2 4 2019-06-20 00:25:00 2019-06-20 00:30:00
4:  3 7 2019-06-20 00:45:00 2019-06-20 00:50:00

rleid and cumsum are used to find the sequences, and factor is needed to tell dcast the column order.
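
As a rough illustration of how the grouping variable g comes about, the sketch below rebuilds only the id and class columns of the question's data (the time column is left out for brevity) and computes g directly.

```r
library(data.table)

# Same id/class pattern as the question's data, without the time column.
test <- data.table(
  id    = rep(1:3, each = 4),
  class = c("start", "end", "start", "end",
            rep(c("end", "start", "end", "start"), 2))
)

# cumsum() increments at every "start"; rleid() opens a new run whenever
# either id or that counter changes from one row to the next.
test[, g := rleid(id, cumsum(class == "start"))]
test$g
# 1 1 2 2 3 4 4 5 6 7 7 8

# Runs with two rows are the start/end pairs
pairs <- test[, .N, by = g][N == 2L, g]
pairs
# 1 2 4 7
```

The two-row runs are g = 1, 2, 4 and 7, which are the g values shown in the output above.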

Side note: This is essentially the same as @cheetahfly's answer (I didn't realize it when I posted). Since the cumsum is increasing, it is sufficient to group by id + cumsum, and there's no need to use rleid (which is for tracking runs of values). The only difference is that my approach would keep the start and first end of a run like start, end, end, while the other answer would filter that run out with the n() == 2 check.
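
The difference between the two rules can be checked on a run with a duplicated end. This is only a sketch: the run table below is hypothetical and not part of the question's data, and the "exactly two rows" rule is written in data.table as a stand-in for the dplyr n() == 2 filter.

```r
library(data.table)

# Hypothetical run with a duplicated "end" (not in the question's data).
run <- data.table(
  id    = 1L,
  class = c("start", "end", "end"),
  time  = as.POSIXct("2019-06-20 00:00:00", tz = "UTC") + c(0, 300, 600)
)

# Group by id + cumsum only, without rleid.
run[, g := cumsum(class == "start"), by = id]

# Rule from this answer: first two rows of any group with at least two rows.
kept_head <- run[, if (.N >= 2) head(.SD, 2), by = .(id, g)]

# Stand-in for the n() == 2 filter: only groups of exactly two rows.
kept_exact <- run[, if (.N == 2) .SD, by = .(id, g)]

nrow(kept_head)   # 2: the start and the first end survive
nrow(kept_exact)  # 0: the whole run is dropped
```

Which behavior is preferable depends on whether a repeated end should invalidate the pair or simply be ignored.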

Answered 2019-06-20T01:27:50.957