I'm trying to replicate this SO question, but using the updated syntax with the across() function and moving away from the deprecated summarise_all() and funs().

Starting data

I have a database extract with one row per event type, as follows:

library(tidyverse)
library(zoo)

df_start <- tibble(shipment = c(rep("A",4), rep("B",4)), 
             stop = rep(c(1,1,2,2), 2),
             arrive_pickup = as.POSIXct(c("2021-01-01 07:00:00 UTC",NA, NA, NA,"2021-06-05 12:10:00 UTC", NA, NA, NA)),
             depart_pickup = as.POSIXct(c(NA,"2021-01-01 08:40:00 UTC", NA, NA, NA, "2021-06-05 16:58:00 UTC", NA, NA)),
             arrive_delivery = as.POSIXct(c(NA, NA, "2021-01-05 10:00:00 UTC",NA, NA, NA,"2021-06-08 10:58:00 UTC", NA)),
             depart_delivery = as.POSIXct(c(NA, NA, NA, "2021-01-05 11:30:00 UTC",NA, NA, NA,"2021-06-08 13:50:00 UTC"))
)

> df_start
# A tibble: 8 x 6
  shipment  stop arrive_pickup       depart_pickup       arrive_delivery     depart_delivery    
  <chr>    <dbl> <dttm>              <dttm>              <dttm>              <dttm>             
1 A            1 2021-01-01 07:00:00 NA                  NA                  NA                 
2 A            1 NA                  2021-01-01 08:40:00 NA                  NA                 
3 A            2 NA                  NA                  2021-01-05 10:00:00 NA                 
4 A            2 NA                  NA                  NA                  2021-01-05 11:30:00
5 B            1 2021-06-05 12:10:00 NA                  NA                  NA                 
6 B            1 NA                  2021-06-05 16:58:00 NA                  NA                 
7 B            2 NA                  NA                  2021-06-08 10:58:00 NA                 
8 B            2 NA                  NA                  NA                  2021-06-08 13:50:00

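Desired result

...and I would like to collapse the number of rows by grouping on shipment and stop, or even on shipment alone (I'm not sure whether leaving NAs in the final data frame affects the answer, but I'd like to be able to solve it either way).

df_finish1 # one desired result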

# A tibble: 4 x 6
  shipment  stop arrive_pickup       depart_pickup       arrive_delivery     depart_delivery    
  <chr>    <dbl> <dttm>              <dttm>              <dttm>              <dttm>             
1 A            1 2021-01-01 07:00:00 2021-01-01 08:40:00 NA                  NA                 
2 A            2 NA                  NA                  2021-01-05 10:00:00 2021-01-05 11:30:00
3 B            1 2021-06-05 12:10:00 2021-06-05 16:58:00 NA                  NA                 
4 B            2 NA                  NA                  2021-06-08 10:58:00 2021-06-08 13:50:00

df_finish2 # second/alternative desired result

# A tibble: 2 x 5
  shipment arrive_pickup       depart_pickup       arrive_delivery     depart_delivery    
  <chr>    <dttm>              <dttm>              <dttm>              <dttm>             
1 A        2021-01-01 07:00:00 2021-01-01 08:40:00 2021-01-05 10:00:00 2021-01-05 11:30:00
2 B        2021-06-05 12:10:00 2021-06-05 16:58:00 2021-06-08 10:58:00 2021-06-08 13:50:00

What I've researched and tried

Based on this SO question, the following does work:

df_1 <- df_start %>% 
  group_by(shipment, stop) %>%   # Two groupings
  summarise_all(funs(na.locf(., na.rm = FALSE, fromLast = FALSE))) %>% 
  filter(row_number()==n())
  
> df_1
# A tibble: 4 x 6
# Groups:   shipment, stop [4]
  shipment  stop arrive_pickup       depart_pickup       arrive_delivery     depart_delivery    
  <chr>    <dbl> <dttm>              <dttm>              <dttm>              <dttm>             
1 A            1 2021-01-01 07:00:00 2021-01-01 08:40:00 NA                  NA                 
2 A            2 NA                  NA                  2021-01-05 10:00:00 2021-01-05 11:30:00
3 B            1 2021-06-05 12:10:00 2021-06-05 16:58:00 NA                  NA                 
4 B            2 NA                  NA                  2021-06-08 10:58:00 2021-06-08 13:50:00

df_2 <- df_start %>% 
  group_by(shipment) %>%   # Single grouping
  summarise_all(funs(na.locf(., na.rm = FALSE, fromLast = FALSE))) %>% 
  filter(row_number()==n())

> df_2
# A tibble: 2 x 6
# Groups:   shipment [2]
  shipment  stop arrive_pickup       depart_pickup       arrive_delivery     depart_delivery    
  <chr>    <dbl> <dttm>              <dttm>              <dttm>              <dttm>             
1 A            2 2021-01-01 07:00:00 2021-01-01 08:40:00 2021-01-05 10:00:00 2021-01-05 11:30:00
2 B            2 2021-06-05 12:10:00 2021-06-05 16:58:00 2021-06-08 10:58:00 2021-06-08 13:50:00

However, I see that summarise_all() and funs() are deprecated and will not be maintained going forward, so I tried to work out how to use across() correctly, without success:

df_3 <- df_start %>% 
  group_by(shipment) %>% 
  summarise(across(everything()), na.locf(., na.rm = FALSE, fromLast = FALSE))

> df_3 <- df_start %>% 
+   group_by(shipment) %>% 
+   summarise(across(everything()), na.locf(., na.rm = FALSE, fromLast = FALSE))
Error: Problem with `summarise()` input `..2`.
x Input `..2` must be size 4 or 1, not 8.
i An earlier column had size 4.
i Input `..2` is `na.locf(., na.rm = FALSE, fromLast = FALSE)`.
i The error occurred in group 1: shipment = "A".

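I've read through vignette("colwise"), which describes the differences and suggests that I simply replace the syntax as shown above, but clearly I'm not doing it right. Help?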

2 Answers

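There are a couple of syntax issues in your code.

1. across() takes its columns and its function as the arguments .cols and .fns. In your code, the across() call is closed right after everything(), as across(everything()), so na.locf(...) ends up as a separate argument to summarise().

2. When you use . inside across(), you need to prefix the expression with ~ to indicate that you are passing a lambda expression as the function (see the .fns argument in ?across).

Combining these changes, you can use: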

library(dplyr)
library(zoo)

df_start %>% 
  group_by(shipment) %>% 
  summarise(across(everything(), ~na.locf(., na.rm = FALSE, fromLast = FALSE)))
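To make the role of ~ concrete, here is a minimal sketch on a made-up tibble (toy, g, x and y are illustrative names, not part of the question's data). It assumes only dplyr and suggests that the ~ shorthand and an ordinary anonymous function are interchangeable as the .fns argument:

```r
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
  g = c("a", "a", "b"),
  x = c(1, NA, 3),
  y = c(NA, 2, 4)
)

# Formula (lambda) shorthand: `.` stands for the current column
count_formula <- toy %>%
  group_by(g) %>%
  summarise(across(everything(), ~ sum(!is.na(.))))

# The same computation written as an explicit anonymous function
count_function <- toy %>%
  group_by(g) %>%
  summarise(across(everything(), function(col) sum(!is.na(col))))

# Both forms should give the same per-group counts of non-missing values
all.equal(count_formula, count_function)
```

Either form sits inside the across() parentheses, which is the placement the original attempt was missing.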

However, across() has everything() as its default .cols argument, and you can also apply the function without ~, so another way to write it is:

df_start %>% 
  group_by(shipment) %>% 
  summarise(across(.fns = na.locf, na.rm = FALSE, fromLast = FALSE))
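Note that na.locf(..., na.rm = FALSE) returns a vector as long as its input, so these pipelines keep every row of each group; the question's original code then kept only the last row with filter(row_number() == n()). As a rough sketch of collapsing straight to one row per shipment, the example below uses a hypothetical helper, last_non_na(), which is an assumption for illustration and not a dplyr or zoo function, on a small stand-in for df_start:

```r
library(dplyr)

# Hypothetical helper (not from dplyr or zoo): the last non-missing value
# of a vector, or a missing value of the same class if everything is NA.
last_non_na <- function(x) {
  keep <- x[!is.na(x)]
  if (length(keep) == 0) x[NA_integer_] else keep[length(keep)]
}

# Small stand-in for df_start with two event columns
events <- tibble(
  shipment = c("A", "A", "B", "B"),
  arrive   = as.POSIXct(c("2021-01-01 07:00:00", NA,
                          "2021-06-05 12:10:00", NA), tz = "UTC"),
  depart   = as.POSIXct(c(NA, "2021-01-01 08:40:00",
                          NA, "2021-06-05 16:58:00"), tz = "UTC")
)

# One value per column per group, so summarise() yields one row per shipment
collapsed <- events %>%
  group_by(shipment) %>%
  summarise(across(everything(), last_non_na))

collapsed
```

Because the helper always returns a single value, no trailing filter() step is needed in this variant.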
answered 2021-07-13T04:46:38.410

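Here is one option: after grouping by 'shipment' and 'stop', reorder each column so that its NA values come last, then filter out the rows in which every column is NA.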

library(dplyr)
df_start %>%
     group_by(shipment, stop) %>% 
     mutate(across(everything(), ~ .[order(is.na(.))])) %>% 
     filter(!if_all(everything(), is.na)) %>% 
     ungroup
# A tibble: 4 x 6
  shipment  stop arrive_pickup       depart_pickup       arrive_delivery     depart_delivery    
  <chr>    <dbl> <dttm>              <dttm>              <dttm>              <dttm>             
1 A            1 2021-01-01 07:00:00 2021-01-01 08:40:00 NA                  NA                 
2 A            2 NA                  NA                  2021-01-05 10:00:00 2021-01-05 11:30:00
3 B            1 2021-06-05 12:10:00 2021-06-05 16:58:00 NA                  NA                 
4 B            2 NA                  NA                  2021-06-08 10:58:00 2021-06-08 13:50:00
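The reordering step can be seen in isolation on a plain vector. This is a minimal base R sketch with made-up values: is.na() gives FALSE for present values and TRUE for missing ones, and order() sorts FALSE before TRUE while leaving ties in their original order.

```r
# Made-up vector for illustration
x <- c(NA, 3, NA, 1)

# FALSE (not missing) sorts before TRUE (missing); ties keep their order
idx <- order(is.na(x))
reordered <- x[idx]

reordered
# 3 1 NA NA
```

Inside mutate(across(...)) the same expression is applied to each column within each group, which pushes the all-NA rows to the bottom of the group so they can be filtered out.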

For the second case, use across:

df_start %>% 
   group_by(shipment) %>% 
   dplyr::summarise(across(contains("_"), ~ na.omit(.)))
# A tibble: 2 x 5
  shipment arrive_pickup       depart_pickup       arrive_delivery     depart_delivery    
  <chr>    <dttm>              <dttm>              <dttm>              <dttm>             
1 A        2021-01-01 07:00:00 2021-01-01 08:40:00 2021-01-05 10:00:00 2021-01-05 11:30:00
2 B        2021-06-05 12:10:00 2021-06-05 16:58:00 2021-06-08 10:58:00 2021-06-08 13:50:00
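This returns one row per shipment only because each timestamp column holds exactly one non-missing value within each shipment; na.omit() keeps every non-missing value, so a column with more or fewer would change the result size. A small sketch for checking that assumption before relying on na.omit(), using a made-up stand-in for df_start (events, non_na_counts and safe_to_omit are illustrative names):

```r
library(dplyr)

# Small stand-in for df_start with two timestamp columns
events <- tibble(
  shipment      = c("A", "A", "B", "B"),
  arrive_pickup = as.POSIXct(c("2021-01-01 07:00:00", NA,
                               "2021-06-05 12:10:00", NA), tz = "UTC"),
  depart_pickup = as.POSIXct(c(NA, "2021-01-01 08:40:00",
                               NA, "2021-06-05 16:58:00"), tz = "UTC")
)

# Count the non-missing values per shipment in each timestamp column
non_na_counts <- events %>%
  group_by(shipment) %>%
  summarise(across(contains("_"), ~ sum(!is.na(.))))

# TRUE when every timestamp column has exactly one value per shipment
safe_to_omit <- all(unlist(select(non_na_counts, -shipment)) == 1)

safe_to_omit
```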

In the OP, na.locf is used instead of na.omit, and there is also a typo: across is closed before it is given a function. If we compare the code in this post, the syntax is

...across(everything(), ~ .. # correct
...across(everything()) ... # incorrect 

So we only need to move the closing ) so that it comes after the ~ function(.) lambda.

df_start %>% 
  group_by(shipment) %>% 
  summarise(across(everything(), ~ na.locf(., na.rm = FALSE, fromLast = FALSE)))
answered 2021-07-13T01:01:24.193