r - 如何为数据整理编写高效的包装器，允许在调用包装器时关闭任何包装的部分

Question

为了简化数据处理，我编写了一个由几个处理数据的“动词函数”组成的包装函数。每个人对数据执行一项任务。但是，并非所有任务都适用于通过此过程的所有数据集，有时，对于某些数据，我可能想关闭一些“动词功能”，并跳过它们。

我试图了解是否存在在 R 中的包装函数中构建此类工作流的常规/规范方法。重要的是，一种高效的方式，无论是性能方面还是简洁的代码。

例子

作为数据整理的一部分，我想执行几个步骤：

清理列标题（使用janitor::clean_names()）
重新编码数据中的值，以便将TRUEandFALSE替换为1and 0（使用gsub()）。
将字符串值重新编码为小写（使用tolower()）。
根据特定id列更宽地旋转（使用tidyr::pivot_wider）
删除带有NA值的行（使用dplyr::drop_na()）

玩具数据

library(stringi)
library(tidyr)

set.seed(2021)

# simulate data
df <-
  data.frame(id = 1:20,
           isMale = rep(c("true", "false"), times = 10),
           WEIGHT = sample(50:100, 20),
           hash_Numb = stri_rand_strings(20, 5)) %>%
  cbind(., score = sample(200:800, size = 20))

# sprinkle NAs randomly
df[c("isMale", "WEIGHT", "hash_Numb", "score")] <-
  lapply(df[c("isMale", "WEIGHT", "hash_Numb", "score")], function(x) {
    x[sample(seq_along(x), 0.25 * length(x))] <- NA
    x
  })


df <- 
  df %>%
  tidyr::expand_grid(., Condition = c("A","B"))

df
#> # A tibble: 40 x 6
#>       id isMale WEIGHT hash_Numb score Condition
#>    <int> <chr>   <int> <chr>     <int> <chr>    
#>  1     1 <NA>       56 EvRAq        NA A        
#>  2     1 <NA>       56 EvRAq        NA B        
#>  3     2 false      87 <NA>        322 A        
#>  4     2 false      87 <NA>        322 B        
#>  5     3 true       95 13pXe       492 A        
#>  6     3 true       95 13pXe       492 B        
#>  7     4 <NA>       88 4WMBS       626 A        
#>  8     4 <NA>       88 4WMBS       626 B        
#>  9     5 true       NA Nrl1W       396 A        
#> 10     5 true       NA Nrl1W       396 B        
#> # ... with 30 more rows

^{由reprex 包(v0.3.0)}
创建于 2021-03-03 数据显示了在两种条件下参加考试的 20 人的考试成绩。对于每个人，我们还知道性别 ( isMale)、体重 ( WEIGHT) 和唯一的hash_number.

数据清理和整理
在将这些数据发送到分析之前，需要根据我在上面列出的某个步骤链对其进行清理。

library(janitor)
library(dplyr)

# helper function
convert_true_false_to_1_0 <- function(x) {
  
  first_pass <- gsub("^(?:TRUE)$", 1, x, ignore.case = TRUE)
  gsub("^(?:FALSE)$", 0, first_pass, ignore.case = TRUE)
}

# chain of steps
df %>%
  janitor::clean_names() %>%
  mutate(across(everything(), convert_true_false_to_1_0)) %>%
  mutate(across(everything(), tolower)) %>%
  pivot_wider(names_from = condition, values_from = score) %>%
  drop_na()

我的问题：如何将此过程打包在允许灵活关闭某些步骤的包装器中？
我的一个想法是使用%>%带有条件的管道，例如：

my_wrangling_wrapper <- function(dat,
                                 clean_names       = TRUE, 
                                 convert_tf_to_1_0 = TRUE, 
                                 convert_to_lower  = TRUE, 
                                 pivot_widr        = TRUE,
                                 drp_na            = TRUE){
  dat %>%
    {if (clean_names)       janitor::clean_names(.)                                     else .} %>%
    {if (convert_tf_to_1_0) mutate(., across(everything(), convert_true_false_to_1_0))  else .} %>%
    {if (convert_to_lower)  mutate(., across(everything(), tolower))                    else .} %>%
    {if (pivot_widr)        pivot_wider(., names_from = condition, values_from = score) else .} %>%
    {if (drp_na)            drop_na(.)                                                  else .}
}

这样，所有步骤都默认发生，除非关闭：

用例 #1 - 默认运行：

> my_wrangling_wrapper(dat = df)

## # A tibble: 6 x 6
##   id    is_male weight hash_numb a     b    
##   <chr> <chr>   <chr>  <chr>     <chr> <chr>
## 1 3     1       95     13pxe     492   492  
## 2 9     1       54     hgzxp     519   519  
## 3 12    0       72     vwetc     446   446  
## 4 15    1       52     qadxc     501   501  
## 5 17    1       71     g42vg     756   756  
## 6 18    0       80     qiejd     712   712

用例 #2 - 不要将true/转换false为1/0并且不要删除NAs：

> my_wrangling_wrapper(dat = df, convert_tf_to_1_0 = FALSE, drp_na = FALSE)

## # A tibble: 20 x 6
##    id    is_male weight hash_numb a     b    
##    <chr> <chr>   <chr>  <chr>     <chr> <chr>
##  1 1     NA      56     evraq     NA    NA   
##  2 2     false   87     NA        322   322  
##  3 3     true    95     13pxe     492   492  
##  4 4     NA      88     4wmbs     626   626  
##  5 5     true    NA     nrl1w     396   396  
##  6 6     false   NA     4oq74     386   386  
##  7 7     true    NA     gg23f     NA    NA   
##  8 8     false   94     NA        NA    NA   
##  9 9     true    54     hgzxp     519   519  
## 10 10    false   97     NA        371   371  
## 11 11    true    90     NA        768   768  
## 12 12    false   72     vwetc     446   446  
## 13 13    NA      NA     jkhjh     338   338  
## 14 14    false   NA     0swem     778   778  
## 15 15    true    52     qadxc     501   501  
## 16 16    false   75     NA        219   219  
## 17 17    true    71     g42vg     756   756  
## 18 18    false   80     qiejd     712   712  
## 19 19    NA      68     tadad     NA    NA   
## 20 20    NA      53     iyw3o     NA    NA

我的问题

尽管我提出的解决方案确实有效，但我了解到在函数中不建议依赖管道运算符，因为它会减慢过程（请参阅参考资料）。此外，由于%>%不是的一部分base R，因此必须有一种方法可以在没有管道的情况下实现相同的“可调整包装”功能。所以我想知道：是否有一种传统的方法来编写一个可以调整以关闭其某些组件的包装函数，并且总体上仍然保持性能高效？

{值得一提的是，我已经问过一个类似的问题，关于为构建包装器ggplot，geoms根据需要关闭。答案很好，但不适用于当前问题。}

score 3 · Accepted Answer

留下来%>%，您可以创建一个功能序列：

library(magrittr)

my_wrangling_wrapper =
  . %>%
  janitor::clean_names() %>%
  mutate(across(everything(), convert_true_false_to_1_0)) %>%
  mutate(across(everything(), tolower)) %>%
  pivot_wider(names_from = condition, values_from = score) %>%
  drop_na()

由于此序列的行为类似于列表，您可以通过选择元素来决定使用哪些步骤：

clean_names       = TRUE
convert_tf_to_1_0 = TRUE 
convert_to_lower  = FALSE 
pivot_widr        = FALSE
drp_na            = TRUE

my_wrangling_wrapper[c(clean_names,
                       convert_tf_to_1_0,
                       convert_to_lower,
                       pivot_widr,
                       drp_na)]

#Functional sequence with the following components:
#
# 1. janitor::clean_names(.)
# 2. mutate(., across(everything(), convert_true_false_to_1_0))
# 3. drop_na(.)

df %>% my_wrangling_wrapper[c(clean_names,
                               convert_tf_to_1_0,
                               convert_to_lower,
                               pivot_widr,
                               drp_na)]()

#  id is_male weight hash_numb score
#1  1       1     51     Zm1Xx   343
#2  3       1     99     Xc2rm   703
#3  6       0     62     2r2cP   243
#4 12       0     84     llI0f   297
#5 16       0     72     AO76M   475
#6 18       0     63     zGJmW   376

如果没有%>%，您可以使用等效的freduce解决方案：

clean_names  <- function(x) janitor::clean_names(x,dat)   

convert_tf_to_1_0 <- function(x) mutate(x,dat, across(everything(),
                                               convert_true_false_to_1_0)) 

convert_to_lower <- function(x) mutate(x,dat, across(everything(), tolower))
         
pivot_widr <- function(x) pivot_wider(x,dat, names_from = condition,
                                             values_from = score) 

drp_na <- function(x) drop_na(x, dat) 

my_wrangling_list <- list(clean_names, convert_tf_to_1_0, drp_na)
magrittr::freduce(df, my_wrangling_list)

或使用%>%and freduce：

df %>% freduce(my_wrangling_list)

我不会太担心管道开销，请在您引用的链接中看到这个答案：比较毫秒时，管道会产生影响，但是当涉及到更大的计算时，管道开销变得可以忽略不计。

score 3 · Accepted Answer

您可以使用闭包来实现与@Waldi 答案中说明的功能序列相同的功能。就像是：

#we build a wrapper generator providing an arbitrary number of functions to apply
wrapperGenerator<-function(...) {
    flist<-list(...)
    function(data, conf = rep(TRUE, length(flist))) {
        if (!is.logical(conf) || (length(conf)!=length(flist)))
            stop("Wrong conf")
        for (i in seq_along(flist)) {
            if (conf[[i]])
                data<-flist[[i]](data)  
        }
        data
    }
}

#An example for string manipulation
wg<-wrapperGenerator(tolower, function(x) paste0(x,"_suff"), function(x) substring(x,1,5))

#some usage
require(stringi)
set.seed(1)
data<-stri_rand_strings(10,10)
data
#[1] "GNZuCtwed3" "CAgNlUizNm" "vDe7GN0NrL" "TbUBpfn6iP" "oemYWm1Tjg"
#[6] "TrRF46JWfP" "uISKeFTl5s" "LqLKTtrOmx" "QiOKkCi7F8" "E3dsmnSPob"

#Full pipeline
wg(data)
#[1] "gnzuc" "cagnl" "vde7g" "tbubp" "oemyw" "trrf4" "uiske" "lqlkt" "qiokk"
#[10] "e3dsm"

#Just the first two steps
wg(data,c(TRUE,TRUE,FALSE))
# [1] "gnzuctwed3_suff" "cagnluiznm_suff" "vde7gn0nrl_suff" "tbubpfn6ip_suff"
# [5] "oemywm1tjg_suff" "trrf46jwfp_suff" "uiskeftl5s_suff" "lqlkttromx_suff"
# [9] "qiokkci7f8_suff" "e3dsmnspob_suff"

编辑

添加一些关于上述工作原理的评论。这wrapperGenerator是一个返回函数的函数，它只是通过提供你想要包装的函数来构建的。这里不需要数据。的值wrapperGenerator本身就是一个函数（wg在示例中），您可以将其应用于实际数据。通过向此函数提供附加conf参数，您可以知道要执行哪些步骤。

闭包是 R 中一个非常强大的工具。在这里您可以找到有关该主题的必读内容。

score 1 · Accepted Answer

我会（作为@Nicola）使用闭包，但使用（恕我直言）稍微干净的界面：

function_factory <- function(...) {
   all_fns <- list(...)
   ## ... arguments must be named
   stopifnot(!is.null(names(all_fns)))
   function(x, ...) {
      selected_fns <- rev(as.character(rlang::ensyms(...)))
      ## if nothing was selected chose everything
      if (!length(selected_fns)) {
         selected_fns <- rev(names(all_fns))
      }
      stopifnot(all(selected_fns %in% names(all_fns)))
      ## function compose operator
      `%.%` <- function(f1, f2) function(...) f1(f2(...))
      fn_seq <- Reduce(`%.%`, all_fns[selected_fns])
      fn_seq(x)
   }
}

## define all potential functions via named(!) arguments to funciton_factory
fn_f <- function_factory(multiply_by_1000 = function(x) x * 1000,
                         make_negative    = function(x) -abs(x),
                         add_100          = function(x) x + 100)

## example with vector as input
x <- 1:10

## to apply a selected subset simply provide the names of the chunks
fn_f(x, add_100)
fn_f(x, multiply_by_1000, make_negative)
## order matters
fn_f(x, add_100, make_negative)
fn_f(x, make_negative, add_100)

## example with data.frame as input
library(dplyr)
m2 <- mtcars
m2 <- m2 %>% 
   mutate(across(everything(), .fns = function(x) {
      x[sample(length(x), 5)] <- NA
      x
   }))
fn_fd <- function_factory(replace_nas = function(data) mutate(data, across(everything(), .fns = coalesce, -1)),
                          round       = function(data) mutate(data, across(where(is.double), .fns = round, 0)),
                          append_new  = function(data) mutate(data, across(c(vs, am), .fns = paste0, "_new")))
fn_fd(m2, replace_nas, round)
fn_fd(m2, replace_nas, append_new, round)
fn_fd(m2, replace_nas, round, append_new)

## toy example from OP
toy_f <- function_factory(clean_names = clean_names,
                          convert_0_1 = function(x) mutate(x, across(everything(), convert_true_false_to_1_0)),
                          to_lower    = function(x) mutate(x, across(everything(), tolower)),
                          pivot       = function(x) pivot_wider(x, names_from = condition, values_from = score),
                          dropna      = drop_na)

## do all
toy_f(df)
## everything but conversion and dropping
toy_f(df, clean_names, to_lower, pivot)

解释

function_factory是 a closure，即它将所有命名（！）参数存储在all_fns并返回 a function。这个想法是这个函数现在可以访问all_fns并发挥它的魔力。
返回的函数还...用于查看我们想要使用哪些元素（rlang::ensyms部分是语法糖，因为使用这种方法我们可以指定不带引号的函数名称）
然后，我们使用将所有选定的函数组合成一个函数Reduce，该函数依次组合给定向量[1] 的元素（这也可以通过来完成purrr::compose）
最终我们将组合函数应用于我们的数据，瞧

[1]?Reduce

score 0 · Accepted Answer

一种方法是

my_wrangling_wrapper <- function(dat,
                                 clean_names       = TRUE, 
                                 convert_tf_to_1_0 = TRUE, 
                                 convert_to_lower  = TRUE, 
                                 pivot_widr        = TRUE,
                                 drp_na            = TRUE){
  

  if (clean_names)       dat <- janitor::clean_names(dat)                                     
  if (convert_tf_to_1_0) dat <- mutate(dat, across(everything(), convert_true_false_to_1_0))  
  if (convert_to_lower)  dat <- mutate(dat, across(everything(), tolower))                    
  if (pivot_widr)        dat <- pivot_wider(dat, names_from = condition, values_from = score) 
  if (drp_na)            dat <- drop_na(dat)                                                  
  dat 
}

r - 如何为数据整理编写高效的包装器，允许在调用包装器时关闭任何包装的部分

例子

我的问题

4 回答 4

解释

Related

Reference