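I want to create an upgraded version of dplyr::bind_rows that avoids the "Unequal factor levels: coercing to character" warning when the data frames being combined contain factor columns (they may also contain non-factor columns). Here is an example: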

df1 <- dplyr::data_frame(age = 1:3, gender = factor(c("male", "female", "female")), district = factor(c("north", "south", "west")))
df2 <- dplyr::data_frame(age = 4:6, gender = factor(c("male", "neutral", "neutral")), district = factor(c("central", "north", "east")))

Then bind_rows_with_factor_columns(df1, df2) returns (without warnings):

dplyr::data_frame(
  age = 1:6,
  gender = factor(c("male", "female", "female", "male", "neutral", "neutral")),
  district = factor(c("north", "south", "west", "central", "north", "east"))
)

Here is what I have so far:

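bind_rows_with_factor_columns <- function(...) {
  # Collect the inputs once; passing `...` straight to map() would treat
  # the second data frame as the function argument
  dfs <- list(...)

  # Names of the factor columns in each data frame
  factor_columns <- purrr::map(dfs, function(df) {
    colnames(dplyr::select_if(df, is.factor))
  })

  if (length(unique(factor_columns)) > 1) {
    stop("All factor columns in dfs must have the same column names")
  }

  # Convert factors to character so bind_rows() has nothing to coerce
  df_list <- purrr::map(dfs, function(df) {
    purrr::map_if(df, is.factor, as.character) %>% dplyr::as_data_frame()
  })

  # Bind, then turn the former factor columns back into factors
  dplyr::bind_rows(df_list) %>%
    purrr::map_at(factor_columns[[1]], as.factor) %>%
    dplyr::as_data_frame()
}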

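I'm wondering whether anyone has ideas on how to incorporate the forcats package to potentially avoid coercing factors to character, or any general suggestions for improving performance while keeping the same functionality (I'd like to stick to tidyverse syntax). Thanks!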

2 Answers

Answering my own question, based on a great solution from a friend:

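bind_rows_with_factor_columns <- function(...) {
  # pmap walks the data frames column by column, so every input must have
  # the same columns in the same order
  purrr::pmap_df(list(...), function(...) {
    cols_to_bind <- list(...)
    if (all(purrr::map_lgl(cols_to_bind, is.factor))) {
      # Recent forcats versions take the factors through `...`, so the list
      # is spliced in with !!! (older versions accepted the list directly)
      forcats::fct_c(!!!cols_to_bind)
    } else {
      unlist(cols_to_bind)
    }
  })
}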
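
To see why fct_c helps here, a minimal sketch follows. The vectors g1 and g2 are made up for illustration and mirror the gender column from the question; the point is that fct_c combines factors with different levels and keeps the result a factor, rather than falling back to character.

```r
library(forcats)

# Two factors with different level sets (illustrative data only)
g1 <- factor(c("male", "female", "female"))
g2 <- factor(c("male", "neutral", "neutral"))

# fct_c() concatenates the factors and takes the union of their levels
combined <- fct_c(g1, g2)

is.factor(combined)
levels(combined)
```

The result has six values and the levels female, male and neutral, so no coercion to character takes place.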
Answered 2017-02-16T15:42:25.223

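It may be simpler to call dplyr::bind_rows with the warnings suppressed and then convert all the new character columns back to factors. This has the advantage of binding data.frames by column name (so columns may be in a different order and extra columns are allowed), and it still works when a factor variable is sometimes recorded as character.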

library(tidyverse)

bind_rows_keep_factors <- function(...) {
  ## Identify all columns that are a factor in at least one input
  factors <- unique(unlist(
    map(list(...), ~ select_if(.x, is.factor) %>% names())
  ))
  ## Bind dataframes, convert characters back to factors
  suppressWarnings(bind_rows(...)) %>%
    mutate_at(vars(one_of(factors)), factor)
}

dat1 <- tibble(
  id = 1:2,
  fruit = factor(c("banana", "apple"))
)

dat2 <- tibble(
  id = 3:4,
  fruit = c("pear", "banana"),
  taste = c("Mmmm", "yum")
)

bind_rows_keep_factors(dat1, dat2)
# A tibble: 4 x 3
     id fruit  taste
  <int> <fct>  <chr>
1     1 banana NA   
2     2 apple  NA   
3     3 pear   Mmmm 
4     4 banana yum 

Of course, the order of the factor levels is lost (it reverts to alphabetical order).
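
If the original level order matters, one option is to collect the levels yourself before binding. The following is only a rough sketch under some assumptions: bind_rows_keep_levels is a hypothetical name, the example tibbles a and b are made up, and the code relies on across() and where(), so it assumes dplyr 1.0 or later.

```r
library(dplyr)
library(purrr)

# Hypothetical variant that preserves level order (illustrative sketch)
bind_rows_keep_levels <- function(...) {
  dfs <- list(...)

  # Names of every column that is a factor in at least one input
  factor_cols <- unique(unlist(map(dfs, ~ names(.x)[map_lgl(.x, is.factor)])))

  # For each such column, collect levels in order of first appearance:
  # declared levels for factors, observed non-missing values otherwise
  level_sets <- map(set_names(factor_cols), function(col) {
    unique(unlist(map(dfs, function(df) {
      x <- df[[col]]
      if (is.factor(x)) levels(x) else unique(as.character(x[!is.na(x)]))
    })))
  })

  # Bind as character, then rebuild each factor with the collected levels
  dfs_chr <- map(dfs, ~ mutate(.x, across(where(is.factor), as.character)))
  out <- bind_rows(dfs_chr)
  for (col in factor_cols) {
    out[[col]] <- factor(out[[col]], levels = level_sets[[col]])
  }
  out
}

# Made-up example data: `size` has a deliberate, non-alphabetical level order
a <- tibble(id = 1:2,
            size = factor(c("small", "large"),
                          levels = c("small", "medium", "large")))
b <- tibble(id = 3:4, size = c("huge", "small"))

res <- bind_rows_keep_levels(a, b)
levels(res$size)
```

Here the levels come out as small, medium, large, huge: the declared order from the first data frame is kept, and values seen only as character are appended at the end. Whether appending is the right policy depends on the data, so treat this as a starting point.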

Answered 2018-04-06T01:57:46.893