r - Automatically recode factor levels that start with a given character?

Question

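I am looking for a way to automatically recode the factors in a variable based on a pattern in their levels. I intend to iterate the solution over a larger dataset.

I have a larger dataset that contains multiple instances of the example shown below. The levels tend to follow this pattern:

The main categories are 1, 2, 3, and 4. Levels 11, 12, 13, and 14 are subcategories of level 1. I would like to streamline the grouping process. I have successfully recoded the factor with fct_recode, but my intention is to extend this process to other variables that follow a similar coding pattern.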

library(tidyverse)

dat <- tribble(
  ~Ethnicity, 
  "1",
  "2",
  "3",
  "4",
  "11",
  "12",
  "13",
  "14",
  "11",
  "13",
  "12",
  "12",
  "11",
  "13")

dat <- mutate_at(dat, vars(Ethnicity), factor)

count(dat, Ethnicity)
#> # A tibble: 8 x 2
#>   Ethnicity     n
#>   <fct>     <int>
#> 1 1             1
#> 2 11            3
#> 3 12            3
#> 4 13            3
#> 5 14            1
#> 6 2             1
#> 7 3             1
#> 8 4             1

dat %>% 
  mutate(Ethnicity = fct_recode(Ethnicity,
                                "1" = "1",
                                "1" = "11",
                                "1" = "12",
                                "1" = "13",
                                "1" = "14"
                                )) %>% 
  count(Ethnicity)
#> # A tibble: 4 x 2
#>   Ethnicity     n
#>   <fct>     <int>
#> 1 1            11
#> 2 2             1
#> 3 3             1
#> 4 4             1

Created on 2019-05-31 by the reprex package (v0.2.1)

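As expected, this approach successfully groups the subcategories 11, 12, 13, and 14 into 1. Is there a way to do this without manually changing the level of each subcategory? What would be a general way to extend this process to several variables that share the same pattern? Thank you.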

score 1 · Accepted Answer

You can use fct_collapse with grep/regex and adjust the regex pattern as needed:

dat %>%
  mutate(Ethnicity = fct_collapse(Ethnicity, 
                                  "1" = unique(grep("^1", Ethnicity, value = TRUE)))) %>%
  count(Ethnicity)

# A tibble: 4 x 2
  Ethnicity     n
  <fct>     <int>
1 1            11
2 2             1
3 3             1
4 4             1
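If every subcategory should fold into the main category given by its first character, one possible generalization is to relabel all levels at once instead of naming each prefix. The sketch below assumes that the first character always identifies the main category, which is an assumption about the coding scheme rather than something the question guarantees. It uses forcats::fct_relabel, which merges levels that end up with the same label.

```r
library(dplyr)
library(forcats)

# Same example data as in the question
dat <- tibble(Ethnicity = factor(c("1", "2", "3", "4", "11", "12", "13", "14",
                                   "11", "13", "12", "12", "11", "13")))

# Relabel each level with its first character; levels that receive
# the same new label are collapsed into one
res <- dat %>%
  mutate(Ethnicity = fct_relabel(Ethnicity, function(x) substr(x, 1, 1))) %>%
  count(Ethnicity)

res
```

If the main category is encoded differently (for example by the first two characters), only the function passed to fct_relabel would need to change.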

Alternatively, and this feels a bit hacky, you can always use ifelse or case_when:

dat %>%
  # keep both branches as character: ifelse() on a factor returns its integer codes
  mutate(Ethnicity = factor(ifelse(startsWith(as.character(Ethnicity), "1"), "1", as.character(Ethnicity)))) %>%
  count(Ethnicity)

# A tibble: 4 x 2
  Ethnicity     n
  <fct>     <int>
1 1            11
2 2             1
3 3             1
4 4             1
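The case_when variant is mentioned but not shown, so here is a minimal sketch of what it might look like. It converts the factor to character before comparing, so that both branches return the same type, and rebuilds the factor at the end.

```r
library(dplyr)

# Same example data as in the question
dat <- tibble(Ethnicity = factor(c("1", "2", "3", "4", "11", "12", "13", "14",
                                   "11", "13", "12", "12", "11", "13")))

res <- dat %>%
  mutate(Ethnicity = factor(case_when(
    # every value starting with "1" becomes the main category "1"
    startsWith(as.character(Ethnicity), "1") ~ "1",
    # everything else keeps its original label
    TRUE ~ as.character(Ethnicity)
  ))) %>%
  count(Ethnicity)

res
```

Further prefixes could be handled by adding one line per main category above the TRUE fallback.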

score 1 · Accepted Answer

One option is to create a named vector and splice it into the call with !!!:

library(dplyr)
library(forcats)
lvls <- levels(dat$Ethnicity)[substr(levels(dat$Ethnicity), 1, 1) == 1]
nm1 <- setNames(lvls, rep(1, length(lvls)))
dat %>% 
     mutate(Ethnicity = fct_recode(Ethnicity, !!!nm1)) %>% 
     count(Ethnicity)
# A tibble: 4 x 2
#  Ethnicity     n
#  <fct>     <int>
#1 1            11
#2 2             1
#3 3             1
#4 4             1
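The same named-vector idea can be stretched to cover every main category in one go, rather than only the levels that start with 1. The sketch below is one way to do that, again assuming the first character of each level is its main category; the name nm_all is made up for illustration.

```r
library(dplyr)
library(forcats)

# Same example data as in the question
dat <- tibble(Ethnicity = factor(c("1", "2", "3", "4", "11", "12", "13", "14",
                                   "11", "13", "12", "12", "11", "13")))

# Values are the old levels, names are the new levels (the first character)
lv <- levels(dat$Ethnicity)
nm_all <- setNames(lv, substr(lv, 1, 1))

res <- dat %>%
  mutate(Ethnicity = fct_recode(Ethnicity, !!!nm_all)) %>%
  count(Ethnicity)

res
```

Because the vector is built from the levels of the column itself, it has to be rebuilt for each column it is applied to.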

Another option is to set the levels directly based on a substring:

levels(dat$Ethnicity)[substr(levels(dat$Ethnicity), 1, 1) == 1] <- 1
dat %>% 
   count(Ethnicity)

For multiple columns, use mutate_at and specify the variables of interest:

dat %>% 
    mutate_at(vars(colsOfInterest), list(~ fct_recode(., !!! nm1)))
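Since a named vector built from one column's levels may not match the levels of another column, a small helper that works on whatever levels it is given can be easier to reuse. The following is only a sketch: collapse_to_first_char, dat2, and the Region column are hypothetical names invented for this example, and across() assumes dplyr 1.0.0 or later (mutate_at with the same helper should behave the same way on older versions).

```r
library(dplyr)

# Hypothetical helper: replace each level with its first character.
# Assigning duplicate labels to levels() merges the matching levels.
collapse_to_first_char <- function(f) {
  f <- as.factor(f)
  levels(f) <- substr(levels(f), 1, 1)
  f
}

# Made-up data with two columns that follow the same coding pattern
dat2 <- tibble(
  Ethnicity = factor(c("1", "2", "11", "12", "13")),
  Region    = factor(c("2", "21", "22", "3", "31"))
)

res <- dat2 %>%
  mutate(across(c(Ethnicity, Region), collapse_to_first_char))

count(res, Ethnicity)
count(res, Region)
```

The helper derives the mapping from each column separately, so columns with different sets of subcategories can share it.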
