I would like to relevel a factor variable based on the values of another variable. For example:

factors <- structure(list(color = c("RED", "GREEN", "BLUE", "YELLOW", "BROWN"
), count = c(2, 5, 11, 1, 19)), row.names = c(NA, -5L), class = c("tbl_df", 
"tbl", "data.frame"))

> factors
# A tibble: 5 x 2
  color  count
  <chr>  <dbl>
1 RED        2
2 GREEN      5
3 BLUE      11
4 YELLOW     1
5 BROWN     19


This is what I would like to produce:

##Group all levels with count < 10 into "OTHER"

> factors.out
# A tibble: 3 x 2
  color count
  <chr> <dbl>
1 OTHER     8
2 BLUE     11
3 BROWN    19


I think this is a job for forcats::fct_lump():

##Keep 3 levels
factors %>%
  mutate(color = fct_lump(color, n = 3))
# A tibble: 5 x 2
  color  count
  <fct>  <dbl>
1 RED        2
2 GREEN      5
3 BLUE      11
4 YELLOW     1
5 BROWN     19


I know this can be done with:

factors %>%
  mutate(color = ifelse(count < 10, "OTHER", color)) %>%
  group_by(color) %>%
  summarise(count = sum(count))


But I think, or at least hope, that there is a way to do this within forcats.


1 Answer

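Since you already have a data.frame containing the factor and its counts, you can use those counts as weights when lumping the rarest observations together. The second stage just involves collapsing the OTHER category, as in your example.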

factors <- structure(list(color = c("RED", "GREEN", "BLUE", "YELLOW", "BROWN"),
  count = c(2, 5, 11, 1, 19)), row.names = c(NA, -5L), class = c("tbl_df", 
  "tbl", "data.frame"))

library("dplyr")
library("forcats")

factors.out <- factors %>%
  mutate(color = fct_lump(color, n = 2, other_level = "OTHER",
    w = count)) %>%
  group_by(color) %>%
  summarise(count = sum(count)) %>%
  arrange(count)

which gives:

factors.out 
# A tibble: 3 x 2
  color count
  <fct>  <dbl>
1 OTHER     8
2 BLUE     11
3 BROWN    19
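
Note that `n = 2` keeps the two heaviest levels, which only happens to coincide with the `count < 10` rule for this data. If the threshold itself is the requirement, a sketch along the following lines may be closer to what the question asks. It assumes forcats 0.5.0 or later, where I believe `fct_lump_min()` was introduced; `factors_min` is just an illustrative name.

```r
library(dplyr)
library(forcats)

# Same data as in the question, rebuilt with tibble() for brevity
factors <- tibble(
  color = c("RED", "GREEN", "BLUE", "YELLOW", "BROWN"),
  count = c(2, 5, 11, 1, 19)
)

# Lump every level whose weighted total is below 10 into "OTHER".
# Assumption: fct_lump_min() is available (forcats >= 0.5.0).
factors_min <- factors %>%
  mutate(color = fct_lump_min(color, min = 10, w = count,
                              other_level = "OTHER")) %>%
  group_by(color) %>%
  summarise(count = sum(count)) %>%
  arrange(count)

factors_min
```

For this data it should yield the same three rows as above, but the result would keep following the threshold if more levels were added, rather than always keeping exactly two.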
Answered 2018-08-02T17:22:24.013