I want to categorize a numeric variable in my data.frame object using dplyr (and have no idea how to do it).
Without dplyr, I would probably do something like:
df <- data.frame(a = rnorm(1e3), b = rnorm(1e3))
df$a <- cut(df$a , breaks=quantile(df$a, probs = seq(0, 1, 0.2)))
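and it would be done. However, I would strongly prefer to do it with dplyr (probably in mutate), as part of a chain of other operations I perform on my data.frame.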
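The ggplot2 package has 3 functions that work well for these tasks:
cut_number(): makes n groups with (approximately) equal numbers of observations
cut_interval(): makes n groups with equal range
cut_width(): makes groups of the given width
My go-to is cut_number(), because it bins observations using evenly spaced quantiles. Here is an example with skewed data.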
library(tidyverse)
skewed_tbl <- tibble(
    counts = c(1:100, 1:50, 1:20, rep(1:10, 3),
               rep(1:5, 5), rep(1:2, 10), rep(1, 20))
    ) %>%
    mutate(
        counts_cut_number   = cut_number(counts, n = 4),
        counts_cut_interval = cut_interval(counts, n = 4),
        counts_cut_width    = cut_width(counts, width = 25)
    )
# Data
skewed_tbl
#> # A tibble: 265 x 4
#>    counts counts_cut_number counts_cut_interval counts_cut_width
#>     <dbl> <fct>             <fct>               <fct>
#>  1      1 [1,3]             [1,25.8]            [-12.5,12.5]
#>  2      2 [1,3]             [1,25.8]            [-12.5,12.5]
#>  3      3 [1,3]             [1,25.8]            [-12.5,12.5]
#>  4      4 (3,13]            [1,25.8]            [-12.5,12.5]
#>  5      5 (3,13]            [1,25.8]            [-12.5,12.5]
#>  6      6 (3,13]            [1,25.8]            [-12.5,12.5]
#>  7      7 (3,13]            [1,25.8]            [-12.5,12.5]
#>  8      8 (3,13]            [1,25.8]            [-12.5,12.5]
#>  9      9 (3,13]            [1,25.8]            [-12.5,12.5]
#> 10     10 (3,13]            [1,25.8]            [-12.5,12.5]
#> # ... with 255 more rows
summary(skewed_tbl$counts)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>    1.00    3.00   13.00   25.75   42.00  100.00
# Histogram showing skew
skewed_tbl %>%
    ggplot(aes(counts)) +
    geom_histogram(bins = 30)
# cut_number() evenly distributes observations into bins by quantile
skewed_tbl %>%
    ggplot(aes(counts_cut_number)) +
    geom_bar()
# cut_interval() evenly splits the interval across the range
skewed_tbl %>%
    ggplot(aes(counts_cut_interval)) +
    geom_bar()
# cut_width() uses the width = 25 to create bins that are 25 in width
skewed_tbl %>%
    ggplot(aes(counts_cut_width)) +
    geom_bar()
Created on 2018-11-01 by the reprex package (v0.2.1)
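
Applied to the data frame from the question, the same idea might look like the sketch below. It assumes dplyr and ggplot2 are installed; the seed and the column name a_bin are illustrative choices, not part of the original question.

```r
library(dplyr)
library(ggplot2)  # provides cut_number()

set.seed(42)  # arbitrary seed, used only to make the sketch reproducible
df <- data.frame(a = rnorm(1e3), b = rnorm(1e3))

binned <- df %>%
    # a_bin is a hypothetical column name; five bins mirror seq(0, 1, 0.2)
    mutate(a_bin = cut_number(a, n = 5))

# Tabulate how many rows fall into each quantile bin
count(binned, a_bin)
```

Writing the bins to a new column keeps the original numeric values available for later steps in the chain; overwriting a directly, as in the question, works as well.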
library(dplyr)  # for %>% and mutate()
set.seed(123)
df <- data.frame(a = rnorm(10), b = rnorm(10))
df %>% mutate(a = cut(a, breaks = quantile(a, probs = seq(0, 1, 0.2))))
gives:
                 a          b
1  (-0.586,-0.316]  1.2240818
2   (-0.316,0.094]  0.3598138
3      (0.68,1.72]  0.4007715
4   (-0.316,0.094]  0.1106827
5     (0.094,0.68] -0.5558411
6      (0.68,1.72]  1.7869131
7     (0.094,0.68]  0.4978505
8             <NA> -1.9666172
9   (-1.27,-0.586]  0.7013559
10 (-0.586,-0.316] -0.4727914
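
The <NA> in row 8 appears because cut() builds intervals that are open on the left by default, so the minimum of a, which equals the lowest break, falls outside every bin. One way to avoid this, sketched here with the same seed and data, is to pass include.lowest = TRUE; the name fixed is only illustrative.

```r
library(dplyr)

set.seed(123)
df <- data.frame(a = rnorm(10), b = rnorm(10))

fixed <- df %>%
    mutate(a = cut(a,
                   breaks = quantile(a, probs = seq(0, 1, 0.2)),
                   include.lowest = TRUE))  # close the first interval on the left

# Check whether any value of a is still unassigned
anyNA(fixed$a)
```

With this argument the smallest value is placed in the first bin, and the other rows keep their bins.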