
Taking mtcars as an example:

 mtcars <- subset(mtcars, select = c("cyl", "disp"))

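How can I add two more columns, one indicating whether the value is below or above the median, and another indicating which quartile the value falls in? I want both computed within each cyl group.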

This is the specific result I'm hoping for:

                   cyl  disp    median_split    quartile_split
 Toyota Corolla    4    71.1    below_median    1st_quartile
 Honda Civic       4    75.7    below_median    1st_quartile
 Fiat 128          4    78.7    below_median    1st_quartile
 Fiat X1-9         4    79      below_median    2nd_quartile
 Lotus Europa      4    95.1    below_median    2nd_quartile
 Datsun 710        4    108     median          median
 Toyota Corona     4    120.1   above_median    3rd_quartile
 Porsche 914-2     4    120.3   above_median    3rd_quartile
 Volvo 142E        4    121     above_median    4th_quartile
 Merc 230          4    140.8   above_median    4th_quartile
 Merc 240D         4    146.7   above_median    4th_quartile
 Ferrari Dino      6    145     below_median    1st_quartile
 Mazda RX4         6    160     etc…            etc…

I'd appreciate any help. Thank you.

Edit following akrun's answer

In the quartile_split column, akrun's answer leaves the lowest value in each cyl group as NA. I think I can fix that by adding:

 mtcars$quartile_split[is.na(mtcars$quartile_split)] <- "1_quartile" #not a very elegant solution

So the full code is:

 library(dplyr)
 mtcars <- subset(mtcars, select = c("cyl", "disp"))
 # akrun's answer
 mtcars <- mtcars %>%
     group_by(cyl) %>% 
     mutate(median_split = c("above_median", "below_median")[1 + 
                   (disp <= median(disp))], 
            quartile_split = cut(disp, breaks = quantile(disp), 
                 labels = paste0(1:4, "_quartile")))
 # addition
 mtcars$quartile_split[is.na(mtcars$quartile_split)] <- "1_quartile" #not a very elegant solution

However, looking more closely, I also found something else that doesn't look right. Specifically, when you look only at the cyl = 6 group, you see:

 cyl  disp      median_split    quartile_split
 6    145       below_median    1_quartile
 6    160       below_median    1_quartile
 6    160       below_median    1_quartile
 6    167.6     below_median    2_quartile
 6    167.6     below_median    2_quartile
 6    225       above_median    4_quartile
 6    258       above_median    4_quartile

The median disp for this group is 163.8, so the two cars with disp = 167.6 should be classified as "above_median", not "below_median".
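The group median can be checked directly. This is a minimal sketch, assuming the unmodified built-in dataset (referenced as `datasets::mtcars` so that the subsetted copy above does not matter); the names `six` and `med6` are only illustrative. With the stock data the group has seven cars, so the median is the fourth sorted value, 167.6 rather than 163.8. That would make the two 167.6 cars ties with the median, which `disp <= median(disp)` labels as "below_median".

```r
# Sorted displacement values of the 6-cylinder cars in the built-in data
six <- sort(datasets::mtcars$disp[datasets::mtcars$cyl == 6])
print(six)

# Median of the group; with an odd number of values this is the middle one
med6 <- median(six)
print(med6)

# Which values are tied with the median (neither strictly above nor below)
print(six[six == med6])
```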

I hope this can be fixed somehow. Thanks again.


2 Answers


One option is to group by 'cyl' and use cut to create the categories from the quantile of the 'disp' column:

library(dplyr)
mtcars %>%
    group_by(cyl) %>% 
    mutate(median_split = c("above_median", "below_median")[1 + 
                  (disp <= median(disp))], 
           quartile_split = cut(disp, breaks = quantile(disp), 
                labels = paste0(1:4, "_quartile")))
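A possible variation on the code above, offered as a sketch rather than a definitive fix: `cut()` uses left-open intervals by default, so the group minimum falls outside the first bin unless `include.lowest = TRUE` is passed. A three-way label built with `sign()` can also mark values that are exactly equal to the group median, as in the desired output of the question. The data frame name `dat`, the `car` column and the `result` object are illustrative assumptions, and the exact comparison with the median assumes the values are not affected by floating-point rounding.

```r
library(dplyr)

# Keep the car names as a column, since group_by() returns a tibble without row names
dat <- data.frame(car = rownames(datasets::mtcars),
                  datasets::mtcars[c("cyl", "disp")],
                  row.names = NULL)

result <- dat %>%
  group_by(cyl) %>%
  mutate(
    # sign() yields -1, 0 or 1, so adding 2 indexes the three labels
    median_split = c("below_median", "median", "above_median")[
      sign(disp - median(disp)) + 2],
    # include.lowest = TRUE closes the first interval on the left,
    # so the group minimum lands in the 1st quartile instead of NA
    quartile_split = cut(disp, breaks = quantile(disp),
                         labels = paste0(1:4, "_quartile"),
                         include.lowest = TRUE)
  ) %>%
  ungroup()

print(result[result$cyl == 6, ])
```

Unlike the desired table in the question, this sketch does not put a separate "median" label in the quartile column; a value equal to the median stays in the second quartile bin.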
answered 2019-07-24T21:24:34.150

Using base R and cut:

mtcars <- subset(mtcars, select = c("cyl", "disp"))
mtcars$median_split <- ifelse(mtcars$disp <= median(mtcars$disp), "below_median","above_median")
mtcars$quantile_split <- cut(mtcars$disp, breaks = c(0, quantile(mtcars$disp)),labels = c("1_quartile",paste0(1:4, "_quartile")))

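Be careful when using the cut function: make sure the breaks include the minimum value (otherwise it returns NA), and that the minimum is labelled as the first quartile.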
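The code above computes the median and quartiles over the whole dataset. If the splits are wanted within each cyl group, as the question asks, one base R approach is `ave()`, which applies a function per group and returns the results in the original row order. The following is a sketch under that assumption; `dat`, `grp_median` and `bin` are illustrative names, and `include.lowest = TRUE` is used here in place of the extra 0 break to keep the group minimum in the first bin.

```r
dat <- datasets::mtcars[c("cyl", "disp")]

# Median of disp within each cyl group, aligned with the rows of dat
grp_median <- ave(dat$disp, dat$cyl, FUN = median)

# -1 / 0 / 1 from sign() mapped onto the three labels
dat$median_split <- c("below_median", "median", "above_median")[
  sign(dat$disp - grp_median) + 2]

# Quartile bin number (1 to 4) within each group; labels = FALSE returns
# integer codes, and include.lowest = TRUE keeps the minimum out of NA
bin <- ave(dat$disp, dat$cyl, FUN = function(x)
  cut(x, breaks = quantile(x), labels = FALSE, include.lowest = TRUE))

dat$quartile_split <- factor(paste0(bin, "_quartile"),
                             levels = paste0(1:4, "_quartile"))

print(dat[order(dat$cyl, dat$disp), ])
```

Note that `cut()` requires unique breaks, so this would fail for a group whose quantiles coincide; that does not happen for the three cyl groups in mtcars.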

answered 2019-07-24T21:47:22.037