Taking mtcars as an example:
mtcars <- subset(mtcars, select = c("cyl", "disp"))
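How can I add two more columns, one indicating whether each value is below or above the median, and another indicating which quartile the value falls in? Both should be computed within each cyl group, not over the whole data set.
This is the specific result I am hoping for: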
                cyl  disp   median_split  quartile_split
Toyota Corolla  4    71.1   below_median  1st_quartile
Honda Civic     4    75.7   below_median  1st_quartile
Fiat 128        4    78.7   below_median  1st_quartile
Fiat X1-9       4    79     below_median  2nd_quartile
Lotus Europa    4    95.1   below_median  2nd_quartile
Datsun 710      4    108    median        median
Toyota Corona   4    120.1  above_median  3rd_quartile
Porsche 914-2   4    120.3  above_median  3rd_quartile
Volvo 142E      4    121    above_median  4th_quartile
Merc 230        4    140.8  above_median  4th_quartile
Merc 240D       4    146.7  above_median  4th_quartile
Ferrari Dino    6    145    below_median  1st_quartile
Mazda RX4       6    160    etc…          etc…
I would appreciate any help. Thank you.
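For reference, one possible way to produce columns like these is sketched below. It is only a sketch under some assumptions: it uses dplyr, reads the untouched data through datasets::mtcars, and introduces the names cars, model, and result purely for illustration. The car names are copied into a model column because row names are generally lost once the data is grouped. Values equal to the group median get their own "median" label in median_split, while quartile_split keeps a plain quartile label for them.

```r
library(dplyr)

# Start from the untouched built-in data set (assumes dplyr is installed).
cars <- subset(datasets::mtcars, select = c("cyl", "disp"))
cars$model <- rownames(cars)  # hypothetical helper column that keeps the car names

result <- cars %>%
  group_by(cyl) %>%
  mutate(
    # Three-way comparison against the median of the current cyl group.
    median_split = case_when(
      disp < median(disp) ~ "below_median",
      disp > median(disp) ~ "above_median",
      TRUE                ~ "median"
    ),
    # include.lowest = TRUE closes the first interval on the left,
    # so the group minimum is assigned to the first quartile.
    quartile_split = as.character(cut(
      disp,
      breaks = quantile(disp),
      labels = c("1st_quartile", "2nd_quartile", "3rd_quartile", "4th_quartile"),
      include.lowest = TRUE
    ))
  ) %>%
  ungroup() %>%
  arrange(cyl, disp)

print(head(result, 12))
```

Note that cut() stops with an error when the quantile breaks are not unique, which can happen in groups with many tied values; that does not occur for the three cyl groups in mtcars.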
Edit, following akrun's answer below:
In the quartile_split column, akrun's answer leaves the lowest value in each cyl group as NA. I think I can work around this by adding:
mtcars$quartile_split[is.na(mtcars$quartile_split)] <- "1_quartile" #not a very elegant solution
So the complete code is:
library(dplyr)
mtcars <- subset(mtcars, select = c("cyl", "disp"))
# akrun's answer
mtcars <- mtcars %>%
  group_by(cyl) %>%
  mutate(median_split = c("above_median", "below_median")[1 + (disp <= median(disp))],
         quartile_split = cut(disp, breaks = quantile(disp),
                              labels = paste0(1:4, "_quartile")))
# addition
mtcars$quartile_split[is.na(mtcars$quartile_split)] <- "1_quartile" #not a very elegant solution
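The NA values appear to come from how cut() builds its intervals: by default they are open on the left, so the smallest value, which equals the first break, falls outside the first interval. A minimal base R sketch of the difference, using the seven disp values of the cyl = 6 group (the names x, default_cut, and closed_cut are only illustrative):

```r
# disp values of the cyl = 6 group, sorted
x <- c(145, 160, 160, 167.6, 167.6, 225, 258)
quartile_labels <- paste0(1:4, "_quartile")

# Default: intervals are (a, b], so the minimum (145) matches no interval.
default_cut <- cut(x, breaks = quantile(x), labels = quartile_labels)

# include.lowest = TRUE makes the first interval [a, b].
closed_cut <- cut(x, breaks = quantile(x), labels = quartile_labels,
                  include.lowest = TRUE)

print(data.frame(x, default_cut, closed_cut))
# default_cut: NA, 1_quartile, 1_quartile, 2_quartile, 2_quartile, 4_quartile, 4_quartile
# closed_cut:  1_quartile for 145, otherwise identical
```

Passing include.lowest = TRUE inside the mutate() call would make the separate is.na() replacement unnecessary.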
However, looking more closely, I also found something else that does not look quite right. Specifically, when you look only at the cyl = 6 group, you see:
cyl disp median_split quartile_split
6 145 below_median 1_quartile
6 160 below_median 1_quartile
6 160 below_median 1_quartile
6 167.6 below_median 2_quartile
6 167.6 below_median 2_quartile
6 225 above_median 4_quartile
6 258 above_median 4_quartile
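The median disp for this group is 167.6, the fourth of the seven sorted values (163.8 is only the midpoint between 160 and 167.6), so the two cars with disp = 167.6 sit exactly on the median. Following my desired output above, I would expect them to be labelled "median" rather than "below_median".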
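A quick way to check the group medians directly, and to label values that equal the median separately, is sketched below in base R. It reads datasets::mtcars so that it does not depend on the modified copy above; label_vs_median is a hypothetical helper name, not part of any package. With an odd number of rows, median() returns the middle sorted value itself.

```r
# Median of disp within each cyl group, from the untouched data set.
disp_by_cyl <- split(datasets::mtcars$disp, datasets::mtcars$cyl)
group_medians <- sapply(disp_by_cyl, median)
print(group_medians)
# cyl 4: 108, cyl 6: 167.6, cyl 8: 350.5

# Hypothetical helper: sign() returns -1, 0 or 1, which indexes the three labels.
label_vs_median <- function(x) {
  c("below_median", "median", "above_median")[sign(x - median(x)) + 2]
}

disp6 <- sort(disp_by_cyl[["6"]])
labels6 <- label_vs_median(disp6)
print(data.frame(disp = disp6, median_split = labels6))
# 145, 160, 160 -> below_median; 167.6, 167.6 -> median; 225, 258 -> above_median
```

The same helper could be called inside mutate() after group_by(cyl) in place of the two-label indexing trick.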
I hope this can be solved somehow. Thanks again.