I want to oversample so that my binary dependent variable is balanced within each group of my data set.
My data look like this:
library(dplyr)
library(purrr)
library(tidyr)
set.seed(123)
# example data
(data <- tibble(
  country = c("France", "France", "France",
              "UK", "UK", "UK", "UK", "UK", "UK"),
  YES = c(0, 0, 1,
          0, 0, 0, 0, 1, 1),
  X = rnorm(9, 0, 1)
))
# A tibble: 9 x 3
country YES X
<chr> <dbl> <dbl>
1 France 0 -1.12
2 France 0 -0.200
3 France 1 0.781
4 UK 0 0.100
5 UK 0 0.0997
6 UK 0 -0.380
7 UK 0 -0.0160
8 UK 1 -0.0265
9 UK 1 0.860
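I am trying to balance YES within France and within the UK by oversampling. In France I would like to end up with 4 observations and in the UK with 8, so that one random sample could look like this: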
# A tibble: 12 x 3
country YES X
<chr> <dbl> <dbl>
1 France 0 -1.12
2 France 0 -0.200
3 France 1 0.781
3 France 1 0.781
4 UK 0 0.100
5 UK 0 0.0997
6 UK 0 -0.380
7 UK 0 -0.0160
8 UK 1 -0.0265
9 UK 1 0.860
8 UK 1 -0.0265
8 UK 1 -0.0265
My approach so far is this:
# oversample 1's within each country
(n_data <- data %>%
  group_by(country) %>%
  nest(.key = "original") %>%
  mutate(os = map(original, ~ group_by(., YES))) %>%
  mutate(os = map(os, ~ slice_sample(., replace = TRUE, prop = 1))))
# A tibble: 2 x 3
# Groups: country [2]
country original os
<chr> <list> <list>
1 France <tibble [3 x 2]> <tibble [3 x 2]>
2 UK <tibble [6 x 2]> <tibble [6 x 2]>
Warning message:
`.key` is deprecated
So in the os column, the dimensions should be 4 x 2 and 8 x 2, not 3 x 2 and 6 x 2. Does anyone know how to do this?
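For reference, here is a minimal sketch of one way the target sizes could be reached. It assumes that "balanced" means resampling the minority class with replacement until it matches the majority class count within the same country, while keeping every original row. The names `balanced`, `n_class`, and `n_target` are only illustrative, and the result is a flat tibble rather than a nested `os` column.

```r
library(dplyr)

set.seed(123)

# Same shape as the example data above
data <- tibble(
  country = c(rep("France", 3), rep("UK", 6)),
  YES = c(0, 0, 1, 0, 0, 0, 0, 1, 1),
  X = rnorm(9)
)

balanced <- data %>%
  # rows per (country, YES) class
  add_count(country, YES, name = "n_class") %>%
  # size of the largest class within each country (assumed target size)
  group_by(country) %>%
  mutate(n_target = max(n_class)) %>%
  # within each class: keep all rows, then draw the missing ones with replacement
  group_by(country, YES) %>%
  group_modify(~ {
    extra <- .x$n_target[1] - nrow(.x)
    idx <- c(seq_len(nrow(.x)), sample.int(nrow(.x), extra, replace = TRUE))
    .x[idx, ]
  }) %>%
  ungroup() %>%
  select(country, YES, X)

# Class sizes per country after oversampling
count(balanced, country, YES)
```

The majority class gets `extra = 0`, so it is returned unchanged. With the example data this should give 4 rows for France and 8 for the UK. If the nested layout is needed, the same per-class step could presumably be applied inside `map()` on each nested tibble.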