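I'm new to tidymodels and also fairly new to R. I am trying to replicate David Robinson's code from the YouTube tidytuesday/Sliced customer churn data screencast, but I'm running into a problem when applying recipe changes to the cross-validation data/resamples.
Problem: when I run step_mutate() on the training data it works, but when I apply the same recipe to the cross-validated data, train_5fold, it gives this error: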
Error: All of the models failed. See the .notes column.
To reproduce the problem, download the data with the following code:
train <- read.csv(url("https://raw.githubusercontent.com/johnsnow09/covid19-df_stack-code/main/train_object.csv"))
The cross-validation resamples, train_5fold, can be downloaded from: https://github.com/johnsnow09/covid19-df_stack-code/blob/main/train_5fold.RDS
train_5fold <- readRDS("train_5fold.RDS")
Code:
library(tidyverse)
library(tidymodels)

# Evaluate models with mean log loss
mset <- metric_set(mn_log_loss)

control <- control_grid(save_workflow = TRUE,
                        save_pred = TRUE,
                        extract = extract_model)

xg_spec <- parsnip::boost_tree(
  trees = tune(),
  mtry = tune(),
  learn_rate = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

# Recode "Unknown" as NA; all other values go through as.integer()
factor_to_ordinal <- function(x){
  ifelse(x == "Unknown", NA, as.integer(x))
}

xg_rec_4 <- recipe(churned ~ ., data = train) %>%
  update_role(id, new_role = "ID") %>%
  step_mutate(income_category = factor_to_ordinal(income_category),
              education_level = factor_to_ordinal(education_level)) %>%
  step_impute_mean(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

xg_wf_4 <- workflow() %>%
  add_recipe(xg_rec_4) %>%
  add_model(xg_spec)

xg_res_4 <- xg_wf_4 %>%
  tune_grid(
    resamples = train_5fold,
    metrics = mset,
    control = control,
    grid = crossing(trees = seq(200, 800, 20),
                    mtry = c(2, 4, 6, 8, 10),
                    learn_rate = c(0.02))
  )

autoplot(xg_res_4)
Error: All of the models failed. See the .notes column.
In .notes I get:
.notes
<chr>
preprocessor 1/1: Error: Problem with `mutate()` column `income_category`.\ni `income_category = factor_to_ordinal(income_category)`.\nx could not find function "factor_to_ordinal"
Cross-check:
xg_rec_4 %>% prep() %>% juice()
# A tibble: 5,316 x 15
id customer_age education_level income_category total_relationship~ months_inactive_1~ credit_limit
<dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>
1 9168 46 3 5 3 3 2171
2 2187 51 4 4 3 1 11373
3 5659 48 3 4 4 2 14322
4 447 57 6 2 5 3 12291
5 6342 39 4 5 5 2 1862
6 496 56 6 5 4 3 3219
7 7064 33 4 1 6 3 27499
8 3978 48 4 4 1 2 34516
9 13 41 4 5 4 3 2372
10 8242 46 3 2 4 3 3115
# ... with 5,306 more rows, and 8 more variables: total_revolving_bal <dbl>, total_amt_chng_q4_q1 <dbl>,
# total_trans_amt <dbl>, total_trans_ct <dbl>, total_ct_chng_q4_q1 <dbl>, avg_utilization_ratio <dbl>,
# churned <fct>, gender_M <dbl>
colSums(xg_rec_4 %>% prep() %>% juice() %>% select_if(is.numeric) %>% is.na())
id customer_age education_level income_category
0 0 0 0
total_relationship_count months_inactive_12_mon credit_limit total_revolving_bal
0 0 0 0
total_amt_chng_q4_q1 total_trans_amt total_trans_ct total_ct_chng_q4_q1
0 0 0 0
avg_utilization_ratio gender_M
0 0
Whereas in the video it works for David Robinson: