
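As you can see from my code, I am trying to include feature selection in my tidymodels workflow. I am using some Kaggle data to predict customer churn.

To apply the same preprocessing to the training and test data, I bake the recipe after calling prep().

However, if I want to tune the top_p argument of step_select_roc(), I don't know how to prep() the recipe afterwards. Calling it as in my reprex below results in an error.

Perhaps I have to adjust my workflow and separate some of the recipe tasks to get this done. What is the best way to achieve this?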

#### LIBS

suppressPackageStartupMessages(library(tidymodels))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(themis))
suppressPackageStartupMessages(library(recipeselectors))


#### INPUT

# get dataset from: https://www.kaggle.com/shrutimechlearn/churn-modelling
data <- fread("Churn_Modelling.csv")


# split data
set.seed(seed = 1972) 
train_test_split <-
  rsample::initial_split(
    data = data,     
    prop = 0.80   
  ) 
train_tbl <- train_test_split %>% training() 
test_tbl  <- train_test_split %>% testing() 


#### FEATURE ENGINEERING

# Define the recipe
recipe <- recipe(Exited ~ ., data = train_tbl) %>%
  step_rm(one_of("RowNumber", "Surname")) %>%
  update_role(CustomerId, new_role = "Helper") %>%
  step_num2factor(all_outcomes(),
                  levels = c("No", "Yes"),
                  transform = function(x) {x + 1}) %>%
  step_normalize(all_numeric(), -has_role(match = "Helper")) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_corr(all_numeric(), -has_role("Helper")) %>%
  step_nzv(all_predictors()) %>%
  step_select_roc(all_predictors(), outcome = "Exited", top_p = tune()) %>%  
  prep()


# Bake it
train_baked <- recipe %>%  bake(train_tbl)
test_baked <- recipe %>% bake(test_tbl) 

2 Answers


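You can't prep() a recipe that has tunable parameters. Think of prep() as the analogue of fit() for a model: you can't fit a model if you haven't set its hyperparameters, and in the same way you can't prep a recipe while any of its arguments is still a tune() placeholder.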

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step

rec <- recipe( ~ ., data = USArrests) %>%
  step_normalize(all_numeric()) %>%
  step_pca(all_numeric(), num_comp = tune::tune())

prep(rec, training = USArrests)
#> Error in `prep()`:
#> ! You cannot `prep()` a tuneable recipe. Argument(s) with `tune()`: 'num_comp'. Do you want to use a tuning function such as `tune_grid()`?

Created on 2022-02-22 by the reprex package (v2.0.1)
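Once a value has been chosen for the tunable argument, the recipe can be finalized and then prepped as usual. The following is a minimal sketch of that idea, reusing the USArrests recipe from the reprex above; the value num_comp = 2 is an arbitrary placeholder chosen for illustration, not the result of any tuning run. In practice it would come from something like select_best() on tuning results.

```r
library(recipes)
library(tune)

# Same tunable recipe as in the reprex above
rec <- recipe(~ ., data = USArrests) %>%
  step_normalize(all_numeric()) %>%
  step_pca(all_numeric(), num_comp = tune())

# Assumption: num_comp = 2 is an arbitrary illustrative value, not a tuned result.
# The column name must match the id of the tune() placeholder ("num_comp" by default).
final_rec <- finalize_recipe(rec, tibble::tibble(num_comp = 2))

# With no tune() placeholders left, prep() and bake() work as usual
prepped <- prep(final_rec, training = USArrests)
baked <- bake(prepped, new_data = NULL)
names(baked)
```

The same pattern applies to any recipe step with a tune() placeholder: finalize first, then prep and bake.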

answered 2022-02-23T00:58:28.420

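Thanks to Steven Pawley's help, I was able to integrate the tunable step_select_roc() parameter into my tidymodels workflow. As Julia Silge mentioned, it is not possible to prep a recipe that has tunable parameters. So if you still want to prep and bake your recipe, you can only do so after finalizing the model and the recipe, as in the following example: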

suppressPackageStartupMessages(library(tidymodels))
suppressPackageStartupMessages(library(doParallel))
suppressPackageStartupMessages(library(recipeselectors))
suppressPackageStartupMessages(library(finetune))

data(cells, package = "modeldata")

cells <- cells %>% select(-case)
set.seed(31)
split <- initial_split(cells, prop = 0.8)
train <- training(split)
test <- testing(split)

rec <-
    recipe(class ~ ., data = train) %>%
    step_corr(all_predictors(), threshold = 0.9) %>% 
    step_select_roc(all_predictors(), outcome = "class", top_p = tune())

# xgboost model
xgb_spec <- boost_tree(
    trees = tune(), 
    tree_depth = tune(), min_n = tune(), 
    loss_reduction = tune(),                    
    sample_size = tune(), mtry = tune(),         
    learn_rate = tune(),                        
    stop_iter = tune()
) %>% 
    set_engine("xgboost") %>% 
    set_mode("classification")

# grid
xgb_grid <- grid_latin_hypercube(
    trees(),
    tree_depth(),
    min_n(),
    loss_reduction(),
    sample_size = sample_prop(),
    finalize(mtry(), train),
    learn_rate(),
    stop_iter(range = c(5L,50L)),
    size = 5
)

rec_grid <- grid_latin_hypercube(
    parameters(rec) %>% 
        update(top_p = top_p(c(0,30))) ,
    size = 5
)

comp_grid <- merge(xgb_grid, rec_grid)

model_metrics <- metric_set(roc_auc)  


# resample the training set only, so the held-out test rows never influence tuning
rs <- vfold_cv(train)

ctrl <- control_grid(pkgs = "recipeselectors")

cores <- parallel::detectCores(logical = FALSE)
cl <- makePSOCKcluster(cores)
registerDoParallel(cl)
set.seed(234)
rfe_res <-
    xgb_spec %>% 
    tune_grid(
        preprocessor = rec,
        resamples = rs,
        grid = comp_grid,
        metrics = model_metrics,  # use the metric set defined above
        control = ctrl
    )
stopCluster(cl)


# name the metric argument explicitly; recent versions of tune require this
best <- rfe_res %>% select_best(metric = "roc_auc")

# finalize
final_mod <- finalize_model(xgb_spec, best)
final_rec <- finalize_recipe(rec, best)

# bakery: prep the finalized recipe once, then bake both data sets
final_prep <- final_rec %>% prep()
bake_test <- final_prep %>% bake(new_data = testing(split))
bake_train <- final_prep %>% bake(new_data = training(split))
answered 2022-02-23T09:59:22.223