r - Why do "id variables" in tidymodels/recipes act as predictors?

Question

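This is the same problem as "Predict with step_naomit and retain ID using tidymodels". That question has an accepted answer, but the OP's last comment points out that the "id variable" is still being used as a predictor, as can be seen by looking at model$fit$variable.importance.

I have a dataset with "id variables" that I want to keep. I thought I could achieve this with the recipe() specification.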

library(tidymodels)

# label is an identifier variable I want to keep even though it's not
# a predictor
df <- tibble(label = 1:50, 
             x = rnorm(50, 0, 5), 
             f = factor(sample(c('a', 'b', 'c'), 50, replace = TRUE)),
             y = factor(sample(c('Y', 'N'), 50, replace = TRUE)) )

df_split <- initial_split(df, prop = 0.70)

# Make up any recipe: just note I specify 'label' as "id variable"
rec <- recipe(training(df_split)) %>% 
  update_role(label, new_role = "id variable") %>% 
  update_role(y, new_role = "outcome") %>% 
  update_role(x, new_role = "predictor") %>% 
  update_role(f, new_role = "predictor") %>% 
  step_corr(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_predictors(),-all_numeric()) %>% 
  step_meanimpute(all_numeric(), -all_outcomes())

train_juiced <- prep(rec, training(df_split)) %>% juice()

logit_fit <- logistic_reg(mode = "classification") %>%
  set_engine(engine = "glm") %>% 
  fit(y ~ ., data = train_juiced)

# Why is label a variable in the model ?
logit_fit[['fit']][['coefficients']]
#> (Intercept)       label           x         f_b         f_c 
#>  1.03664140 -0.01405316  0.22357266 -1.80701531 -1.66285399

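Created on 2020-01-27 by the reprex package (v0.3.0)

Even though I specified label as an id variable, it is still used as a predictor. So maybe I can put only the terms I want in the formula and then add label separately as an id variable.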

rec <- recipe(training(df_split), y ~ x + f) %>% 
  update_role(label, new_role = "id variable") %>% 
  step_corr(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_predictors(),-all_numeric()) %>% 
  step_meanimpute(all_numeric(), -all_outcomes())
#> Error in .f(.x[[i]], ...): object 'label' not found

Created on 2020-01-27 by the reprex package (v0.3.0)

I can try not mentioning label at all:

rec <- recipe(training(df_split), y ~ x + f) %>% 
  step_corr(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_predictors(),-all_numeric()) %>% 
  step_meanimpute(all_numeric(), -all_outcomes())


train_juiced <- prep(rec, training(df_split)) %>% juice()

logit_fit <- logistic_reg(mode = "classification") %>%
  set_engine(engine = "glm") %>% 
  fit(y ~ ., data = train_juiced)

# label is no longer in the model, but it is also gone from the data
logit_fit[['fit']][['coefficients']]
#> (Intercept)           x         f_b         f_c 
#> -0.98950228  0.03734093  0.98945339  1.27014824

train_juiced
#> # A tibble: 35 x 4
#>          x y       f_b   f_c
#>      <dbl> <fct> <dbl> <dbl>
#>  1 -0.928  Y         1     0
#>  2  4.54   N         0     0
#>  3 -1.14   N         1     0
#>  4 -5.19   N         1     0
#>  5 -4.79   N         0     0
#>  6 -6.00   N         0     0
#>  7  3.83   N         0     1
#>  8 -8.66   Y         1     0
#>  9 -0.0849 Y         1     0
#> 10 -3.57   Y         0     1
#> # ... with 25 more rows

Created on 2020-01-27 by the reprex package (v0.3.0)

OK, so the model works, but I have lost my label.
How should I do this?

score 10 · Accepted Answer

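The main conceptual problem you are running into is that once you juice() a recipe, the result is just data, i.e. literally just a data frame. When you use it to fit a model, the model has no way of knowing that some of the variables had special roles.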
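To see why label ends up in the model, it may help to look at how a y ~ . formula is expanded against a plain data frame. The sketch below uses only base R and a small made-up data frame (the name toy and its values are illustrative, not the question's data). The dot expands to every column other than the outcome, and nothing in a data frame records a recipe role.

```r
# Small made-up data frame; the values are illustrative only
toy <- data.frame(
  label = 1:6,
  x = c(0.5, -1.2, 2.0, 0.3, -0.7, 1.1),
  y = factor(c("Y", "N", "Y", "N", "N", "Y"))
)

# Expand the dot in `y ~ .` against the columns of `toy`
predictors <- attr(terms(y ~ ., data = toy), "term.labels")
print(predictors)
```

Here label is listed next to x, which is the same thing that happens when fit(y ~ ., data = train_juiced) is called on the juiced data.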

library(tidymodels)

# label is an identifier variable to keep even though it's not a predictor
df <- tibble(label = 1:50, 
             x = rnorm(50, 0, 5), 
             f = factor(sample(c('a', 'b', 'c'), 50, replace = TRUE)),
             y = factor(sample(c('Y', 'N'), 50, replace = TRUE)) )

df_split <- initial_split(df, prop = 0.70)

rec <- recipe(y ~ ., training(df_split)) %>% 
  update_role(label, new_role = "id variable") %>% 
  step_corr(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_predictors(),-all_numeric()) %>% 
  step_meanimpute(all_numeric(), -all_outcomes()) %>%
  prep()

train_juiced <- juice(rec)
train_juiced
#> # A tibble: 35 x 5
#>    label     x y       f_b   f_c
#>    <int> <dbl> <fct> <dbl> <dbl>
#>  1     1  1.80 N         1     0
#>  2     3  1.45 N         0     0
#>  3     5 -5.00 N         0     0
#>  4     6 -4.15 N         1     0
#>  5     7  1.37 Y         0     1
#>  6     8  1.62 Y         0     1
#>  7    10 -1.77 Y         1     0
#>  8    11 -3.15 N         0     1
#>  9    12 -2.02 Y         0     1
#> 10    13  2.65 Y         0     1
#> # … with 25 more rows

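Notice that train_juiced is just a regular tibble. If you train a model on this tibble using fit(), it will not know anything about the recipe that was used to transform the data.

The tidymodels framework does have a way to train models using the role information from a recipe. Probably the easiest way to do this is with workflows.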
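One way to check this is to compare what the recipe object knows with what comes out of juice(). The following is a minimal sketch on made-up data (toy, rec_toy, role_info and juiced are names chosen here for illustration). summary() on a recipe reports the role of each variable, while the juiced result carries only the columns themselves.

```r
library(recipes)

set.seed(123)
# Made-up data; `label` plays the part of the identifier column
toy <- tibble::tibble(
  label = 1:20,
  x = rnorm(20),
  y = factor(rep(c("Y", "N"), 10))
)

rec_toy <- recipe(y ~ ., data = toy) %>%
  update_role(label, new_role = "id variable") %>%
  prep()

# The roles live in the recipe object
role_info <- summary(rec_toy)
print(role_info[, c("variable", "role")])

# The juiced result is an ordinary tibble: columns only, no roles
juiced <- juice(rec_toy)
print(class(juiced))
print(names(juiced))
```

The label column is still present after juicing, which is why a y ~ . formula picks it up.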

logit_spec <- logistic_reg(mode = "classification") %>%
  set_engine(engine = "glm") 

wf <- workflow() %>%
  add_model(logit_spec) %>%
  add_recipe(rec)

logit_fit <- fit(wf, training(df_split))

# No more label in the model
logit_fit
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: logistic_reg()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> 3 Recipe Steps
#> 
#> ● step_corr()
#> ● step_dummy()
#> ● step_meanimpute()
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> 
#> Call:  stats::glm(formula = formula, family = stats::binomial, data = data)
#> 
#> Coefficients:
#> (Intercept)            x          f_b          f_c  
#>     0.42331     -0.04234     -0.04991      0.64728  
#> 
#> Degrees of Freedom: 34 Total (i.e. Null);  31 Residual
#> Null Deviance:       45 
#> Residual Deviance: 44.41     AIC: 52.41

Created on 2020-02-15 by the reprex package (v0.3.0)

No more label in the model!
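For readers on a more recent tidymodels installation, here is a hedged sketch of the same idea. It assumes a version where step_meanimpute() has been renamed step_impute_mean() and where extract_fit_engine() is available. It also leaves the recipe unprepped, because newer workflows releases may refuse a recipe that has already been through prep(). The names rec_unprepped, wf_fit, coef_names and preds are introduced here for illustration only. The last step shows one possible way to carry label alongside the predictions.

```r
library(tidymodels)

set.seed(2020)
df <- tibble(
  label = 1:50,
  x = rnorm(50, 0, 5),
  f = factor(sample(c("a", "b", "c"), 50, replace = TRUE)),
  y = factor(sample(c("Y", "N"), 50, replace = TRUE))
)
df_split <- initial_split(df, prop = 0.70)

# Same steps as above, but without prep(): the workflow trains the recipe
rec_unprepped <- recipe(y ~ ., data = training(df_split)) %>%
  update_role(label, new_role = "id variable") %>%
  step_corr(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_predictors(), -all_numeric()) %>%
  step_impute_mean(all_numeric(), -all_outcomes())

wf_fit <- workflow() %>%
  add_model(logistic_reg() %>% set_engine("glm")) %>%
  add_recipe(rec_unprepped) %>%
  fit(data = training(df_split))

# Coefficient names of the underlying glm; label should not be among them
coef_names <- names(coef(extract_fit_engine(wf_fit)))
print(coef_names)

# Keep the identifier next to the predictions for the held-out rows
preds <- bind_cols(
  select(testing(df_split), label),
  predict(wf_fit, new_data = testing(df_split))
)
print(head(preds))
```

Because the workflow only passes predictors and outcomes to the model, label stays in the data for bookkeeping without entering the fit.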
