r - tidymodels - predict() and fit() give different model performance results when applied to the same dataset

Question

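I am currently using the tidymodels framework and struggling to understand some differences in the model predictions and performance results I get, specifically when I use both fit and predict on the exact same dataset (i.e. the dataset the model was trained on).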

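Below is a reproducible example - I am using the cells dataset and training a random forest on the data (rf_fit). The object rf_fit$fit$predictions is one of the two sets of predictions whose accuracy I assess. I then use rf_fit to make predictions on the same data via the predict function (which yields rf_training_pred, the other set of predictions whose accuracy I assess).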

My question is - why are these two sets of predictions different from each other? And why are they so different?

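I presume something must be happening under the hood that I'm not aware of, but I expected these to be identical. I assumed that fit() trains a model (with some predictions associated with that trained model) and that predict() then takes that exact model and re-applies it to (in this case) the same data - so the predictions from both should be identical.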

What am I missing? Any suggestions or help in understanding would be hugely appreciated - thanks in advance!

# Load required libraries 
library(tidymodels); library(modeldata) 
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip

# Set seed 
set.seed(123)

# Load the data (the full dataset is used; no train/test split is made)
data(cells, package = "modeldata")

# Define Model
rf_mod <- rand_forest(trees = 1000) %>% 
  set_engine("ranger") %>% 
  set_mode("classification")

# Fit the model to training data and then predict on same training data
rf_fit <- rf_mod %>% 
  fit(class ~ ., data = cells)
rf_training_pred <- rf_fit %>%
  predict(cells, type = "prob")

# Evaluate ROC AUC for each set of predictions
data.frame(rf_fit$fit$predictions) %>%
  bind_cols(cells %>% select(class)) %>%
  roc_auc(truth = class, PS)
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary         0.903

rf_training_pred %>%   
  bind_cols(cells %>% select(class)) %>%
  roc_auc(truth = class, .pred_PS)
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary          1.00

Created on 2021-09-25 by the reprex package (v2.0.1)

score 0 · Accepted Answer

First, look at the documentation for what ranger::ranger() returns, especially predictions:

Predicted classes/values, based on out of bag samples (classification and regression only).

This is not the same as what you get when you predict with the final, whole fitted model.
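To see the distinction in isolation, here is a minimal sketch that calls ranger directly rather than through parsnip. It assumes the built-in iris data as a stand-in for cells, and it sets probability = TRUE explicitly so that both results are class-probability matrices. All variable names are illustrative.

```r
library(ranger)

set.seed(123)

# Probability forest on a built-in dataset (iris is only a stand-in for cells)
rf <- ranger(Species ~ ., data = iris, num.trees = 500, probability = TRUE)

# Stored at fit time: each row is scored only by trees that did not train on it
oob_pred <- rf$predictions

# Computed on demand: every tree scores every row, including rows it trained on
full_pred <- predict(rf, data = iris)$predictions

# Turn probabilities into hard classes so the two sets can be compared
oob_class  <- colnames(oob_pred)[max.col(oob_pred, ties.method = "first")]
full_class <- colnames(full_pred)[max.col(full_pred, ties.method = "first")]

mean(oob_class == iris$Species)   # out-of-bag accuracy
mean(full_class == iris$Species)  # accuracy when re-predicting the training rows
```

The first number comes only from trees that never saw the row in question, so it behaves like a validation estimate. The second lets every tree score rows it was trained on, which is why the ROC AUC obtained from predict() in the question is so close to 1.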

Second, when you do predict with the final model, you get the same result whether you predict on the tidymodels object or on the underlying ranger object.

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(modeldata) 

data(cells, package = "modeldata")

cells <- cells %>% select(-case)

# Define Model
rf_mod <- rand_forest(trees = 1000) %>% 
  set_engine("ranger") %>% 
  set_mode("classification")

# Fit the model to training data and then predict on same training data
rf_fit <- rf_mod %>% 
  fit(class ~ ., data = cells)

tidymodels_results <- predict(rf_fit, cells, type = "prob")
tidymodels_results
#> # A tibble: 2,019 × 2
#>    .pred_PS .pred_WS
#>       <dbl>    <dbl>
#>  1   0.929    0.0706
#>  2   0.764    0.236 
#>  3   0.222    0.778 
#>  4   0.920    0.0796
#>  5   0.961    0.0386
#>  6   0.0486   0.951 
#>  7   0.101    0.899 
#>  8   0.954    0.0462
#>  9   0.293    0.707 
#> 10   0.405    0.595 
#> # … with 2,009 more rows

ranger_results <- predict(rf_fit$fit, cells, type = "response")
as_tibble(ranger_results$predictions)
#> # A tibble: 2,019 × 2
#>        PS     WS
#>     <dbl>  <dbl>
#>  1 0.929  0.0706
#>  2 0.764  0.236 
#>  3 0.222  0.778 
#>  4 0.920  0.0796
#>  5 0.961  0.0386
#>  6 0.0486 0.951 
#>  7 0.101  0.899 
#>  8 0.954  0.0462
#>  9 0.293  0.707 
#> 10 0.405  0.595 
#> # … with 2,009 more rows

Created on 2021-09-25 by the reprex package (v2.0.1)

Note: this only works because we used very simple preprocessing. As we note here, you generally should not predict on the underlying $fit object.
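A sketch of why this matters, assuming a hypothetical recipe that adds normalization and PCA to the same cells data. The recipe, the choice of five components, and the variable names are illustrative and not part of the original answer. Once preprocessing is involved, the engine-level ranger object is trained on derived columns, so raw data has to go through predict() on the fitted workflow:

```r
library(tidymodels)

data(cells, package = "modeldata")
cells <- cells %>% select(-case)

set.seed(123)

# Hypothetical preprocessing: scale the predictors, then keep 5 principal components
rec <- recipe(class ~ ., data = cells) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), num_comp = 5)

rf_mod <- rand_forest(trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("classification")

wf_fit <- workflow() %>%
  add_recipe(rec) %>%
  add_model(rf_mod) %>%
  fit(data = cells)

# Predicting on the workflow applies the trained recipe before calling ranger
preds <- predict(wf_fit, new_data = cells, type = "prob")

# The engine-level forest only knows the derived columns, not the raw ones
engine_vars <- extract_fit_engine(wf_fit)$forest$independent.variable.names
engine_vars
```

Calling predict() on the extracted ranger object with the raw cells columns would skip the recipe entirely, so the high-level object is the safer entry point.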
