
Question:

What factors could cause prediction intervals to have wider than expected coverage? Particularly with quantile regression forests via the ranger package?

Specific context + REPREX:

I am using quantile regression forests with ranger, through the parsnip / tidymodels suite of packages, to generate prediction intervals. I was working through an example using the ames housing data and was surprised to find that, in the example below, my 90% prediction intervals had an empirical coverage of roughly 97% when evaluated on a hold-out dataset (coverage on the training data was even higher).

This was all the more surprising because my model performs considerably worse on the hold-out set than on the training set, so I would have guessed the coverage would come in lower than expected, not higher.

Load libraries and data, set up the split:

```{r}
library(tidyverse)
library(tidymodels)
library(AmesHousing)

ames <- make_ames() %>% 
  mutate(Years_Old = Year_Sold - Year_Built,
         Years_Old = ifelse(Years_Old < 0, 0, Years_Old))

set.seed(4595)
data_split <- initial_split(ames, strata = "Sale_Price", p = 0.75)

ames_train <- training(data_split)
ames_test  <- testing(data_split)
```

Specify the model and workflow:

```{r}
rf_recipe <- 
  recipe(
    Sale_Price ~ Lot_Area + Neighborhood  + Years_Old + Gr_Liv_Area + Overall_Qual + Total_Bsmt_SF + Garage_Area, 
    data = ames_train
  ) %>%
  step_log(Sale_Price, base = 10) %>%
  step_other(Neighborhood, Overall_Qual, threshold = 50) %>% 
  step_novel(Neighborhood, Overall_Qual) %>% 
  step_dummy(Neighborhood, Overall_Qual) 

rf_mod <- rand_forest() %>% 
  set_engine("ranger", importance = "impurity", seed = 63233, quantreg = TRUE) %>% 
  set_mode("regression")

set.seed(63233)
rf_wf <- workflows::workflow() %>% 
  add_model(rf_mod) %>% 
  add_recipe(rf_recipe) %>% 
  fit(ames_train)
```

Make predictions on the training and hold-out datasets:

```{r}
rf_preds_train <- predict(
  rf_wf$fit$fit$fit, 
  workflows::pull_workflow_prepped_recipe(rf_wf) %>% bake(ames_train),
  type = "quantiles",
  quantiles = c(0.05, 0.50, 0.95)
  ) %>% 
  with(predictions) %>% 
  as_tibble() %>% 
  set_names(paste0(".pred", c("_lower", "", "_upper"))) %>% 
  mutate(across(contains(".pred"), ~10^.x)) %>% 
  bind_cols(ames_train)

rf_preds_test <- predict(
  rf_wf$fit$fit$fit, 
  workflows::pull_workflow_prepped_recipe(rf_wf) %>% bake(ames_test),
  type = "quantiles",
  quantiles = c(0.05, 0.50, 0.95)
  ) %>% 
  with(predictions) %>% 
  as_tibble() %>% 
  set_names(paste0(".pred", c("_lower", "", "_upper"))) %>% 
  mutate(across(contains(".pred"), ~10^.x)) %>% 
  bind_cols(ames_test)
```

Show that the coverage on the training and hold-out data is well above the expected 90% (empirically ~98% and ~97%, respectively):

```{r}
rf_preds_train %>%
  mutate(covered = ifelse(Sale_Price >= .pred_lower & Sale_Price <= .pred_upper, 1, 0)) %>% 
  summarise(n = n(),
            n_covered = sum(
              covered
            ),
            covered_prop = n_covered / n,
            stderror = sd(covered) / sqrt(n)) %>% 
  mutate(min_coverage = covered_prop - 2 * stderror,
         max_coverage = covered_prop + 2 * stderror)
# # A tibble: 1 x 6
#       n n_covered covered_prop stderror min_coverage max_coverage
#   <int>     <dbl>        <dbl>    <dbl>        <dbl>        <dbl>
# 1  2199      2159        0.982  0.00285        0.976        0.988

rf_preds_test %>%
  mutate(covered = ifelse(Sale_Price >= .pred_lower & Sale_Price <= .pred_upper, 1, 0)) %>% 
  summarise(n = n(),
            n_covered = sum(
              covered
            ),
            covered_prop = n_covered / n,
            stderror = sd(covered) / sqrt(n)) %>% 
  mutate(min_coverage = covered_prop - 2 * stderror,
         max_coverage = covered_prop + 2 * stderror)
# # A tibble: 1 x 6
#       n n_covered covered_prop stderror min_coverage max_coverage
#   <int>     <dbl>        <dbl>    <dbl>        <dbl>        <dbl>
# 1   731       706        0.966  0.00673        0.952        0.979
```
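For reference, the performance gap mentioned above (the model doing noticeably worse on the hold-out set than on the training set) can be checked with yardstick. This is a minimal sketch, assuming the `rf_preds_train` / `rf_preds_test` tibbles built in the previous chunk:

```{r}
# Rough check of the train vs hold-out gap: RMSE of the median prediction
# (.pred, already back-transformed to dollars) against Sale_Price.
# yardstick is attached via library(tidymodels) above.
rf_preds_train %>% rmse(truth = Sale_Price, estimate = .pred)
rf_preds_test  %>% rmse(truth = Sale_Price, estimate = .pred)
```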

Guesses:

  • Something about the ranger package, or quantile regression forests in general, is extreme in how it estimates the quantiles, or I am somehow overfitting in the "extreme" direction, which would produce these highly conservative prediction intervals (see the sketch after this list for one way to probe this)
  • This is a quirk specific to this dataset/model
  • I am missing something or have set something up incorrectly
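One way to probe the first guess is to check whether the over-coverage shows up across several nominal levels, not just at 90%. A minimal sketch; the `coverage_at` helper and the comparison on the log10 scale are my own additions, not part of the original reprex:

```{r}
# Hypothetical helper: empirical coverage of a central interval at a given
# nominal level, using ranger's quantile predictions on baked data.
coverage_at <- function(level, fit, baked_data, truth) {
  alpha <- (1 - level) / 2
  q <- predict(fit, baked_data, type = "quantiles",
               quantiles = c(alpha, 1 - alpha))$predictions
  mean(truth >= q[, 1] & truth <= q[, 2])
}

baked_test <- workflows::pull_workflow_prepped_recipe(rf_wf) %>% bake(ames_test)

tibble(nominal = c(0.5, 0.8, 0.9, 0.95)) %>%
  mutate(empirical = map_dbl(
    nominal, coverage_at,
    fit = rf_wf$fit$fit$fit,
    baked_data = baked_test,
    # the recipe logged Sale_Price (base 10), so compare on the log10 scale
    truth = log10(ames_test$Sale_Price)
  ))
```

If the empirical column sits well above the nominal column at every level, the conservativeness is systematic rather than a one-off at 90%.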
