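Goal
Create a lasso model with grid search in tidymodels.
Use code similar or identical to Julia Silge's working code, applied to another dataset that contains numeric and factor variables.
Problem
Error message (solved, see Edit)
`x` and `y` must have the same types and lengths
No valid results from LASSO
In every bootstrap resample, the warning reports that the standard deviation is zero ("Standardabweichung ist Null").
Code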
# Data
## Optionally only numeric
data = data[sapply(data, is.numeric)]
# Workflow setup
## Recipe
rec = recipe(PD ~ ., data = data) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_normalize(all_numeric())
## Preparation of the recipe
prep = rec %>% prep()
## Workflow
wf = workflow() %>% add_recipe(rec)
# Lambda grid
lambdas = grid_regular(
  penalty(),
  levels = 20)
# Bootstrap data
boot = bootstraps(data, times = 5)
# Model
# Named `model` so that it matches the add_model(model) call below
model = linear_reg(
  penalty = tune(),
  mixture = 1 # for lasso
) %>% set_engine('glmnet')
# Processing
lasso = tune_grid(
  wf %>% add_model(model),
  resamples = boot,
  grid = lambdas)
Error traceback
- tune_grid(wf %>% add_model(model), resamples = boot, grid = lambdas)
- tune_grid.workflow(wf %>% add_model(model), resamples = boot, grid = lambdas)
- tune_grid_workflow(object, resamples = resamples, grid = grid, metrics = metrics, pset = param_info, control = control)
- rlang::eval_tidy(code_path)
- tune_mod_with_recipe(resamples, grid, object, metrics, control)
- pull_metrics(resamples, results, control)
- pulley(resamples, res, ".metrics")
- full_join(resamples, pull_vals, by = id_cols)
- full_join.tbl_df(resamples, pull_vals, by = id_cols)
- `names<-`(`*tmp*`, value = vars$alias)
- `names<-.rset`(`*tmp*`, value = vars$alias)
- rset_reconstruct(out, x)
- rset_reconstructable(x, to)
- col_equals_splits(to_names)
- vec_equal(x, "splits")
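Notes
When one specific lambda is supplied, the model fits the data and no error occurs.
If the lambdas supplied by the grid cannot be fitted properly, how should the grid be changed?
Using only numeric predictors results in the same error.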
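To make the grid question concrete, the sketch below reconstructs in base R the values a 20-level regular grid would contain, assuming the default `penalty()` range in dials is 10^-10 to 10^0 on the log10 scale. The narrower range `c(-4, -1)` is purely illustrative, not a recommendation; a suitable range depends on the scale of the outcome.

```r
# Base-R reconstruction of a 20-level penalty grid.
# Assumption: the default dials penalty() range is 10^-10 to 10^0.
default_grid <- 10^seq(-10, 0, length.out = 20)

# Hypothetical narrower range (10^-4 to 10^-1), chosen only for illustration.
narrow_grid <- 10^seq(-4, -1, length.out = 20)

print(range(default_grid))
print(range(narrow_grid))

# If dials is installed, the same narrowing can be expressed through the
# `range` argument of penalty(), which is given in log10 units.
if (requireNamespace("dials", quietly = TRUE)) {
  narrow_dials <- dials::grid_regular(
    dials::penalty(range = c(-4, -1)),
    levels = 20
  )
  print(narrow_dials)
}
```

A grid built this way can be passed to `tune_grid()` through the `grid` argument in the same way as `lambdas` in the code above.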
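Edit
- The error message "`x` and `y` must have the same types and lengths" can be avoided. Initially, dplyr 0.8.5 and rsample 0.0.7 were used. Downgrading rsample to 0.0.6 or upgrading dplyr to 1.0.0 resolved the problem (thanks to Max Kuhn).
- LASSO still fails to find a proper fit.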
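The version combination named in the first edit note can be checked with a small sketch like the one below. `has_reported_conflict()` is a hypothetical helper, not part of any package, and treating every dplyr release below 1.0.0 as affected is an assumption that generalizes from the single reported case (dplyr 0.8.5 with rsample 0.0.7).

```r
# Hypothetical helper: flags the version pair reported in the edit note.
# Assumption: any dplyr version below 1.0.0 combined with rsample 0.0.7
# behaves like the reported dplyr 0.8.5 case.
has_reported_conflict <- function(dplyr_version, rsample_version) {
  package_version(dplyr_version) < package_version("1.0.0") &&
    package_version(rsample_version) == package_version("0.0.7")
}

# The three combinations mentioned in the edit note.
print(has_reported_conflict("0.8.5", "0.0.7"))
print(has_reported_conflict("1.0.0", "0.0.7"))
print(has_reported_conflict("0.8.5", "0.0.6"))

# Check the locally installed versions, if both packages are available.
if (requireNamespace("dplyr", quietly = TRUE) &&
    requireNamespace("rsample", quietly = TRUE)) {
  print(has_reported_conflict(
    as.character(utils::packageVersion("dplyr")),
    as.character(utils::packageVersion("rsample"))
  ))
}
```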
# Package imports ------
library(readr)
library(tidymodels)
#> ── Attaching packages ──────────────────────────────────────── tidymodels 0.1.0 ──
#> ✓ broom 0.5.6 ✓ recipes 0.1.12
#> ✓ dials 0.0.6 ✓ rsample 0.0.7
#> ✓ dplyr 1.0.0 ✓ tibble 3.0.1
#> ✓ ggplot2 3.3.1 ✓ tune 0.1.0
#> ✓ infer 0.5.1 ✓ workflows 0.1.1
#> ✓ parsnip 0.1.1 ✓ yardstick 0.0.6
#> ✓ purrr 0.3.4
#> ── Conflicts ─────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
#> x ggplot2::margin() masks dials::margin()
#> x yardstick::spec() masks readr::spec()
#> x recipes::step() masks stats::step()
library(reprex)
# Data ------
# Prepared according to the Blog post by Julia Silge
# https://juliasilge.com/blog/lasso-the-office/
urlfile = 'https://raw.githubusercontent.com/shudras/office_data/master/office_data.csv'
office = read_csv(url(urlfile))[-1]
#> Warning: Missing column names filled in: 'X1' [1]
#> Parsed with column specification:
#> cols(
#> .default = col_double()
#> )
#> See spec(...) for full column specifications.
#office_split = initial_split(office, strata = season)
#office_train = training(office_split)
#office_test = testing(office_split)
# Lasso modeling -------
## Recipe and train it
office_rec <- recipe(imdb_rating ~ ., data = office) %>%
  #
  step_zv(all_numeric(), -all_outcomes()) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%
  prep(strings_as_factors = FALSE) # Training
## Create workflow
wf <- workflow() %>%
  add_recipe(office_rec)
## Parameter tuning
set.seed(4653)
### Bootstrapping data for resampling
office_boot <- bootstraps(office, times = 5, strata = season)
### Create lambda search grid
lambda_grid <- grid_regular(penalty(), levels = 20)
### The model
tune_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")
### Apply the workflow
lasso_grid <- tune_grid(
  wf %>% add_model(tune_spec),
  resamples = office_boot,
  grid = lambda_grid
)
#> ! Bootstrap1: internal: Standardabweichung ist Null
#> ! Bootstrap2: internal: Standardabweichung ist Null
#> ! Bootstrap3: internal: Standardabweichung ist Null
#> ! Bootstrap4: internal: Standardabweichung ist Null
#> ! Bootstrap5: internal: Standardabweichung ist Null
Created on 2020-06-12 by the reprex package (v0.3.0)
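The German warning in the output above, "Standardabweichung ist Null", is the localized form of base R's "the standard deviation is zero", which `cor()` emits when one of its inputs is constant. One possible source, offered as an assumption rather than a confirmed diagnosis: if a large penalty shrinks every lasso coefficient to zero, all predictions are identical, and a correlation-based metric such as R-squared cannot be computed. The values in `truth` below are made up for illustration and are not taken from the office dataset.

```r
# Made-up outcome values, for illustration only.
truth <- c(7.9, 8.3, 8.8, 9.1)

# An intercept-only model predicts the same value for every row.
constant_pred <- rep(mean(truth), length(truth))

# Capture the warning text instead of letting it print.
warning_text <- tryCatch(
  {
    cor(truth, constant_pred)
    NA_character_
  },
  warning = function(w) conditionMessage(w)
)
print(warning_text)

# The correlation itself is NA, so a squared correlation is NA as well.
r_squared <- suppressWarnings(cor(truth, constant_pred))^2
print(r_squared)
```

If this is what happens here, the warnings would appear for the largest penalty values only, and the metrics for smaller penalties would still be usable.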