r - 有没有办法在 tidymodels 中创建与 tune_grid() 一起使用的自定义指标，允许分组的 data.frame/tibble？

Question

我想做的事

我正在尝试建立一个模型tidymodels来预测药物对细胞系（如细菌）的功效。该模型将根据给定细胞系的功效对药物进行排名，因此我想使用 Spearman 相关性 (ρ) 作为指标。在下面的示例数据集中，每个细胞系（列Sample）由一个字母表示，Q, R, S, ..., Z每个样本用 50 种药物处理。

当我拆分数据进行交叉验证时，每个折叠的训练/测试拆分将具有 >1 个细胞系（例如Q, R，在折叠 1 的测试拆分中），但在计算度量（ρ）时，我想计算它每个细胞系单独，然后对测试拆分中的所有细胞系取平均值，而不是对所有观察结果进行聚合。例如，如果折叠 1 的测试拆分由组成Q, R，那么我想计算 50 种测试药物的 ρ，然后计算 50 种药物测试Q的单独 ρ R，平均这两个 ρ，并将该平均值作为计算的度量折叠 1。

我试过的

我在想我必须计算按Sample列分组的 tibble/data.frame 上的指标，但我不知道如何将该变量传递到tune_grid(). 我不认为我可以在add_formula()创建工作流对象时包含该变量，因为我不希望它作为预测变量。我昨天刚刚发现了 tidymodels，所以也许有一个我不知道的简单解决方案，但到目前为止我还没有在谷歌上找到任何东西。下面的代码是我尝试过的，但显然它不起作用。提前感谢您提供的任何建议。

错误

i Resample1: preprocessor 1/1
✓ Resample1: preprocessor 1/1
i Resample1: preprocessor 1/1, model 1/20
✓ Resample1: preprocessor 1/1, model 1/20
i Resample1: preprocessor 1/1, model 1/20 (predictions)
x Resample1: internal: Error: In metric: `spearman_cor`
unused arguments (truth = ~TargetVariable, estimate = ~.pred, na_rm ...
i Resample2: preprocessor 1/1
✓ Resample2: preprocessor 1/1
i Resample2: preprocessor 1/1, model 1/20
✓ Resample2: preprocessor 1/1, model 1/20
i Resample2: preprocessor 1/1, model 1/20 (predictions)
x Resample2: internal: Error: In metric: `spearman_cor`
unused arguments (truth = ~TargetVariable, estimate = ~.pred, na_rm ...
i Resample3: preprocessor 1/1
✓ Resample3: preprocessor 1/1
i Resample3: preprocessor 1/1, model 1/20
✓ Resample3: preprocessor 1/1, model 1/20
i Resample3: preprocessor 1/1, model 1/20 (predictions)
x Resample3: internal: Error: In metric: `spearman_cor`
unused arguments (truth = ~TargetVariable, estimate = ~.pred, na_rm ...
i Resample4: preprocessor 1/1
✓ Resample4: preprocessor 1/1
i Resample4: preprocessor 1/1, model 1/20
✓ Resample4: preprocessor 1/1, model 1/20
i Resample4: preprocessor 1/1, model 1/20 (predictions)
x Resample4: internal: Error: In metric: `spearman_cor`
unused arguments (truth = ~TargetVariable, estimate = ~.pred, na_rm ...
i Resample5: preprocessor 1/1
✓ Resample5: preprocessor 1/1
i Resample5: preprocessor 1/1, model 1/20
✓ Resample5: preprocessor 1/1, model 1/20
i Resample5: preprocessor 1/1, model 1/20 (predictions)
x Resample5: internal: Error: In metric: `spearman_cor`
unused arguments (truth = ~TargetVariable, estimate = ~.pred, na_rm ...
Warning message:
All models failed. See the `.notes` column.

运行时glmnet_tuning_results：

Warning message:
This tuning result has notes. Example notes on model fitting include:
internal: Error: In metric: `spearman_cor`
unused arguments (truth = ~TargetVariable, estimate = ~.pred, na_rm = ~na_rm)
internal: Error: In metric: `spearman_cor`
unused arguments (truth = ~TargetVariable, estimate = ~.pred, na_rm = ~na_rm)
internal: Error: In metric: `spearman_cor`
unused arguments (truth = ~TargetVariable, estimate = ~.pred, na_rm = ~na_rm)

代码

示例数据集

data = tibble(
  Sample = rep(LETTERS[17:26], each = 50),
  TargetVariable = rnorm(500, mean = 0, sd = 1),
  PredictorVariable1 = rnorm(500, mean = 5, sd = 1),
  PredictorVariable2 = rpois(500, lambda = 5)
)

模型

# Splitting for cross-validation.
set.seed(1026)
folds = group_vfold_cv(data, group = Sample, v = 5)

# Model specification.
glmnet_model = linear_reg(
  mode    = "regression", 
  penalty = tune(), 
  mixture = tune()
) %>%
  set_engine("glmnet")

# Workflow.
glmnet_wf = workflow() %>%
  add_model(glmnet_model) %>% 
  add_formula(TargetVariable ~ . - Sample)

# Grid specification.
glmnet_params = parameters(penalty(), mixture())
set.seed(1026)
glmnet_grid = grid_max_entropy(glmnet_params, size = 20)

# Hyperparameter tuning.
glmnet_tuning_results = tune_grid(
  glmnet_wf,
  resamples = folds,
  grid      = glmnet_grid,
  metrics   = metric_set(spearman_cor),
  control   = control_grid(verbose = TRUE)
)

glmnet_tuning_results %>% show_best(n = 10)

自定义指标

# Vector version.
spearman_cor_vec = function(truth, estimate, na_rm = TRUE) {
  
  spearman_cor_impl = function(truth, estimate) {
    cor(truth, estimate, method = "spearman")
  }
  
  metric_vec_template(
    metric_impl = spearman_cor_impl,
    truth = truth, 
    estimate = estimate,
    na_rm = na_rm,
    cls = "numeric"
  )
}
# Data frame version. 
spearman_cor = function(data) {
  UseMethod("spearman_cor")
}

spearman_cor = new_numeric_metric(spearman_cor, direction = "maximize")

spearman_cor.data.frame = function(data, truth, estimate, na_rm = TRUE) {
  
  data_grouped = data %>%
    group_by(Sample)
  
  metric_summarizer(
    metric_nm = "spearman_cor",
    metric_fn = spearman_cor_vec,
    data = data_grouped,
    truth = !! enquo(truth),
    estimate = !! enquo(estimate), 
    na_rm = na_rm
  )
  
}

会话信息

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.6.3 (2020-02-29)
#>  os       macOS Catalina 10.15.7      
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       America/Chicago             
#>  date     2021-08-25                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date       lib source        
#>  backports     1.1.6   2020-04-05 [1] CRAN (R 3.6.2)
#>  cli           3.0.1   2021-07-17 [1] CRAN (R 3.6.2)
#>  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.6.0)
#>  digest        0.6.25  2020-02-23 [1] CRAN (R 3.6.0)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 3.6.2)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 3.6.0)
#>  fansi         0.4.1   2020-01-08 [1] CRAN (R 3.6.0)
#>  fs            1.3.1   2019-05-06 [1] CRAN (R 3.6.0)
#>  glue          1.4.0   2020-04-03 [1] CRAN (R 3.6.2)
#>  highr         0.8     2019-03-20 [1] CRAN (R 3.6.0)
#>  htmltools     0.5.1.1 2021-01-22 [1] CRAN (R 3.6.2)
#>  knitr         1.27    2020-01-16 [1] CRAN (R 3.6.0)
#>  lifecycle     1.0.0   2021-02-15 [1] CRAN (R 3.6.2)
#>  magrittr      2.0.1   2020-11-17 [1] CRAN (R 3.6.2)
#>  pillar        1.6.2   2021-07-29 [1] CRAN (R 3.6.2)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 3.6.0)
#>  purrr         0.3.4   2020-04-17 [1] CRAN (R 3.6.2)
#>  Rcpp          1.0.4.6 2020-04-09 [1] CRAN (R 3.6.1)
#>  reprex        2.0.1   2021-08-05 [1] CRAN (R 3.6.2)
#>  rlang         0.4.10  2020-12-30 [1] CRAN (R 3.6.2)
#>  rmarkdown     2.1     2020-01-20 [1] CRAN (R 3.6.0)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 3.6.2)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.0)
#>  stringi       1.4.5   2020-01-11 [1] CRAN (R 3.6.0)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 3.6.0)
#>  styler        1.5.1   2021-07-13 [1] CRAN (R 3.6.2)
#>  tibble        3.1.3   2021-07-23 [1] CRAN (R 3.6.2)
#>  utf8          1.1.4   2018-05-24 [1] CRAN (R 3.6.0)
#>  vctrs         0.3.8   2021-04-29 [1] CRAN (R 3.6.2)
#>  withr         2.4.2   2021-04-18 [1] CRAN (R 3.6.2)
#>  xfun          0.12    2020-01-13 [1] CRAN (R 3.6.0)
#>  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.6.0)
#> 
#> [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

score 0 · Accepted Answer

为了使您的自定义指标起作用，您只是缺少一些...参数，因此可以传递参数：

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip

spearman_cor_vec <- function(truth, estimate, na_rm = TRUE) {
    
    spearman_cor_impl <- function(truth, estimate) {
        cor(truth, estimate, method = "spearman")
    }
    
    metric_vec_template(
        metric_impl = spearman_cor_impl,
        truth = truth, 
        estimate = estimate,
        na_rm = na_rm,
        cls = "numeric"
    )
}


spearman_cor <- function(data, ...) {    ## these dots were missing
    UseMethod("spearman_cor")
}

spearman_cor <- new_numeric_metric(spearman_cor, direction = "maximize")

spearman_cor.data.frame <- function(data, truth, estimate, na_rm = TRUE) {
    
    data_grouped = data %>%
        group_by(Sample)
    
    metric_summarizer(
        metric_nm = "spearman_cor",
        metric_fn = spearman_cor_vec,
        data = data_grouped,
        truth = !! enquo(truth),
        estimate = !! enquo(estimate), 
        na_rm = na_rm
    )
    
}

这使得您可以在数据集上使用此指标，如下所示：


df <- tibble(
    Sample = rep(LETTERS[17:26], each = 50),
    TargetVariable = rnorm(500, mean = 0, sd = 1),
    Pred1 = rnorm(500, mean = 5, sd = 1),
    Pred2 = rpois(500, lambda = 5)
)


df %>% 
    mutate(.pred = TargetVariable + rnorm(500, mean = 0, sd = 0.2)) %>% 
    spearman_cor(TargetVariable, .pred)
#> # A tibble: 10 × 4
#>    Sample .metric      .estimator .estimate
#>    <chr>  <chr>        <chr>          <dbl>
#>  1 Q      spearman_cor standard       0.980
#>  2 R      spearman_cor standard       0.975
#>  3 S      spearman_cor standard       0.983
#>  4 T      spearman_cor standard       0.985
#>  5 U      spearman_cor standard       0.978
#>  6 V      spearman_cor standard       0.963
#>  7 W      spearman_cor standard       0.975
#>  8 X      spearman_cor standard       0.979
#>  9 Y      spearman_cor standard       0.987
#> 10 Z      spearman_cor standard       0.969

^{由reprex 包于 2021-08-31 创建(v2.0.1)}

但是，这并不能完全解决您的问题，因为对于调整函数，我们通常只传递预测变量和结果，而不是任何具有其他角色的额外变量。我对此进行了一些研究，但无法完全找到一种方法来使调整函数具有一个仅用于计算指标而不是用于拟合的变量。我不相信我们现在支持这个；您可能想要创建一个代表，解释您的用例，并在 Tune 存储库上发布一个问题，以便我们可以优先考虑这样的新功能。

r - 有没有办法在 tidymodels 中创建与 tune_grid() 一起使用的自定义指标，允许分组的 data.frame/tibble？

我想做的事

我试过的

错误

代码

示例数据集

模型

自定义指标

会话信息

1 回答 1

Related

Reference