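I am using lasso regression to classify some text as AI-related or not. When I fit the model with tidymodels and compute variable importance with vip, the signs are the opposite of what I expect: words like "machine", "learning", and "algorithm" have negative signs.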

Sorry for the lack of a reprex, but here is my code:

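library(tidymodels)   # recipes, rsample, parsnip, workflows, tune, yardstick
library(textrecipes)  # step_tokenize(), step_tokenfilter(), step_tfidf()
library(themis)       # step_upsample() in recent versions
library(vip)          # vi()
library(doParallel)   # makePSOCKcluster(), registerDoParallel()

fy21_raw %>%
    sample_n(5)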

# A tibble: 5 x 3
#  prog_title     text     artificial_intel
#  <chr>          <chr>    <fct>           
#1 Advanced Batt~ "ABMS l~ not             
#2 Energy Effici~ "This e~ not             
#3 Development o~ "This P~ artificial_intel
#4 Unmanned Logi~ "This U~ artificial_intel
#5 FY 2020 SBIR/~ "Fundin~ not 

# Note: the artificial_intel column is a factor with 2 levels: "artificial_intel" and "not"

set.seed(123)
budget_split <- initial_split(fy21_raw, strata = artificial_intel) 
budget_train <- training(budget_split)
budget_test  <- testing(budget_split)

set.seed(234)
budget_folds <- vfold_cv(budget_train, strata = artificial_intel, v = 5) 

budget_rec <- recipe(artificial_intel ~ ., data = budget_train) %>% # update dv with actual name
    update_role(prog_title, new_role = "id") %>%
    step_tokenize(text) %>%
    step_tokenfilter(text, max_tokens = 1000) %>%
    step_upsample(artificial_intel) %>% # update dv with actual name
    step_tfidf(text) %>%
    step_normalize(recipes::all_predictors())

budget_wf <- workflow() %>%
    add_recipe(budget_rec)

lasso_spec <- logistic_reg(penalty = 0.1, mixture = 1) %>%
    set_mode("classification") %>%
    set_engine("glmnet")

all_cores <- parallel::detectCores(logical = FALSE)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)

set.seed(1234)
lasso_res <- budget_wf %>%
    add_model(lasso_spec) %>%
    fit_resamples(resamples = budget_folds,
                  metrics = metric_set(roc_auc, accuracy, sens, spec),
                  control = control_resamples(save_pred = TRUE, pkgs = c('textrecipes')))

set.seed(123)
budget_imp <- budget_wf %>%
    add_model(lasso_spec) %>%
    fit(budget_train) %>%
    pull_workflow_fit() %>%
    vi()

# A tibble: 1,000 x 3
#   Variable              Importance Sign 
#   <chr>                      <dbl> <chr>
# 1 tfidf_text_machine        -6.82  NEG  
# 2 tfidf_text_artificial     -5.84  NEG  
# 3 tfidf_text_learning       -3.69  NEG

Is it computing importance relative to the "not" outcome rather than "artificial_intel"?

1 Answer

From the glmnet vignette:

Note that for "binomial" models, results are returned only for the class corresponding to the second level of the factor response.

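So if you want the coefficient signs to come out as expected, the positive class must be the second factor level when you fit with glmnet. If you use glmnet together with yardstick, keep in mind that yardstick treats the first factor level as the event by default, so you also need to set yardstick.event_first = FALSE.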
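
As a rough illustration of that convention, here is a minimal sketch on simulated data. The predictors, effect sizes, and the s = 0.01 penalty are made up for the example and are not taken from the question. It fits the same lasso twice with glmnet, changing only the order of the factor levels, and the coefficient signs flip with it.

```r
library(glmnet)

set.seed(1)
n <- 500

# Hypothetical predictors standing in for tf-idf columns
x <- matrix(rnorm(n * 3), ncol = 3,
            dimnames = list(NULL, c("machine", "learning", "noise")))

# Assumed data-generating process: larger "machine"/"learning" values
# make the "artificial_intel" class more likely
p <- plogis(2 * x[, "machine"] + 1.5 * x[, "learning"])
label <- ifelse(rbinom(n, 1, p) == 1, "artificial_intel", "not")

# Default level order is alphabetical, so "not" ends up as the second level
y_default <- factor(label)
levels(y_default)  # "artificial_intel" "not"

# Same labels, with the class of interest moved to the second level
y_releveled <- factor(label, levels = c("not", "artificial_intel"))

fit_default   <- glmnet(x, y_default,   family = "binomial", alpha = 1)
fit_releveled <- glmnet(x, y_releveled, family = "binomial", alpha = 1)

# Coefficients at an arbitrary, illustrative penalty value
coef_default   <- as.matrix(coef(fit_default,   s = 0.01))
coef_releveled <- as.matrix(coef(fit_releveled, s = 0.01))

# Side-by-side comparison: glmnet models the second level, so the
# "machine" and "learning" coefficients are negative in the first column
# and positive in the second
cbind(default = coef_default[, 1], releveled = coef_releveled[, 1])
```

In the question's pipeline, the analogous change would be to relevel the outcome before initial_split(), for example with relevel(fy21_raw$artificial_intel, ref = "not"), so that "artificial_intel" becomes the second level. As far as I know, newer yardstick releases deprecate the yardstick.event_first option in favor of an event_level = "second" argument on the metric functions, so check which version you have installed.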

Answered 2020-11-17T16:58:22.717