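I'm using a lasso regression to classify some text as either related to AI or not.
When I fit the model with tidymodels and compute variable importance with vip,
the signs are the opposite of what I would expect: words like "machine", "learning", and "algorithm" have negative signs.
Apologies for the lack of a reprex, but here's my code: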
# Packages inferred from the functions used below (not shown in my original snippet)
library(tidymodels)
library(textrecipes)  # step_tokenize(), step_tokenfilter(), step_tfidf()
library(themis)       # step_upsample() in current releases
library(vip)          # vi()
library(doParallel)   # makePSOCKcluster(), registerDoParallel()
fy21_raw %>%
  sample_n(5)
# A tibble: 5 x 3
# prog_title text artificial_intel
# <chr> <chr> <fct>
#1 Advanced Batt~ "ABMS l~ not
#2 Energy Effici~ "This e~ not
#3 Development o~ "This P~ artificial_intel
#4 Unmanned Logi~ "This U~ artificial_intel
#5 FY 2020 SBIR/~ "Fundin~ not
# Note: the artificial_intel column is a factor with 2 levels: "artificial_intel" and "not"
set.seed(123)
budget_split <- initial_split(fy21_raw, strata = artificial_intel)
budget_train <- training(budget_split)
budget_test <- testing(budget_split)
set.seed(234)
budget_folds <- vfold_cv(budget_train, strata = artificial_intel, v = 5)
budget_rec <- recipe(artificial_intel ~ ., data = budget_train) %>% # update dv with actual name
  update_role(prog_title, new_role = "id") %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = 1000) %>%
  step_upsample(artificial_intel) %>% # update dv with actual name
  step_tfidf(text) %>%
  step_normalize(recipes::all_predictors())
budget_wf <- workflow() %>%
  add_recipe(budget_rec)
lasso_spec <- logistic_reg(penalty = 0.1, mixture = 1) %>%
  set_mode("classification") %>%
  set_engine("glmnet")
all_cores <- parallel::detectCores(logical = FALSE)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)
set.seed(1234)
lasso_res <- budget_wf %>%
  add_model(lasso_spec) %>%
  fit_resamples(
    resamples = budget_folds,
    metrics = metric_set(roc_auc, accuracy, sens, spec),
    control = control_grid(save_pred = TRUE, pkgs = c('textrecipes'))
  )
set.seed(123)
budget_imp <- budget_wf %>%
  add_model(lasso_spec) %>%
  fit(budget_train) %>%
  pull_workflow_fit() %>%
  vi()
# A tibble: 1,000 x 3
# Variable Importance Sign
# <chr> <dbl> <chr>
# 1 tfidf_text_machine -6.82 NEG
# 2 tfidf_text_artificial -5.84 NEG
# 3 tfidf_text_learning -3.69 NEG
Is it computing importance relative to the "not" outcome rather than "artificial_intel"?
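One way to probe this is a minimal sketch on toy data. Base R's glm() treats the first factor level as the reference and models the probability of the second level, and as far as I can tell glmnet does the same for a two-level factor. The data frame `toy` and the columns `x`, `y`, and `y2` are hypothetical and have nothing to do with my budget data. If the sign flips after releveling, the negative signs above would simply mean that "not" is the level being modeled.

```r
# Toy data (hypothetical): x tends to be high when the outcome is "artificial_intel"
set.seed(1)
toy <- data.frame(
  x = c(rnorm(50, mean = 1), rnorm(50, mean = -1)),
  y = factor(rep(c("artificial_intel", "not"), each = 50))
)

# Levels are sorted alphabetically: "artificial_intel" first, "not" second
levels(toy$y)

# glm() models the probability of the second level ("not"),
# so a predictor associated with "artificial_intel" gets a negative coefficient
fit_default <- glm(y ~ x, data = toy, family = binomial)
coef(fit_default)["x"]

# Make "not" the reference level so that "artificial_intel" is the modeled level
toy$y2 <- relevel(toy$y, ref = "not")
fit_releveled <- glm(y2 ~ x, data = toy, family = binomial)
coef(fit_releveled)["x"]
```

If the same thing is happening in my workflow, checking `levels(budget_train$artificial_intel)` should show "not" as the second level.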