r - How to handle NAs caused by new factor levels when using R recipes?

Question

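I preprocessed a training dataset (A) and now want to reproduce those steps on a test set (B) using R recipes.

The problem is that the test set contains new factor levels, which I would like to ignore: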

library(recipes)

(A <- data.frame(a = c(1:19, NA), b = factor(c(rep("l1",18), "l2", NA))))

(B <- data.frame(a = c(1:3, NA), b = factor(c("l1", "l2", NA, "l3"))))

rec.task <- 
  recipe(~ ., data = A) %>% 
  step_unknown(all_predictors(), -all_numeric()) %>% 
  step_medianimpute(all_numeric()) %>%  
  step_other(all_predictors(), -all_numeric(), threshold = 0.1, other=".merged") %>% 
  step_dummy(all_predictors(), -all_numeric()) 

tr.recipe <- prep(rec.task, training = A)
(AA <- juice(tr.recipe))

The problem now is the NA in the following table:

(BB <- bake(tr.recipe, B))

      a b_.merged
  <dbl>     <dbl>
1     1         0
2     2         1
3     3         1
4    10        NA
Warning message:
There are new levels in a factor: NA

Can I somehow avoid this within these steps? Can I set the NA to zero as part of the recipe (I am not interested in base R or dplyr solutions)?

score 1 · Accepted Answer

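As topepo explained, the step_novel function is one possible solution. It assigns factor levels that were not seen during training to a dedicated new level instead of turning them into NA. Change the code that assigns rec.task as follows: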

rec.task <- 
  recipe(~ ., data = A) %>% 
  step_novel(all_predictors(), -all_numeric()) %>% 
  step_unknown(all_predictors(), -all_numeric()) %>% 
  step_medianimpute(all_numeric()) %>%  
  step_other(all_predictors(), -all_numeric(), threshold = 0.1, other=".merged") %>% 
  step_dummy(all_predictors(), -all_numeric()) %>% 
  step_zv(all_predictors())

The output will then be:

# A tibble: 4 x 2
      a b_.merged
  <dbl>     <dbl>
1     1         0
2     2         1
3     3         1
4    10         1
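For readers who want to run the whole thing end to end, here is a minimal sketch that combines the question's data with the revised recipe. It is an illustration and not the answerer's exact code: it assumes a newer recipes release, where step_impute_median() is the current name for step_medianimpute() and bake() takes the data through its new_data argument. The final check only asserts that no NA remains in the baked test set.

library(recipes)

# Data copied from the question
A <- data.frame(a = c(1:19, NA), b = factor(c(rep("l1", 18), "l2", NA)))
B <- data.frame(a = c(1:3, NA), b = factor(c("l1", "l2", NA, "l3")))

rec.task <-
  recipe(~ ., data = A) %>%
  # Unseen levels (here "l3") go to a placeholder level; must run before step_unknown
  step_novel(all_predictors(), -all_numeric()) %>%
  # Missing factor values get their own level
  step_unknown(all_predictors(), -all_numeric()) %>%
  # Assumption: newer name of step_medianimpute()
  step_impute_median(all_numeric()) %>%
  # Rare levels, including the empty placeholder, are pooled into ".merged"
  step_other(all_predictors(), -all_numeric(), threshold = 0.1, other = ".merged") %>%
  step_dummy(all_predictors(), -all_numeric()) %>%
  # Drop columns that are constant in the training data
  step_zv(all_predictors())

tr.recipe <- prep(rec.task, training = A)
BB <- bake(tr.recipe, new_data = B)

# No missing values should be left in the baked test set
stopifnot(!any(is.na(BB)))
BB

The ordering matters: if step_unknown ran first, the unseen level would already have been converted to NA before step_novel could catch it.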

score 0 · Accepted Answer


step_novel() is the solution. See the dummy variables vignette.
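To see what step_novel() does in isolation, here is a small sketch on made-up data (the train and test data frames are hypothetical and not from the question). It relies on the default placeholder level, which is "new".

library(recipes)

# Hypothetical toy data: "l3" appears only in the test set
train <- data.frame(b = factor(c("l1", "l1", "l2")))
test  <- data.frame(b = factor(c("l1", "l3")))

rec <- recipe(~ b, data = train) %>%
  step_novel(b) %>%
  prep(training = train)

out <- bake(rec, new_data = test)

levels(out$b)        # "l1" "l2" "new"
as.character(out$b)  # "l1" "new"

Because the placeholder level already exists after prep(), later steps such as step_other or step_dummy can treat it like any other level.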

Answered 2019-11-25T14:00:10.317
