r - R tidymodels recipes: near-zero-variance filter for numeric attributes

Question

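I am having trouble using step_nzv in R tidymodels recipes to filter out numeric attributes that have small variance but continuous values. It seems to me that this step only works well for nominal values, because it counts the number of unique values and the ratio of the most common value to the second most common. But I have an attribute that is near zero almost everywhere without ever being exactly zero. Do I have to bin it first (and discretizing with bins of equal size changes everything)? The code below is a minimal example. I would expect both columns, low_variance_num and low_variance_nom, to be filtered out, which does not happen: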

library(tidymodels)

data <- tibble(num = seq(1000),rand = runif(1000)) %>% 
  mutate(low_variance_num = ifelse(num == 1, 1, rand/10000),
         low_variance_nom = ifelse(num == 1, 1, 0))

data
var(data$low_variance_num)
var(data$low_variance_nom)

recipe <- recipe(formula = num ~., data = data) %>% 
  update_role("num", new_role = "label") %>%
  step_nzv(all_predictors(), freq_cut = 995/5, unique_cut = 10) %>%
  prep()
summary(recipe)

PS: Is there a way to use recipes without providing a formula? In this case the formula is nonsense.

score 0 · Accepted Answer

For starters, yes, there is a way to use recipes without providing a formula. To do that you call recipe() with only the data as an argument and then manually update the roles via update_role(). This is the recommended approach when the number of variables is very high, as the formula method is memory-inefficient with many variables.

Next, I want to clarify what we mean in tidymodels by "nominal":

Nominal variables include both character and factor.

A numeric variable of all 1s and 0s would not be a nominal variable in tidymodels (would not be selected by all_nominal(), etc).

Next, I want to point out that I don't think step_nzv() is going to do what you are hoping here because you are using the term "variance" in a different sense. If you check out the docs, it describes what we mean here by near-zero-variance:

For example, an example of near-zero variance predictor is one that, for 1000 samples, has two distinct values and 999 of them are a single value.

To be flagged, first, the frequency of the most prevalent value over the second most frequent value (called the "frequency ratio") must be above freq_cut. Secondly, the "percent of unique values," the number of unique values divided by the total number of samples (times 100), must also be below unique_cut.
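To make the two criteria concrete, here is a minimal base R sketch that computes the frequency ratio and the percent of unique values by hand for the two columns from the question. The helper name nzv_stats is hypothetical and is not part of recipes; it only mirrors the definitions quoted above and makes no claim about how step_nzv() is implemented internally.

```r
# Hypothetical helper (not a recipes function): computes the two statistics
# described in the step_nzv() documentation for a single vector.
nzv_stats <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most frequent value's count to the second most frequent's
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Number of distinct values as a percentage of the sample size
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

set.seed(1)
n <- 1000
rand <- runif(n)
low_variance_num <- ifelse(seq_len(n) == 1, 1, rand / 10000)
low_variance_nom <- ifelse(seq_len(n) == 1, 1, 0)

stats_num <- nzv_stats(low_variance_num)
stats_nom <- nzv_stats(low_variance_nom)

stats_nom  # freq_ratio = 999, pct_unique = 0.2
stats_num  # nearly every value is distinct, so pct_unique is far above 10
```

The 0/1 column has a frequency ratio of 999 and only 0.2 percent unique values, so it satisfies both conditions. The continuous column consists almost entirely of distinct values, so its percent of unique values is far above a unique_cut of 10 and it is not flagged, however small its variance is.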

The example low_variance_num variable you made is not particularly low-variance by the definition used in this step; it has lots of unique values.

For reference, here is a demo of how to build a recipe without the formula:

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step

df <- tibble(num = seq(1000), rand = runif(1000)) %>% 
  mutate(pred1 = ifelse(num == 1, 1, rand/10000),
         pred2 = ifelse(num == 1, 1, 0))

rec <- recipe(df) %>% 
  update_role(num, new_role = "label") %>%
  update_role(rand, pred1, pred2, new_role = "predictor") %>%
  step_nzv(all_predictors())

rec %>% prep() %>% bake(new_data = NULL)
#> # A tibble: 1,000 x 3
#>      num  rand     pred1
#>    <int> <dbl>     <dbl>
#>  1     1 0.842 1        
#>  2     2 0.942 0.0000942
#>  3     3 0.977 0.0000977
#>  4     4 0.595 0.0000595
#>  5     5 0.259 0.0000259
#>  6     6 0.454 0.0000454
#>  7     7 0.550 0.0000550
#>  8     8 0.388 0.0000388
#>  9     9 0.702 0.0000702
#> 10    10 0.481 0.0000481
#> # … with 990 more rows

Created on 2021-01-07 by the reprex package (v0.3.0)

The predictor pred2 was removed because it has so few unique values and they are almost all 0. The predictor pred1 was not removed because it has many unique values. I think if I wanted to do the kind of filtering you are describing, I would do it in data cleaning/preparation, not within a feature engineering recipe in a model pipeline.
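As a rough sketch of what such a data-preparation step could look like, the base R snippet below drops predictors whose sample variance falls below a cutoff before any recipe is built. The threshold min_variance = 0.01 is an arbitrary, hypothetical value chosen only for this toy data. Variance depends on the scale of each column, so a real cutoff would need to be chosen per dataset.

```r
set.seed(42)
df <- data.frame(num = seq_len(1000), rand = runif(1000))
df$pred1 <- ifelse(df$num == 1, 1, df$rand / 10000)
df$pred2 <- ifelse(df$num == 1, 1, 0)

predictors <- c("rand", "pred1", "pred2")
min_variance <- 0.01  # hypothetical cutoff, not a recommended default

# Sample variance of each predictor column
variances <- vapply(df[predictors], var, numeric(1))

# Keep every column except the predictors below the cutoff
low_var <- names(variances)[variances < min_variance]
filtered <- df[, setdiff(names(df), low_var), drop = FALSE]

names(filtered)
```

On this data both pred1 and pred2 have a variance of about 0.001 and are dropped, while rand (variance near 1/12) is kept. The filtered data frame can then be passed to recipe() as usual.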
