For starters, yes, there is a way to use recipes without providing a formula. You call recipe() with only the data as an argument and then set the variable roles manually via update_role(). This is the recommended approach when the number of variables is very high, since the formula method is memory-inefficient with many variables.
Next, I want to clarify what we mean in tidymodels by "nominal": nominal variables include both character and factor. A numeric variable of all 1s and 0s would not be a nominal variable in tidymodels (it would not be selected by all_nominal(), etc.).
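To make that concrete, here is a minimal sketch (the toy data and the column names grp/flag are made up) of how a recipe classifies the two types; summary() on a recipe lists each variable's type, and only character/factor columns count as nominal:

toy <- tibble::tibble(
  grp  = c("a", "b", "a"),  # character: nominal
  flag = c(1, 0, 1)         # numeric 0/1: still numeric, not nominal
)
summary(recipes::recipe(toy))
# the `type` column should report grp as nominal and flag as numeric,
# so all_nominal() would select grp but not flag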
Next, I want to point out that I don't think step_nzv() is going to do what you are hoping for here, because you are using the term "variance" in a different sense. If you check out the docs, they describe what we mean here by near-zero variance:
For example, an example of near-zero variance predictor is one that, for 1000 samples, has two distinct values and 999 of them are a single value.
To be flagged, first, the frequency of the most prevalent value over the second most frequent value (called the "frequency ratio") must be above freq_cut. Secondly, the "percent of unique values," the number of unique values divided by the total number of samples (times 100), must also be below unique_cut.
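To make those two criteria concrete, here is a rough sketch of computing them by hand for the 1000-sample example from the docs (I believe the defaults are freq_cut = 95/5 and unique_cut = 10, but check ?step_nzv for your version):

x <- c(rep(0, 999), 1)                              # 1000 samples, two distinct values
tab <- sort(table(x), decreasing = TRUE)
freq_ratio <- tab[[1]] / tab[[2]]                   # 999 / 1 = 999, above freq_cut (95/5 = 19)
pct_unique <- 100 * length(unique(x)) / length(x)   # 0.2, below unique_cut (10)
# both conditions hold, so step_nzv() would flag x and remove it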
The example low_variance_num variable you made is not particularly low variance by the definition used in this step; it has lots of unique values.
For reference, here is a demo of how to build a recipe without the formula:
library(recipes)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
df <- tibble(num = seq(1000), rand = runif(1000)) %>%
  mutate(pred1 = ifelse(num == 1, 1, rand/10000),  # many unique values, just very small ones
         pred2 = ifelse(num == 1, 1, 0))           # two distinct values; 999 of 1000 are 0
rec <- recipe(df) %>%                              # no formula, just the data
  update_role(num, new_role = "label") %>%
  update_role(rand, pred1, pred2, new_role = "predictor") %>%
  step_nzv(all_predictors())
rec %>% prep() %>% bake(new_data = NULL)
#> # A tibble: 1,000 x 3
#> num rand pred1
#> <int> <dbl> <dbl>
#> 1 1 0.842 1
#> 2 2 0.942 0.0000942
#> 3 3 0.977 0.0000977
#> 4 4 0.595 0.0000595
#> 5 5 0.259 0.0000259
#> 6 6 0.454 0.0000454
#> 7 7 0.550 0.0000550
#> 8 8 0.388 0.0000388
#> 9 9 0.702 0.0000702
#> 10 10 0.481 0.0000481
#> # … with 990 more rows
Created on 2021-01-07 by the reprex package (v0.3.0)
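If you want to double-check which columns the step dropped, you can tidy the prepped step (step_nzv() is the first and only step here, hence number = 1); the terms column should list pred2:

rec %>% prep() %>% tidy(number = 1)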
The predictor pred2 was removed because it has so few unique values and they are almost all 0. The predictor pred1 was not removed because it has many unique values. I think if I wanted to do the kind of filtering you are describing, I would do it in data cleaning/preparation, not within a feature engineering recipe in a model pipeline.
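For example, a rough sketch of that kind of cleaning (this assumes dplyr 1.0+ for where(), and the 1e-6 variance threshold is made up; pick whatever cutoff makes sense for your data) might look like:

df_clean <- df %>%
  # keep non-numeric columns, plus numeric columns with enough spread
  select(where(~ !is.numeric(.x) || var(.x, na.rm = TRUE) > 1e-6))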