OK, to be honest, I have read the function reference for step_num2factor and still could not figure out how to use it properly.

temp_names <- as.character(unique(sort(all_raw$MSSubClass)))

price_recipe <-
  recipe(SalePrice ~ ., data = train_raw) %>%
  step_num2factor(MSSubClass, levels = temp_names)


temp_rec <- prep(price_recipe, training = train_raw, strings_as_factors = FALSE) # temporary recipe
temp_data <- bake(temp_rec, new_data = all_raw) # temporary data

class(all_raw$MSSubClass)
# > col_double() 
MSSubClass: Identifies the type of dwelling involved in the sale.

    20  1-STORY 1946 & NEWER ALL STYLES
    30  1-STORY 1945 & OLDER
    40  1-STORY W/FINISHED ATTIC ALL AGES
    45  1-1/2 STORY - UNFINISHED ALL AGES
    50  1-1/2 STORY FINISHED ALL AGES
    60  2-STORY 1946 & NEWER
    70  2-STORY 1945 & OLDER
    75  2-1/2 STORY ALL AGES
    80  SPLIT OR MULTI-LEVEL
    85  SPLIT FOYER
    90  DUPLEX - ALL STYLES AND AGES
   120  1-STORY PUD (Planned Unit Development) - 1946 & NEWER
   150  1-1/2 STORY PUD - ALL AGES
   160  2-STORY PUD - 1946 & NEWER
   180  PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
   190  2 FAMILY CONVERSION - ALL STYLES AND AGES

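After applying the step, the output temp_data$MSSubClass is all NA. The observations are stored as 20, 30, 40, ..., 190, and I want to convert them to the names (or even keep the same numbers, but as an unordered factor).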

If you know of any blog posts or example code that show more about how to use step_num2factor, I would be happy to see those as well.

The full dataset is provided by Kaggle: kaggle data

Thanks in advance.

1 Answer

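I don't think step_num2factor() is the best fit for this variable. Take another look at the help page and notice the transform argument, which can be used to modify the numeric values before the levels are determined. That would work fine if these data were all multiples of 10, but you have some values such as 75 and 85, so I don't think that is what you want. This recipe step is best suited to numeric/integer-like variables that you can easily convert to a set of integers with a simple function.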
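
For what it's worth, here is a minimal base-R sketch of why the original step returns NA. It assumes, from my reading of the help page, that step_num2factor() uses each (possibly transformed) value as a position in levels. The vectors codes, lvls, and x and the helper to_position are made up for illustration; this only mimics the lookup and is not the recipes source code.

```r
# Illustrative only: mimics a positional lookup into `levels`.
# `codes`, `lvls`, `x`, and `to_position` are hypothetical names.
codes <- c(20, 30, 40, 45, 50, 60, 70, 75, 80, 85, 90,
           120, 150, 160, 180, 190)
lvls <- as.character(codes)   # 16 levels, like temp_names in the question

x <- c(20, 60, 190)           # a few raw MSSubClass values

# Using the raw value as an index: 20 asks for the 20th level,
# but only 16 levels exist, so every lookup falls off the end.
by_position <- factor(lvls[x], levels = lvls)
all(is.na(by_position))
#> [1] TRUE

# A transform that maps each code to its position in `codes`
# makes the same lookup succeed.
to_position <- function(v) match(v, codes)
by_match <- factor(lvls[to_position(x)], levels = lvls)
as.character(by_match)
#> [1] "20"  "60"  "190"
```

If the step really does work this way, a transform along the lines of function(x) match(x, codes) would presumably rescue the original recipe, but a plain coercion is simpler.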

Instead, I think you should consider a simple coercion to factor type inside step_mutate():

library(tidyverse)
library(recipes)
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stringr':
#> 
#>     fixed
#> The following object is masked from 'package:stats':
#> 
#>     step

train_raw <- read_csv("~/Downloads/house-prices-advanced-regression-techniques/train.csv")
#> Parsed with column specification:
#> cols(
#>   .default = col_character(),
#>   Id = col_double(),
#>   MSSubClass = col_double(),
#>   LotFrontage = col_double(),
#>   LotArea = col_double(),
#>   OverallQual = col_double(),
#>   OverallCond = col_double(),
#>   YearBuilt = col_double(),
#>   YearRemodAdd = col_double(),
#>   MasVnrArea = col_double(),
#>   BsmtFinSF1 = col_double(),
#>   BsmtFinSF2 = col_double(),
#>   BsmtUnfSF = col_double(),
#>   TotalBsmtSF = col_double(),
#>   `1stFlrSF` = col_double(),
#>   `2ndFlrSF` = col_double(),
#>   LowQualFinSF = col_double(),
#>   GrLivArea = col_double(),
#>   BsmtFullBath = col_double(),
#>   BsmtHalfBath = col_double(),
#>   FullBath = col_double()
#>   # ... with 18 more columns
#> )
#> See spec(...) for full column specifications.

price_recipe <-
  recipe(SalePrice ~ ., data = train_raw) %>%
  step_mutate(MSSubClass = factor(MSSubClass))

juiced_price <- prep(price_recipe) %>%
  juice()

levels(juiced_price$MSSubClass)
#>  [1] "20"  "30"  "40"  "45"  "50"  "60"  "70"  "75"  "80"  "85"  "90"  "120"
#> [13] "160" "180" "190"

juiced_price %>%
  count(MSSubClass)
#> # A tibble: 15 x 2
#>    MSSubClass     n
#>    <fct>      <int>
#>  1 20           536
#>  2 30            69
#>  3 40             4
#>  4 45            12
#>  5 50           144
#>  6 60           299
#>  7 70            60
#>  8 75            16
#>  9 80            58
#> 10 85            20
#> 11 90            52
#> 12 120           87
#> 13 160           63
#> 14 180           10
#> 15 190           30

Created on 2020-05-03 by the reprex package (v0.3.0)

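It looks to me like this gets you the factor levels you want. If you want to save those strings from the .txt file (like "1-STORY 1945 & OLDER") as a vector new_levels, you could say factor(MSSubClass, levels = temp_names, labels = new_levels), with the sorted numeric codes (such as temp_names from the question) as levels and the strings, in the same order, as labels. Passing the strings as levels on their own would not match the numeric values and would give NA.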
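
As a rough base-R sketch of attaching the descriptive strings: factor() matches the data against levels and then displays them as labels, so the numeric codes go in levels and the strings in labels. Only the first three codes from the data description are used here, and codes, x, and f are hypothetical names for illustration.

```r
# Illustrative only: `codes`, `x`, and `f` are made-up names;
# the strings come from the first three rows of the data description.
codes <- c(20, 30, 40)
new_levels <- c("1-STORY 1946 & NEWER ALL STYLES",
                "1-STORY 1945 & OLDER",
                "1-STORY W/FINISHED ATTIC ALL AGES")

x <- c(30, 20, 20, 40)   # stand-in for MSSubClass values

# Match values against the numeric codes, display them as the strings.
f <- factor(x, levels = codes, labels = new_levels)
as.character(f[1])
#> [1] "1-STORY 1945 & OLDER"

# Passing the strings as `levels` alone matches nothing, so all NA.
g <- factor(x, levels = new_levels)
all(is.na(g))
#> [1] TRUE
```

The same factor() call could go inside step_mutate() in place of factor(MSSubClass), with the full set of 16 codes and strings.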

Answered 2020-05-04T00:03:40.570