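I have worked with data sets containing many ordinal and categorical variables, and have had success converting them to numeric values. Here are some examples using house price data.
Ordinal variables
I would first suggest converting ordinal variables to numbers according to their relative order: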
# Build a numeric Exter.Quality column from the text labels in ExterQual
train$Exter.Quality[train$ExterQual == "Excellent"] <- 4
train$Exter.Quality[train$ExterQual == "Good"] <- 3
train$Exter.Quality[train$ExterQual == "Nominal"] <- 2
train$Exter.Quality[train$ExterQual == "Fair"] <- 1
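The same mapping can also be written as a single lookup instead of one assignment per level. This is only a minimal sketch in base R: the toy data frame and the `quality_levels` vector are made up for illustration, and it assumes the column holds exactly the four labels used above.

```r
# Toy data standing in for the housing data (hypothetical values)
train <- data.frame(
  ExterQual = c("Good", "Fair", "Excellent", "Nominal", "Good"),
  stringsAsFactors = FALSE
)

# Labels listed from worst to best, so each label's position is its score
quality_levels <- c("Fair", "Nominal", "Good", "Excellent")

# match() returns the position of each label in quality_levels,
# and NA for any label that is not listed
train$Exter.Quality <- match(train$ExterQual, quality_levels)
```

Any label missing from `quality_levels` ends up as NA, which makes unexpected categories easy to spot.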
Categorical variables
Rank the groups by the mean of the response variable you are modeling (sale price, in my case):
library(dplyr)
# Mean sale price per neighborhood
nbhdprice <- summarize(group_by(train, Neighborhood),
                       mean_price = mean(SalePrice, na.rm = TRUE))
# Split the neighborhoods into low, medium and high price bands
nbhdprice_lo <- filter(nbhdprice, mean_price < 140000)
nbhdprice_med <- filter(nbhdprice, mean_price >= 140000 & mean_price < 200000)
nbhdprice_hi <- filter(nbhdprice, mean_price >= 200000)
train$nbhd_price_level[train$Neighborhood %in% nbhdprice_lo$Neighborhood] <- 1
train$nbhd_price_level[train$Neighborhood %in% nbhdprice_med$Neighborhood] <- 2
train$nbhd_price_level[train$Neighborhood %in% nbhdprice_hi$Neighborhood] <- 3
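For comparison, the same three price bands can be produced in one pass with base R. This is a sketch under assumptions, not the code from the kernel: the toy data and the `nbhd_mean` helper are hypothetical, and the 140000 and 200000 cut-offs are simply the ones used above.

```r
# Toy data standing in for the housing data (hypothetical values)
train <- data.frame(
  Neighborhood = c("A", "A", "B", "B", "C", "C"),
  SalePrice    = c(100000, 120000, 150000, 170000, 250000, 300000)
)

# Mean sale price of each row's neighborhood, aligned with the rows of train
nbhd_mean <- ave(train$SalePrice, train$Neighborhood,
                 FUN = function(x) mean(x, na.rm = TRUE))

# Bin the means: below 140000 -> 1, 140000 up to (not including) 200000 -> 2,
# 200000 and above -> 3
train$nbhd_price_level <- as.integer(
  cut(nbhd_mean, breaks = c(-Inf, 140000, 200000, Inf), right = FALSE)
)
```

With `right = FALSE` each interval is closed on the left, which matches the `>=` comparisons in the filter steps.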
More examples can be found in the code for this kernel: https://www.kaggle.com/skirmer/fun-with-real-estate-data/code