r - One-Hot Encoding in [R] | Categorical to Dummy Variables

Question

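I need to create a new data frame nDF that binarizes all categorical variables while keeping all other variables from the data frame DF. For example, I have the following feature variables: RACE (4 types) and AGE, plus an output variable called CLASS.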

DF =

          RACE        AGE (BELOW 21)   CLASS
Case 1    HISPANIC    0                A
Case 2    ASIAN       1                A
Case 3    HISPANIC    1                D
Case 4    CAUCASIAN   1                B

I would like to convert it into an nDF with five (5) variables, or possibly four (4):

          RACE.1   RACE.2   RACE.3   AGE (BELOW 21)   CLASS
Case 1    0        0        0        0                A
Case 2    0        0        1        1                A
Case 3    0        0        0        1                D
Case 4    0        1        0        1                B

I am familiar with treatment contrasts for the variable DF$RACE. However, if I run

contrasts(DF$RACE) = contr.treatment(4)

what I get is still a DF with three variables, except that DF$RACE now carries a "contrasts" attribute.

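What I ultimately want is a new data frame nDF as shown above, but working this out by hand can get very tedious when there are around 50 feature variables, more than five (5) of which are categorical.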

score 28 · Accepted Answer

dd <- read.table(text="
   RACE        AGE.BELOW.21     CLASS
   HISPANIC          0          A
   ASIAN             1          A
   HISPANIC          1          D
   CAUCASIAN         1          B",
  header=TRUE)


  with(dd,
       data.frame(model.matrix(~RACE-1,dd),
                  AGE.BELOW.21,CLASS))
 ##   RACEASIAN RACECAUCASIAN RACEHISPANIC AGE.BELOW.21 CLASS
 ## 1         0             0            1            0     A
 ## 2         1             0            0            1     A
 ## 3         0             0            1            1     D
 ## 4         0             1            0            1     B

The formula ~RACE-1 specifies that R should create dummy variables from the RACE variable, but suppress the intercept (so that each column represents whether an observation comes from a specified category); the default, without -1, is to make the first column an intercept term (all ones), omitting the dummy variable for the baseline level (first level of the factor) from the model matrix.
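To make the difference concrete, here is a small sketch comparing the two formulas on the same data. The data frame is rebuilt with data.frame() so the snippet runs on its own, and the names with_intercept and without_intercept are only illustrative.

```r
# Same toy data as above, with RACE stored explicitly as a factor
dd <- data.frame(
  RACE = factor(c("HISPANIC", "ASIAN", "HISPANIC", "CAUCASIAN")),
  AGE.BELOW.21 = c(0, 1, 1, 1),
  CLASS = c("A", "A", "D", "B")
)

# Default: intercept column plus dummies for every level except the baseline
with_intercept <- model.matrix(~ RACE, dd)
colnames(with_intercept)
## [1] "(Intercept)"   "RACECAUCASIAN" "RACEHISPANIC"

# Intercept suppressed: one indicator column per level
without_intercept <- model.matrix(~ RACE - 1, dd)
colnames(without_intercept)
## [1] "RACEASIAN"     "RACECAUCASIAN" "RACEHISPANIC"
```

ASIAN is the baseline here only because factor levels are sorted alphabetically by default; relevel() can be used to pick a different one.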

More generally, you might want something like

 dd0 <- subset(dd,select=-CLASS)
 data.frame(model.matrix(~.-1,dd0),CLASS=dd$CLASS)

Note that when you have multiple categorical variables you will have to do something a little bit tricky if you want full sets of dummy variables for each one. I would think of cbind()ing together separate model matrices, but I think there's also some trick for doing this all at once that I forget ...
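One way to do this in a single call, which may or may not be the trick alluded to above, is to pass full (identity) contrast matrices for every factor through the contrasts.arg argument of model.matrix(). The sketch below assumes a second categorical variable, SEX, which is made up purely for illustration; dd2, is_fac, full_contrasts and mm are likewise hypothetical names.

```r
# Toy data with two factors; SEX is a hypothetical extra variable
dd2 <- data.frame(
  RACE = factor(c("HISPANIC", "ASIAN", "HISPANIC", "CAUCASIAN")),
  SEX = factor(c("F", "M", "M", "F")),
  AGE.BELOW.21 = c(0, 1, 1, 1)
)

# Find the factor columns
is_fac <- vapply(dd2, is.factor, logical(1))

# contrasts(x, contrasts = FALSE) returns an identity matrix,
# i.e. one indicator column per level with no level dropped
full_contrasts <- lapply(dd2[is_fac], contrasts, contrasts = FALSE)

# Build the model matrix with full dummy sets for every factor
mm <- model.matrix(~ . - 1, dd2, contrasts.arg = full_contrasts)
colnames(mm)
```

Without contrasts.arg, only the first factor would get a complete set of indicators and SEX would lose its baseline level. Keep in mind that the resulting columns are linearly dependent, which is fine for many machine-learning uses but not for fitting an ordinary linear model.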
