r - One-hot encoding of multiple variables

Question

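I have a dataset in which all of my data is categorical, and I would like to apply one-hot encoding for further analysis.

The main issues I would like to solve:

Some cells contain several text values within a single cell (an example is shown below).
Some numeric values need to be converted to factors for further processing.

The data has 3 columns: Age, Info, and Target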

mydf <- structure(list(Age = c(99L, 10L, 40L, 15L), Info = c("c(\"good\", \"bad\", \"sad\"", 
"c(\"nice\", \"happy\", \"joy\"", "NULL", "c(\"okay\", \"nice\", \"fun\", \"wild\", \"go\""
), Target = c("Boy", "Girl", "Boy", "Boy")), .Names = c("Age", 
"Info", "Target"), row.names = c(NA, 4L), class = "data.frame")

I would like to one-hot encode all of the variables shown above, so that the result looks like this:

       Age_99 Age_10 Age_40 Age_15 good bad sad nice happy joy null okay nice fun wild go Boy Girl 
         1      0       0     0      1   1    1   0     0    0   0   0    0   0   0    0   0   0
         0      1       0     0      0   0    0   1     1    1   0   0    0   0   0    0   0   1

Some of the questions on SO that I have checked are this and this.

score 2 · Accepted Answer

I think the following should work:

library(data.table)      ## Provides as.data.table, melt and dcast
library(splitstackshape) ## Provides cSplit
library(magrittr)        ## Provides %>%

suppressWarnings({                               ## Just to silence melt
  mydf %>%                                       ## The dataset
    as.data.table(keep.rownames = TRUE) %>%      ## Convert to data.table
    .[, Info := gsub("c\\(|\"", "", Info)] %>%   ## Strip out c( and quotes
    cSplit("Info", ",") %>%                      ## Split the "Info" column
    melt(id.vars = "rn") %>%                     ## Melt everything except rn
    dcast(rn ~ value, fun.aggregate = length)    ## Go wide
})
#    rn 10 15 40 99 Boy Girl NULL bad fun go good happy joy nice okay sad wild NA
# 1:  1  0  0  0  1   1    0    0   1   0  0    1     0   0    0    0   1    0  2
# 2:  2  1  0  0  0   0    1    0   0   0  0    0     1   1    1    0   0    0  2
# 3:  3  0  0  1  0   1    0    1   0   0  0    0     0   0    0    0   0    0  4
# 4:  4  0  1  0  0   1    0    0   0   1  1    0     0   0    1    1   0    1  0
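
The same steps (strip, split, reshape to long form, tabulate) can also be sketched in base R for readers who do not have splitstackshape installed. This is only an illustration under a few assumptions: the names `info_list`, `long`, `tab` and `encoded` are made up here, ages are prefixed with `Age_` to match the layout asked for in the question, and counts are capped at 1 so the result is a strict 0/1 matrix. Because it builds the long table directly from the split tokens, no `NA` column appears.

```r
# Sample data from the question, rebuilt with data.frame()
mydf <- data.frame(
  Age = c(99L, 10L, 40L, 15L),
  Info = c("c(\"good\", \"bad\", \"sad\"",
           "c(\"nice\", \"happy\", \"joy\"",
           "NULL",
           "c(\"okay\", \"nice\", \"fun\", \"wild\", \"go\""),
  Target = c("Boy", "Girl", "Boy", "Boy"),
  stringsAsFactors = FALSE
)

# Strip out c( and quotes, then split each cell on commas
info_clean <- gsub("c\\(|\"", "", mydf$Info)
info_list <- strsplit(info_clean, ",\\s*")

# Long form: one (row number, value) pair per token, per age and per target
rn <- seq_len(nrow(mydf))
long <- rbind(
  data.frame(rn = rep(rn, lengths(info_list)), value = unlist(info_list),
             stringsAsFactors = FALSE),
  data.frame(rn = rn, value = paste0("Age_", mydf$Age),
             stringsAsFactors = FALSE),
  data.frame(rn = rn, value = mydf$Target, stringsAsFactors = FALSE)
)

# Go wide: cross-tabulate rows against values and cap the counts at 1
tab <- table(long$rn, long$value)
encoded <- (unclass(tab) > 0) * 1L
encoded
```

The result is an integer matrix with one row per original row and one column per distinct value; wrap it in `as.data.frame()` if a data frame is needed.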

Here is the sample data I used:

mydf <- structure(list(Age = c(99L, 10L, 40L, 15L), Info = c("c(\"good\", \"bad\", \"sad\"", 
    "c(\"nice\", \"happy\", \"joy\"", "NULL", "c(\"okay\", \"nice\", \"fun\", \"wild\", \"go\""
    ), Target = c("Boy", "Girl", "Boy", "Boy")), .Names = c("Age", 
    "Info", "Target"), row.names = c(NA, 4L), class = "data.frame")

score 0 · Accepted Answer

You can use the grepl function to scan each string for whatever you are looking for, and use ifelse to fill in the column accordingly. Something like:

 # This creates a new column named 'good': 1 if the string contains "good", 0 if not
 data$good = ifelse(grepl("good", data$Info), 1, 0)
 # and do this for each variable of interest

Finally, you can drop the Info column if you wish. That way you do not have to create any new data tables.

 data$Info <- NULL

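Note that you should replace "data" with the actual name of your dataset. As for the age issue, there is no need to convert to a factor; just use: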

data$age99 = ifelse(data$Age == 99, 1,0) # and so forth for the other ages
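
Repeating those lines by hand for every word and age gets tedious, so the same idea can be sketched as a loop. This is a minimal illustration, not part of the answer above: the `words` vector is a hypothetical list of terms of interest, the `Age_` prefix is borrowed from the question's desired output, and the `\\b` word boundaries are an added precaution, since a plain `grepl("go", ...)` would also match inside "good".

```r
# Sample data from the question, stored under the placeholder name `data`
data <- data.frame(
  Age = c(99L, 10L, 40L, 15L),
  Info = c("c(\"good\", \"bad\", \"sad\"",
           "c(\"nice\", \"happy\", \"joy\"",
           "NULL",
           "c(\"okay\", \"nice\", \"fun\", \"wild\", \"go\""),
  Target = c("Boy", "Girl", "Boy", "Boy"),
  stringsAsFactors = FALSE
)

# Hypothetical list of words to turn into indicator columns
words <- c("good", "bad", "sad", "nice", "happy", "joy",
           "okay", "fun", "wild", "go")

for (w in words) {
  # Word boundaries keep "go" from matching inside "good"
  pattern <- paste0("\\b", w, "\\b")
  data[[w]] <- ifelse(grepl(pattern, data$Info), 1, 0)
}

# One indicator column per distinct age, no factor conversion needed
for (a in unique(data$Age)) {
  data[[paste0("Age_", a)]] <- ifelse(data$Age == a, 1, 0)
}

# Drop the original text column once the indicators exist
data$Info <- NULL
data
```

The word list has to be maintained by hand in this approach, which is its main drawback compared with splitting the strings and tabulating whatever values turn up.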
