I have a dataset in which all of the data is categorical, and I want to apply one-hot encoding to it for further analysis.

The main issues I want to solve:

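  • Some cells contain several text values in a single cell (an example is shown below).
  • Some numeric values need to be converted to factors for further processing.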

The data has three columns: Age, Info, and Target.

mydf <- structure(list(Age = c(99L, 10L, 40L, 15L), Info = c("c(\"good\", \"bad\", \"sad\"", 
"c(\"nice\", \"happy\", \"joy\"", "NULL", "c(\"okay\", \"nice\", \"fun\", \"wild\", \"go\""
), Target = c("Boy", "Girl", "Boy", "Boy")), .Names = c("Age", 
"Info", "Target"), row.names = c(NA, 4L), class = "data.frame")

I want to create a one-hot encoding for all of the variables shown above, so that the result looks like this:

       Age_99 Age_10 Age_40 Age_15 good bad sad nice happy joy null okay fun wild go Boy Girl
            1      0      0      0    1   1   1    0     0   0    0    0   0    0  0   1    0
            0      1      0      0    0   0   0    1     1   1    0    0   0    0  0   0    1

Some of the SO questions I have already looked at are this and this.

2 Answers

I think the following should work:

library(data.table)       ## as.data.table(), melt(), dcast(), and `:=`
library(splitstackshape)  ## cSplit()
library(magrittr)         ## %>%

suppressWarnings({                               ## Just to silence melt
  mydf %>%                                       ## The dataset
    as.data.table(keep.rownames = TRUE) %>%      ## Convert to data.table
    .[, Info := gsub("c\\(|\"", "", Info)] %>%   ## Strip out c( and quotes
    cSplit("Info", ",") %>%                      ## Split the "Info" column
    melt(id.vars = "rn") %>%                     ## Melt everything except rn
    dcast(rn ~ value, fun.aggregate = length)    ## Go wide
})
#    rn 10 15 40 99 Boy Girl NULL bad fun go good happy joy nice okay sad wild NA
# 1:  1  0  0  0  1   1    0    0   1   0  0    1     0   0    0    0   1    0  2
# 2:  2  1  0  0  0   0    1    0   0   0  0    0     1   1    1    0   0    0  2
# 3:  3  0  0  1  0   1    0    1   0   0  0    0     0   0    0    0   0    0  4
# 4:  4  0  1  0  0   1    0    0   0   1  1    0     0   0    1    1   0    1  0
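
For comparison, the same split, melt, and cast idea can be sketched in base R without any packages. This is only an illustration and not part of the answer above: the `Age_` prefix and the names `tokens`, `long`, and `wide` are assumptions chosen for the example, and the data is the sample from the question.

```r
# Sample data from the question (the unbalanced "c(" is in the original strings).
mydf <- data.frame(
  Age = c(99L, 10L, 40L, 15L),
  Info = c('c("good", "bad", "sad"',
           'c("nice", "happy", "joy"',
           'NULL',
           'c("okay", "nice", "fun", "wild", "go"'),
  Target = c("Boy", "Girl", "Boy", "Boy"),
  stringsAsFactors = FALSE
)

# Strip 'c(' and quotes, then split each cell on commas (plus optional spaces).
tokens <- strsplit(gsub('c\\(|"', "", mydf$Info), ",\\s*")
rows   <- seq_len(nrow(mydf))

# Long format: one (row, value) pair per category that applies to the row.
long <- rbind(
  data.frame(row = rows, value = paste0("Age_", mydf$Age),
             stringsAsFactors = FALSE),
  data.frame(row = rep(rows, lengths(tokens)), value = unlist(tokens),
             stringsAsFactors = FALSE),
  data.frame(row = rows, value = mydf$Target,
             stringsAsFactors = FALSE)
)

# Cross-tabulate rows against values to go wide: one 0/1 column per value.
wide <- as.data.frame.matrix(table(long$row, long$value))
```

Because each row contributes only the values it actually has, this version produces no `NA` column. The literal string `"NULL"` is still treated as a category, as in the output above.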

This is the sample data I used:

mydf <- structure(list(Age = c(99L, 10L, 40L, 15L), Info = c("c(\"good\", \"bad\", \"sad\"", 
    "c(\"nice\", \"happy\", \"joy\"", "NULL", "c(\"okay\", \"nice\", \"fun\", \"wild\", \"go\""
    ), Target = c("Boy", "Girl", "Boy", "Boy")), .Names = c("Age", 
    "Info", "Target"), row.names = c(NA, 4L), class = "data.frame")
answered 2016-03-18T15:49:34.950

You can use the grepl function to scan each string for whatever you are looking for, and use ifelse to fill in the column accordingly. Something like:

 # This creates a new column labeled 'good': 1 if the string contains "good", 0 if not
 data$good <- ifelse(grepl("good", data$Info), 1, 0)
 # and do this for each variable of interest
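
Repeating that line for each word can be automated with a loop. One caveat is that grepl does substring matching, so a plain pattern such as "go" also matches inside "good". The sketch below is only an illustration: it uses the sample data from the question, assumes the words are plain alphabetic strings that need no regex escaping, and the name `words` is made up for the example.

```r
# Sample data from the question, stored under the placeholder name `data`.
data <- data.frame(
  Age = c(99L, 10L, 40L, 15L),
  Info = c('c("good", "bad", "sad"',
           'c("nice", "happy", "joy"',
           'NULL',
           'c("okay", "nice", "fun", "wild", "go"'),
  Target = c("Boy", "Girl", "Boy", "Boy"),
  stringsAsFactors = FALSE
)

# Collect the distinct words that appear anywhere in Info.
words <- unique(unlist(strsplit(gsub('c\\(|"', "", data$Info), ",\\s*")))

# One indicator column per word. The \\b word boundaries keep "go" from
# matching inside "good", which a bare grepl("go", ...) would do.
for (w in words) {
  data[[w]] <- ifelse(grepl(paste0("\\b", w, "\\b"), data$Info), 1, 0)
}
```

With the sample data this adds 11 indicator columns, including one for the literal string "NULL".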

Finally, you can drop the Info column if you wish. This way you do not have to create any new data tables.

 data$Info <- NULL

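Note that you should change "data" to the actual name of your dataset. As for the Age column, there is no need to convert it to a factor; just use: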

data$age99 <- ifelse(data$Age == 99, 1, 0) # and so forth for the other ages
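
The "and so forth" step can be written as a loop over the distinct ages. As a possible alternative, model.matrix from the stats package can build all of the indicator columns at once. Both options below are sketches on the sample data; the `Age_` prefix and the name `age_dummies` are assumptions made for the example.

```r
# Sample data from the question (only the columns needed here).
data <- data.frame(
  Age = c(99L, 10L, 40L, 15L),
  Target = c("Boy", "Girl", "Boy", "Boy"),
  stringsAsFactors = FALSE
)

# Option 1: loop over the distinct ages, one ifelse() call per age.
for (a in unique(data$Age)) {
  data[[paste0("Age_", a)]] <- ifelse(data$Age == a, 1, 0)
}

# Option 2: let model.matrix() build the indicators. The "- 1" drops the
# intercept so that every age level gets its own column.
age_dummies <- model.matrix(~ factor(Age) - 1, data = data)
```

Option 2 does convert Age to a factor internally, so it trades the explicit loop for a single call.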

answered 2016-03-17T23:41:34.990