
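I have some factors in R that are salary ranges of the form $100,001 - $150,000, over $150,000, $25,000, etc., and I would like to convert them to numeric values (e.g., convert the factor $100,001 - $150,000 to the integer 125000).

Similarly, I have education categories such as High School Diploma, Current Undergraduate, PhD, etc. that I would like to assign numbers to (e.g., give PhD a higher value than High School Diploma).

Given a data frame containing these values, how would I go about doing this?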


3 Answers


To convert the salary ranges:

# data
df <- data.frame(sal = c("$100,001 - $150,000", "over $150,000", "$25,000"),
                 educ = c("High School Diploma", "Current Undergraduate", "PhD"),
                 stringsAsFactors = FALSE)

# Remove comma and dollar sign
temp <- gsub("[,$]","", df$sal)

# remove text
temp <- gsub("[[:alpha:]]","", temp)

# get average over range
df$ave.sal <- sapply(strsplit(temp , "-") , function(i) mean(as.numeric(i)))


For your education levels, if you want them as numbers:

df$educ.f <- as.numeric(factor(df$educ , levels=c("High School Diploma" ,
          "Current Undergraduate", "PhD")))


df
#                  sal                  educ  ave.sal educ.f
# 1 $100,001 - $150,000   High School Diploma 125000.5      1
# 2       over $150,000 Current Undergraduate 150000.0      2
# 3             $25,000                   PhD  25000.0      3
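If the goal is only that PhD should rank above High School Diploma, an ordered factor may be enough on its own. The following is a minimal sketch, assuming the three education levels from the example data; the name educ.ord is made up for illustration.

```r
# Hypothetical example vector; the levels are listed from lowest to highest.
educ <- c("High School Diploma", "Current Undergraduate", "PhD")

educ.ord <- factor(educ,
                   levels = c("High School Diploma",
                              "Current Undergraduate",
                              "PhD"),
                   ordered = TRUE)

# Ordered factors support comparison operators directly.
phd_is_higher <- educ.ord[3] > educ.ord[1]

# The underlying integer codes follow the order of the levels.
educ.codes <- as.integer(educ.ord)
```

The integer codes only encode rank, so the gaps between them carry no meaning unless you choose to give them one.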



EDIT

Missing / NA values should not matter:

# Data that includes missing values

df <- data.frame(sal = c("$100,001 - $150,000", "over $150,000", "$25,000", NA),
                 educ = c(NA, "High School Diploma", "Current Undergraduate", "PhD"),
                 stringsAsFactors = FALSE)

Rerunning the commands above gives:

df
#                  sal                  educ  ave.sal educ.f
# 1 $100,001 - $150,000                  <NA> 125000.5     NA
# 2       over $150,000   High School Diploma 150000.0      1
# 3             $25,000 Current Undergraduate  25000.0      2
# 4                <NA>                   PhD       NA      3
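Since the same steps are rerun on each new data frame, it may be convenient to wrap them in a function. This is a rough sketch under the same assumptions as the code above (ranges separated by "-", amounts written with "$" and ","); the name range_midpoint is made up for illustration.

```r
# Hypothetical helper: turn salary-range strings into numeric midpoints.
range_midpoint <- function(x) {
  cleaned <- gsub("[,$]", "", x)               # drop commas and dollar signs
  cleaned <- gsub("[[:alpha:]]", "", cleaned)  # drop words such as "over"
  # Split on "-" and average whatever numbers remain; NA stays NA.
  vapply(strsplit(cleaned, "-"),
         function(parts) mean(as.numeric(parts)),
         numeric(1))
}

sal <- c("$100,001 - $150,000", "over $150,000", "$25,000", NA)
midpoints <- range_midpoint(sal)
```

Note that an open-ended value such as "over $150,000" is reduced to its stated bound (150000), as in the output shown above, so you may want to treat those rows separately.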
answered 2014-04-15T23:25:32.577

You can use the recode function from the car package.

For example:

library(car)
df$salary <- recode(df$salary, 
    "'$100,001 - $150,000'=125000;'$150,000'=150000")

See the help file for more information on how to use this function.

answered 2014-04-16T00:07:26.060

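I would just make a vector of values that corresponds to your factor's levels and map them across. The code below is a less elegant solution than I would like, because I could not figure out how to do it with a single vector operation, but if your data is not very large it will do the job. Say we want to map the factor elements of fact to the numbers in vals: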

fact<-as.factor(c("a","b","c"))
vals<-c(1,2,3)

#for example:
vals[levels(fact)=="b"]
# gives: [1] 2

# now make an example data frame:
data <- data.frame(myvar = fact[sample(1:3, 10, replace = TRUE)])

# our vlookup function:
vlookup <- function(fact, vals, x) {
    # the levels of fact and vals must line up one-to-one
    stopifnot(length(levels(fact)) == length(vals))

    out <- rep(vals[1], length(x))
    for (i in seq_along(x)) {
        # find the level that matches x[i] and take the value at that position
        out[i] <- vals[levels(fact) == as.character(x[i])]
    }
    return(out)
}

#test it:
data$myvarNumeric<-vlookup(fact,vals,data$myvar)

This should work for what you are describing.
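For what it is worth, the same lookup can probably be done without a loop by indexing a named vector. The following is a minimal sketch, assuming vals is in the same order as levels(fact); the names lookup, x, by_name, and by_code are made up for illustration.

```r
fact <- factor(c("a", "b", "c"))
vals <- c(1, 2, 3)

# Name each value after the factor level it belongs to.
lookup <- setNames(vals, levels(fact))

# Hypothetical column to translate.
x <- factor(c("b", "c", "a", "b"), levels = levels(fact))

# Index by level name; unname() drops the names carried over from lookup.
by_name <- unname(lookup[as.character(x)])

# Equivalent shortcut when x shares fact's levels: index by the integer codes.
by_code <- vals[as.integer(x)]
```

Indexing by name is the safer of the two, since it does not depend on x having its levels in the same order as fact.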

answered 2014-04-15T23:52:58.570