r - Recoding 150+ categorical variables in R

Question

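I have a file with a bunch of cities in it (183 so far), but none of them has a county mapped to it, which is what I need. For recoding categorical variables I usually use plyr's rename() function, but I don't want to write a messy block of code to recode all of these cities. I've also been learning a bit of Python lately, and this sounds a bit like a dictionary/hash table problem. If possible, I'd like to learn to do something more programmatic.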

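As a first stab, I went ahead and created a .csv with each city's name in one column and its county in another. I'd like to somehow join it with the file I need, so that the counties can be mapped. Some minimal code to illustrate what I mean: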

#key_file: 
LocalityName <- c('Addy', 'Burien', 'Newman Lake', 'Seattle', 'Tacoma')
CountyName <- c('Stevens', 'King', 'Spokane', 'King', 'Pierce')
key <- cbind.data.frame(LocalityName, CountyName)

#real_file:
LocalityName <- c('Seattle', 'Seattle', 'Tacoma', 'Seattle', 'Newman Lake')
CountyName <- rep(NA, length(LocalityName))
Extra_Example_Col <- c('Y', 'Y', 'N', 'N', 'N')
real <- cbind.data.frame(LocalityName, CountyName, Extra_Example_Col)

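I tried using join() from plyr but couldn't get it to work (if that's the right track, I can update with my code; I'm not sure). I'm also aware of the sqldf package, but since I'm only now learning SQL for the first time, I'm not sure whether this is a type of join. My brain thinks of it as a "one-to-many" mapping.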

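I think trying to learn all these other languages at once has confused me a bit, but it has given me some ideas about what to try. My preferred solution would be idiomatic R.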

score 3 · Accepted Answer

For the mapping, you can use merge. For example:

merge(real, key, by='LocalityName', all.x=TRUE)
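
The all.x = TRUE argument makes this a left join: every row of real is kept, and a locality that is missing from key gets NA instead of being dropped. A minimal sketch of the difference, assuming the placeholder CountyName column is left out of real and using a made-up locality ("Atlantis") that is not in the question's data:

```r
# Lookup table: one row per locality (same values as the question's key).
key <- data.frame(
  LocalityName = c("Addy", "Burien", "Newman Lake", "Seattle", "Tacoma"),
  CountyName   = c("Stevens", "King", "Spokane", "King", "Pierce"),
  stringsAsFactors = FALSE
)

# Data to annotate. "Atlantis" is a hypothetical locality with no entry in key.
real <- data.frame(
  LocalityName      = c("Seattle", "Tacoma", "Atlantis"),
  Extra_Example_Col = c("Y", "N", "N"),
  stringsAsFactors  = FALSE
)

# Left join: keeps all rows of real, unmatched localities get NA.
left <- merge(real, key, by = "LocalityName", all.x = TRUE)

# Default (inner) join: rows without a match in key are dropped.
inner <- merge(real, key, by = "LocalityName")

nrow(left)   # 3
nrow(inner)  # 2
left$CountyName[left$LocalityName == "Atlantis"]  # NA
```

With the default inner join, a misspelled or missing city name would silently disappear from the result, so the left join is usually the safer choice for this kind of recoding.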

Answered 2014-10-22T21:55:06.663

score 0 · Accepted Answer

If I understand your question correctly, you can use merge from base R or join from plyr. For example:

# Key_file: 
LocalityName <- c('Addy', 'Burien', 'Newman Lake', 'Seattle', 'Tacoma')
CountyName <- c('Stevens', 'King', 'Spokane', 'King', 'Pierce')
key <- cbind.data.frame(LocalityName, CountyName)

# Real_file:
LocalityName <- c('Seattle', 'Seattle', 'Tacoma', 'Seattle', 'Newman Lake')
CountyName <- rep(NA, length(LocalityName))
Extra_Example_Col <- c('Y', 'Y', 'N', 'N', 'N')
real <- cbind.data.frame(LocalityName, CountyName, Extra_Example_Col)

# merge
merge(real, key, by = "LocalityName")
##   LocalityName CountyName.x Extra_Example_Col CountyName.y
## 1  Newman Lake           NA                 N      Spokane
## 2      Seattle           NA                 Y         King
## 3      Seattle           NA                 Y         King
## 4      Seattle           NA                 N         King
## 5       Tacoma           NA                 N       Pierce

# plyr::join (requires the plyr package to be loaded)
library(plyr)
join(real, key, by = "LocalityName")
##   LocalityName CountyName Extra_Example_Col CountyName
## 1      Seattle         NA                 Y       King
## 2      Seattle         NA                 Y       King
## 3       Tacoma         NA                 N     Pierce
## 4      Seattle         NA                 N       King
## 5  Newman Lake         NA                 N    Spokane

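Note that with merge you get CountyName.x and CountyName.y, because the same column exists in both data sets. With join, you end up with two columns that are both named CountyName. You probably don't want to initialize the CountyName column in the real data.frame at all. For example, either create it with real <- cbind.data.frame(LocalityName, Extra_Example_Col), or drop the column with real[["CountyName"]] <- NULL before merging.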
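
To make that last point concrete, here is a small sketch of dropping the placeholder column before merging. It also shows a lookup with base R's match(), which is not part of this answer but is close to the dictionary-style lookup the question mentions, and which keeps the rows of real in their original order. The name real_no_county is only illustrative.

```r
key <- data.frame(
  LocalityName = c("Addy", "Burien", "Newman Lake", "Seattle", "Tacoma"),
  CountyName   = c("Stevens", "King", "Spokane", "King", "Pierce"),
  stringsAsFactors = FALSE
)

real <- data.frame(
  LocalityName      = c("Seattle", "Seattle", "Tacoma", "Seattle", "Newman Lake"),
  CountyName        = NA,  # empty placeholder, as in the question
  Extra_Example_Col = c("Y", "Y", "N", "N", "N"),
  stringsAsFactors  = FALSE
)

# Option 1: drop the empty placeholder, then merge.
# merge() returns the rows sorted by LocalityName.
real_no_county <- real
real_no_county[["CountyName"]] <- NULL
merged <- merge(real_no_county, key, by = "LocalityName", all.x = TRUE)
names(merged)  # "LocalityName" "Extra_Example_Col" "CountyName"

# Option 2: fill the placeholder in place with match(), keeping row order.
# match() gives, for each row of real, the position of its locality in key.
idx <- match(real$LocalityName, key$LocalityName)
real$CountyName <- key$CountyName[idx]
real$CountyName  # "King" "King" "Pierce" "King" "Spokane"
```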

score 0 · Accepted Answer

library(data.table)

key  <- as.data.table(key)
real <- as.data.table(real)

## If necessary, make sure your values are strings, not factors, etc
key[, LocalityName := as.character(LocalityName)]
real[, LocalityName := as.character(LocalityName)]

## Set the keys, this is for joining.
##  not to be confused with your object named "key"
setkey(key, LocalityName)
setkey(real, LocalityName)

## Ensure you have a character and not a logical 
key[, CountyName := as.character(CountyName)]
real[, CountyName := as.character(CountyName)]

## The i.X notation indicates to take the value 
##   from the column inside the [brackets]
real[key, CountyName := i.CountyName]

real
#    LocalityName CountyName Extra_Example_Col
# 1:  Newman Lake    Spokane                 N
# 2:      Seattle       King                 Y
# 3:      Seattle       King                 Y
# 4:      Seattle       King                 N
# 5:       Tacoma     Pierce                 N
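
Whichever join is used, two quick checks on the lookup table may save trouble with 183 localities: each locality should appear only once in key (otherwise a merge would duplicate rows), and every locality in the data should have an entry. A base R sketch of those checks, not taken from any of the answers, with "Atlantis" added as a made-up unmatched value and real_localities as an illustrative name:

```r
key <- data.frame(
  LocalityName = c("Addy", "Burien", "Newman Lake", "Seattle", "Tacoma"),
  CountyName   = c("Stevens", "King", "Spokane", "King", "Pierce"),
  stringsAsFactors = FALSE
)

# Locality column of the data to annotate; "Atlantis" is hypothetical.
real_localities <- c("Seattle", "Seattle", "Tacoma", "Atlantis")

# Localities listed more than once in the lookup table (should be empty).
dup_keys <- unique(key$LocalityName[duplicated(key$LocalityName)])

# Localities in the data that have no county in the lookup table.
unmatched <- setdiff(unique(real_localities), key$LocalityName)

dup_keys   # character(0)
unmatched  # "Atlantis"
```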
