r - Identify similar names in the same column, then select the mode

Question

My data includes a column of names. Some of the names are written in up to eight different ways. I tried to group them with the following code:

groups <- list()
i <- 1
while(length(x) > 0)
{
  id <- agrep(x[1], x, ignore.case = TRUE, max.distance = 0.1)
  groups[[i]] <- x[id]
  x <- x[-id]
  i <- i + 1
}

head(groups)
groups

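Next, I would like to add a new column that returns, for each row, a single spelling of the name, e.g. the most frequently used one. The result should look like this: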

      A            B
1. John Snow    John Snow
2. Peter Wright Peter Wright
3. john snow    John Snow
4. John snow    John Snow
5. Peter wright Peter Wright
6. J. Snow      John Snow
7. John Snow    John Snow
etc.

How do I get there?

score 2 · Accepted Answer

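This answer is based heavily on a previous question/answer that grouped the strings. It only adds two steps: finding the mode of each group, and assigning that mode back to the original strings.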

## The data
Names = c("John Snow", "Peter Wright",  "john snow",
    "John snow", "Peter wright", "J. Snow", "John Snow")

## Grouping like in the previous question
groups <- list()
i <- 1
x = Names
while(length(x) > 0)
{
  id <- agrep(x[1], x, ignore.case = TRUE, max.distance = 0.25)
  groups[[i]] <- x[id]
  x <- x[-id]
  i <- i + 1
}

## Find the mode for each group
Modes = sapply(groups, function(x) names(which.max(table(x))))

## Assign the correct mode to each string
StandardName = rep("", length(Names))
for(i in seq_along(groups)) {
    StandardName[Names %in% groups[[i]]] = Modes[i]
}

StandardName
[1] "John Snow"    "Peter wright" "John Snow"    "John Snow"    "Peter wright"
[6] "John Snow"    "John Snow"
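
Note that the output gives "Peter wright" where the question asked for "Peter Wright". Both spellings occur once, so the group is tied; table() lists its entries in sorted order and which.max() returns the first maximum, so the winner of a tie depends on sort order rather than on frequency. A minimal sketch of one possible tie-break, preferring the spelling that appears first in the data. The helper name mode_first_seen is made up for illustration and is not part of the original answer:

```r
# Hypothetical helper: most frequent value of x, with ties going to
# whichever value appears first in x.
mode_first_seen <- function(x) {
  # Fixing the factor levels to the order of first appearance stops
  # table() from sorting the names.
  counts <- table(factor(x, levels = unique(x)))
  # which.max() returns the first maximum, i.e. the earliest-seen spelling.
  names(counts)[which.max(counts)]
}

mode_first_seen(c("Peter Wright", "Peter wright"))         # "Peter Wright"
mode_first_seen(c("john snow", "John Snow", "John Snow"))  # "John Snow"
```

Under that assumption, the only change to the code above would be Modes = sapply(groups, mode_first_seen). Whether "first seen" is the right tie-break depends on the data.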

You may need to experiment to find the right value for the max.distance argument of agrep.
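
One way to experiment is to wrap the grouping loop in a function and print the groups for a few candidate thresholds. This is only a sketch: the wrapper name group_names and the candidate values 0.1, 0.25 and 0.4 are assumptions, and the union() guard is an addition that keeps the loop from dropping data if agrep ever returned no match.

```r
Names <- c("John Snow", "Peter Wright", "john snow",
           "John snow", "Peter wright", "J. Snow", "John Snow")

# Hypothetical wrapper around the grouping loop from the answer.
group_names <- function(x, max.distance) {
  groups <- list()
  while (length(x) > 0) {
    id <- agrep(x[1], x, ignore.case = TRUE, max.distance = max.distance)
    # Always keep x[1] in its own group so every pass removes at least
    # one element and no name is silently lost.
    id <- union(1L, id)
    groups[[length(groups) + 1]] <- x[id]
    x <- x[-id]
  }
  groups
}

# Print the grouping produced by each candidate threshold.
for (d in c(0.1, 0.25, 0.4)) {
  cat("max.distance =", d, "\n")
  print(group_names(Names, d))
}
```

A value that is too small tends to split variants of one name into separate groups, while a value that is too large tends to merge different people, so it helps to inspect the groups by eye on a sample of the real data.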

If you want to add the result to a data.frame, just add

df$StandardName = StandardName

To write the result to a file that can be opened in Excel, use

write.csv(df, "MyData.csv")
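
The answer does not show where df comes from. A minimal end-to-end sketch, assuming the names live in a column called A (as in the question's table) and reusing the StandardName values printed above. Writing to tempdir() and passing row.names = FALSE are choices made for this example, not part of the original answer:

```r
Names <- c("John Snow", "Peter Wright", "john snow",
           "John snow", "Peter wright", "J. Snow", "John Snow")

# Values copied from the StandardName output shown in the answer.
StandardName <- c("John Snow", "Peter wright", "John Snow", "John Snow",
                  "Peter wright", "John Snow", "John Snow")

# Assumed layout: one column A holding the raw names.
df <- data.frame(A = Names, stringsAsFactors = FALSE)
df$StandardName <- StandardName

# row.names = FALSE avoids an extra unnamed index column in the file.
out_file <- file.path(tempdir(), "MyData.csv")
write.csv(df, out_file, row.names = FALSE)

# Read the file back to confirm what was written.
check <- read.csv(out_file, stringsAsFactors = FALSE)
print(check)
```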
