r - Name matching in R

Question

I have 2 datasets of names. One contains the exact names; the other contains a mix of exact and modified names.

dt_t <- data.table(Name = list("Aaron RAMSEY", "Mesut OEZIL", "Sergio AGUERO"))
dt_f <- data.table(Name = list("Özil Mesut", "Ramsey Aaron", "Kun Agüero"))

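I was thinking of building a table of values from the jarowinkler function (which computes the similarity of two strings), with dt_t in the rows and dt_f in the columns, so that each dt_f[i] can be replaced by the dt_t name with the highest jarowinkler value.

But I don't know how to do this, ideally in a concise way.

Any ideas are welcome.

Thanks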

score 0 · Accepted Answer

Here is a solution using adist:

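library(data.table)

dt_t <- data.table(Name = list("Aaron RAMSEY", "Mesut OEZIL", "Sergio AGUERO"))
dt_f <- data.table(Name = list("Özil Mesut", "Ramsey Aaron", "Kun Agüero"))

# Distance matrix: rows follow dt_t, columns follow dt_f
string_dist <- adist(dt_t$Name, dt_f$Name, partial = TRUE, ignore.case = TRUE)

# For each row (a dt_t name), index of the closest dt_f name
match_idx <- apply(string_dist, 1, which.min)

# Put each dt_t name next to its closest dt_f name
dt_match <- cbind(dt_t, dt_f[match_idx])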
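The table described in the question (exact names in the rows, modified names in the columns) can also be sketched with base R alone. This is only an illustration under stated assumptions: it uses plain character vectors with the hypothetical names exact and messy instead of list columns, and it uses adist, an edit distance where lower is better, rather than a Jaro-Winkler similarity where higher is better. It therefore picks the minimum of each column, not the maximum.

```r
# Hypothetical plain character vectors standing in for dt_t$Name and dt_f$Name
exact <- c("Aaron RAMSEY", "Mesut OEZIL", "Sergio AGUERO")
messy <- c("\u00d6zil Mesut", "Ramsey Aaron", "Kun Ag\u00fcero")  # Özil, Agüero

# Case-insensitive edit distance: rows follow exact, columns follow messy
d <- utils::adist(exact, messy, ignore.case = TRUE)
dimnames(d) <- list(exact, messy)

# For each messy name (column), the row index of the closest exact name
best <- apply(d, 2, which.min)

# Lookup table: each messy name, its proposed replacement, and the distance
lookup <- data.frame(
  messy    = messy,
  exact    = exact[best],
  distance = d[cbind(best, seq_along(messy))]
)
print(lookup)
```

Printing d shows the whole table, which helps when checking doubtful matches by hand. If a Jaro-Winkler score is preferred, a matrix from a package could probably be swapped in for d; for example, stringdist::stringdistmatrix(exact, messy, method = "jw") returns a distance, so the minimum would still mark the best match.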

Edit - - - - - - - - - - - - - - - - -

To apply it row by row:

library(data.table)

dt_t <- data.table(Name = list("Aaron RAMSEY", "Mesut OEZIL", "Sergio AGUERO"))
dt_f <- data.table(Name = list("Özil Mesut", "Ramsey Aaron", "Kun Agüero"))

# Return the element of y closest to x, as a length-1 list
minDistMatch <- function(x, y) {
  x <- as.list(x)
  y <- as.list(y)
  y[which.min(adist(x, y, partial = TRUE, ignore.case = TRUE))]
}

# One lookup per row of dt_t; the result is stored as a list column
dt_t[, Match := vapply(Name, minDistMatch, list(1L), dt_f$Name)]
