r - Relevant matching between 2 huge data sets in R, even with spelling mistakes

Question

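I have an input:

“I am a solo traveller. I just bought a round-the-world ticket to Singapore, Darwin, Perth, Adelaide, Melbourne, Brisbane, Gold Cost, Sydney Opra, Christchurch, Gold Coast Richland, Auckland, Australia and Fiji. It is a 10 month trip. I will be going on my own and I am not scared, but my friends and family seem to be against the idea. I have explained that it is safe, that I will probably meet people along the way, and that hostels are not as bad as you think. I will be staying with friends and family for at least 1/3 of the trip. I am excited, but people's pessimistic views are making me doubt the safety. I am from the UK, so it is a long way from home, and they are scared I will get into trouble. I have never been to the USA”

I have a list of places with up to 5000 rows, such as London, Singapore, Sydney, Auckland, Fiji, Gold Coast, Sydney Opera House, Australia, UK, USA, ...

Problem: extract the places from the input by matching against the list of places. The input contains spelling mistakes, so the closest match is needed. The solution also needs to be optimized.

Output: Singapore|Darwin|Perth|Adelaide|Melbourne|Brisbane|Gold Coast|Sydney Opera House|Christchurch|Auckland|Australia|Fiji|UK|USA

Methods tried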

library(RecordLinkage)  # levenshteinSim()
library(stringdist)
library(cba)            # sdists()

# Lower-case the input and replace punctuation with spaces
input <- tolower(gsub("[[:punct:]]", " ", input))
Places <- read.csv("\\Data\\Places_List.csv", row.names = NULL, header = TRUE)
Places <- as.matrix(Places)

################## Different methods tried ##########################
# Return the entries of stringVector most similar to string
ClosestMatch2 <- function(string, stringVector) {
  similarity <- levenshteinSim(string, stringVector)
  stringVector[similarity == max(similarity)]
}
ClosestMatch2(input, Places)
############### The above one doesn't work ##################

ClosestMatch <- function(string, stringVector) {
  matches <- agrep(string, stringVector, value = TRUE)
  # The method string was empty in the original post; "ow" is an assumption
  distance <- as.numeric(sdists(string, matches, method = "ow", weight = c(1, 0, 2)))
  matches[distance == min(distance)]
}
ClosestMatch(input, Places)
######## This works, but does not give proper results ###########

k <- as.matrix(sapply(input, agrep, Places))
###### This doesn't work either

agrep, pmatch and str_detect (which won't work for spelling mistakes) don't work for bigger data sets.

score 1 · Accepted Answer

ClosestMatch2 works. In addition, I added a character-count difference check and partial substring matching to catch the spelling mistakes.
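The answer does not include its code, so the following is only a rough sketch of how those three ideas could fit together, not the code actually used. It assumes the input is split into runs of one to three words before matching. The helper names (`word_ngrams`, `match_places`) and the thresholds (`max_dist`, `min_fuzzy_chars`, `min_cover`) are hypothetical, and it uses `stringdist()` from the stringdist package instead of `levenshteinSim()`.

```r
library(stringdist)

# Hypothetical helper: all runs of 1..max_words consecutive words in the text
word_ngrams <- function(text, max_words = 3) {
  clean <- tolower(gsub("[[:punct:]]", " ", text))
  words <- strsplit(clean, "\\s+")[[1]]
  words <- words[nzchar(words)]
  grams <- character(0)
  for (n in seq_len(min(max_words, length(words)))) {
    starts <- seq_len(length(words) - n + 1)
    grams <- c(grams, vapply(starts, function(i) {
      paste(words[i:(i + n - 1)], collapse = " ")
    }, character(1)))
  }
  unique(grams)
}

# Hypothetical matcher; all thresholds are illustrative assumptions
match_places <- function(text, places, max_dist = 2,
                         min_fuzzy_chars = 6, min_cover = 0.5) {
  keys  <- tolower(places)
  found <- logical(length(keys))
  for (gram in word_ngrams(text)) {
    # Exact hits are accepted at any length (covers short names like "uk")
    found <- found | keys == gram
    # Short strings are too ambiguous for fuzzy matching
    if (nchar(gram) < min_fuzzy_chars) next

    # Character-count difference as a cheap pre-filter before the edit distance
    near <- abs(nchar(keys) - nchar(gram)) <= max_dist
    if (any(near)) {
      d <- stringdist(gram, keys[near], method = "osa")
      found[near] <- found[near] | d <= max_dist
    }

    # Partial match: compare the gram with an equally long prefix of longer
    # place names, provided it covers a minimum share of the name
    longer <- nchar(keys) > nchar(gram) + max_dist &
      nchar(gram) >= min_cover * nchar(keys)
    if (any(longer)) {
      prefix <- substr(keys[longer], 1, nchar(gram))
      d <- stringdist(gram, prefix, method = "osa")
      found[longer] <- found[longer] | d <= max_dist
    }
  }
  places[found]
}

places <- c("London", "Singapore", "Sydney", "Auckland", "Fiji", "Gold Coast",
            "Sydney Opera House", "Australia", "UK", "USA")
text <- paste("I fly to singapore, gold cost, sydney opra and fiji.",
              "I am from UK. I have never been to USA.")
match_places(text, places)
```

On this shortened input, "gold cost" is picked up by the full-string comparison and "sydney opra" by the prefix comparison against "Sydney Opera House". Note that "Sydney" is returned as well, because the single word "sydney" is an exact hit. Getting exactly the output in the question would need an extra longest-match-wins rule, and the thresholds would need tuning against the real 5000-row list.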
