
I have a list of addresses. They were entered by different users, so the same address is written in many different ways. For example:

"andheri at weh pump house", "andheri pump house","andheri pump house(mt)","weh andheri pump house","weh andheri pump house et","weh, nr. pump house" 

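The vector above has 6 addresses, and almost all of them refer to the same place. I am trying to find matches between these addresses so that I can group them together and recode them.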

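I have tried agrep and the stringdist package. With agrep, I am not sure whether I should take each address as a pattern and match it against the rest. With the stringdist package I did the following: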

library(stringdist)
nsrpatt <- df$Address
x <- scan(what=character(), text = nsrpatt, sep=",")
x <- x[trimws(x)!= ""]
y <- ave(x, phonetic(x), FUN = function(.x) .x[1])

The above gave me this message:

In phonetic(x) : soundex encountered 111 non-printable ASCII or non-ASCII
  characters. 

I am not sure whether I should remove these elements from the character vector or convert them to some other format.

With agrep I tried:

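npat <- vector("list", length(nsrpatt))  # one result per address
for (i in seq_along(nsrpatt)) {
  # approximate matches of address i among all addresses
  npat[[i]] <- agrep(nsrpatt[i], df$Address, max.distance = 1, value = TRUE)
}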

The character vector has about 25,000 elements; the loop keeps running and brings the machine to a halt.

How can I efficiently find the closest match for each address?


1 Answer


You could run a small cluster analysis on your data.

x <- c("wall street", "Wall-street", "Wall ST", "andheri pump house", 
       "weh, nr. pump house", "Wallstreet", "weh andheri pump house", 
       "Wall Street", "weh andheri pump house et", "andheri at weh pump house", 
       "andheri pump house(mt)")

First, you need a distance matrix.

# Levenshtein distance
e  <- adist(na.omit(tolower(x)))
rownames(e) <- na.omit(x)
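
Before computing distances it may help to normalise the strings first. This speaks to the non-ASCII message in the question and, because `adist` builds an n x n matrix (25,000 addresses would mean 625 million cells), deduplicating after cleaning keeps the matrix smaller. The following is only a sketch: `normalise_address` is a hypothetical helper, the sample addresses are made up, and UTF-8 input is assumed.

```r
# Hypothetical helper (not from any package): lower-case, drop non-ASCII
# bytes, turn punctuation into spaces, and collapse repeated whitespace.
normalise_address <- function(a) {
  a <- iconv(a, from = "UTF-8", to = "ASCII", sub = "")  # assumes UTF-8 input
  a <- tolower(a)
  a <- gsub("[[:punct:]]+", " ", a)
  trimws(gsub("\\s+", " ", a))
}

# Made-up sample; the last entry contains a non-ASCII character.
addr  <- c("Andheri Pump House(MT)", "weh,  nr. pump house",
           "andheri pump house (mt)", "caf\u00e9 road")
clean <- unique(normalise_address(addr))  # duplicates collapse after cleaning

e_small <- adist(clean)                   # one row/column per distinct string
rownames(e_small) <- clean
```

Whether dropping non-ASCII characters is acceptable depends on the data; transliterating them instead is another option.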

Then you can run the cluster analysis.

hc <- hclust(as.dist(e))  # find distance clusters

Work out a suitable cut point, e.g. graphically, and "cut the tree".

plot(hc)


# cut the tree at a specific height, i.e. get cluster codes for similar strings
smly <- cutree(hc, h=16)
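
If reading the height off the plot is impractical, one possible heuristic (an assumption here, not part of the approach above) is to cut in the middle of the largest jump between successive merge heights. A minimal sketch on the same example vector; `heights`, `i`, `cut_at` and `groups` are illustrative names:

```r
x <- c("wall street", "Wall-street", "Wall ST", "andheri pump house",
       "weh, nr. pump house", "Wallstreet", "weh andheri pump house",
       "Wall Street", "weh andheri pump house et", "andheri at weh pump house",
       "andheri pump house(mt)")
e <- adist(tolower(x))
rownames(e) <- x
hc <- hclust(as.dist(e))

heights <- hc$height                    # merge heights in increasing order
i       <- which.max(diff(heights))     # position of the largest jump
cut_at  <- mean(heights[c(i, i + 1)])   # assumed heuristic: midpoint of that jump
groups  <- cutree(hc, h = cut_at)
table(groups)                           # cluster sizes, for a quick sanity check
```

The resulting groups should still be checked by eye, since a large jump does not guarantee that every cluster is a single real address.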

Then you can build a key data frame, in which you can check whether the matches are correct.

key <- data.frame(x=na.omit(x), 
                  smly=factor(smly, labels=c("Wall Street", "Andheri Pump House")),
                  row.names=NULL)  # key data frame
key
#                            x               smly
# 1                wall street        Wall Street
# 2                Wall-street        Wall Street
# 3                    Wall ST        Wall Street
# 4         andheri pump house Andheri Pump House
# 5        weh, nr. pump house Andheri Pump House
# 6                 Wallstreet        Wall Street
# 7     weh andheri pump house Andheri Pump House
# 8                Wall Street        Wall Street
# 9  weh andheri pump house et Andheri Pump House
# 10 andheri at weh pump house Andheri Pump House
# 11    andheri pump house(mt) Andheri Pump House

Finally, replace your vector like this:

x <- key$smly
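
When the addresses live in a data frame column, as with `df$Address` in the question, one option is to look each original value up in the key rather than overwrite the vector. A small sketch with made-up data; the column name `Address_clean` is hypothetical:

```r
# Toy key and data frame (made up for illustration).
key <- data.frame(
  x    = c("wall street", "Wall ST", "andheri pump house", "weh andheri pump house"),
  smly = c("Wall Street", "Wall Street", "Andheri Pump House", "Andheri Pump House"),
  stringsAsFactors = FALSE
)
df <- data.frame(
  Address = c("Wall ST", "weh andheri pump house", "wall street", "Wall ST"),
  stringsAsFactors = FALSE
)

# match() returns, for each address, the row of the key that holds it.
df$Address_clean <- key$smly[match(df$Address, key$x)]
df
```

Addresses missing from the key come back as `NA`, which makes unmatched entries easy to spot.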
Answered 2019-12-31T09:28:18.047