r - How to match strings with a tolerance of one character?

Question

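I have a vector of locations that I am trying to disambiguate against a vector of correct location names. For this example I am using just two disambiguated locations: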

agrepl('Au', c("Austin, TX", "Houston, TX"), 
max.distance =  .000000001, 
ignore.case = T, fixed = T)
[1] TRUE TRUE

The help page says that max.distance is

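Maximum distance allowed for a match. Expressed either as integer, or as a fraction of the pattern length times the maximal transformation cost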

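I am not sure about the mathematical meaning of the Levenshtein distance; my understanding is that the smaller the distance, the stricter the tolerance for mismatches against my vector of disambiguated strings.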

So how would I adjust it to retrieve two FALSE? Basically, I would like to get TRUE only when there is a difference of one character, as in:

agrepl('Austn, TX', "Austin, TX", 
max.distance =  .000000001, ignore.case = T, fixed = T)
[1] TRUE
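
One detail that may explain the TRUE results above: the agrep help page states that a fractional max.distance is replaced by the smallest integer not less than the fraction of the pattern length times the maximal transformation cost. The arithmetic can be sketched as below, assuming the default cost of 1 per edit. The helper name effective_max_distance is hypothetical and only mirrors the documented rounding rule; it is not a function R provides or calls internally.

```r
# Hypothetical helper: illustrates the rounding rule described in ?agrep.
# Assumes a maximal transformation cost of 1 (the default costs).
effective_max_distance <- function(pattern, fraction, max_cost = 1) {
  ceiling(fraction * nchar(pattern) * max_cost)
}

# A tiny fraction of a 2-character pattern still rounds up to 1 edit.
effective_max_distance("Au", 0.000000001)

# 0.15 of a 9-character pattern is 1.35, which rounds up to 2 edits.
effective_max_distance("Austn, TX", 0.15)
```

If this reading of the help page is right, any positive fraction below 1 still permits at least one edit, so shrinking the fraction alone would not be enough to turn the 'Au' matches into FALSE.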

score 1 · Accepted Answer

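The problem you are having is possibly similar to the one I faced when I started experimenting here. When fixed=FALSE the first argument is a regex pattern (with fixed=TRUE it is matched literally), and in either case small patterns are very permissive if they are not constrained to match the full string. The help page even has a "Note" about that issue: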

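Since someone who read the description carelessly even filed a bug report on it, do note that this matches substrings of each element of x (just as grep does) and not whole elements.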

With regex patterns you can do this by flanking the pattern string with "^" and "$", since unlike adist, agrepl has no partial parameter:

> agrepl('^Au$', "Austin, TX", 
+ max.distance =  c(insertions=.15),  ignore.case = T, fixed=FALSE)
[1] FALSE
> agrepl('^Austn, TX$', "Austin, TX", 
+ max.distance =  c(insertions=.15),  ignore.case = T, fixed=FALSE)
[1] TRUE
> agrepl('^Austn, T$', "Austin, TX", 
+ max.distance =  c(insertions=.15),  ignore.case = T, fixed=FALSE)
[1] FALSE

So you need to paste0 these flanking anchors onto the pattern:

> agrepl( paste0('^', 'Austn, Tx', '$'), "Austin, TX", 
+ max.distance =  c(insertions=.15),  ignore.case = T, fixed=FALSE)
[1] TRUE
> agrepl( paste0('^', 'Au', '$'), "Austin, TX", 
+ max.distance =  c(insertions=.15),  ignore.case = T, fixed=FALSE)
[1] FALSE

It would probably be better to use all rather than only insertions, and you may want to lower the fraction.
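
Since the goal is a whole-string comparison that tolerates a single edit, another option may be to compute the edit distance with adist, which compares complete strings by default, and then threshold it. This is a minimal sketch rather than part of the original answer; the helper name within_edits and the candidates vector are illustrative assumptions.

```r
# Illustrative data taken from the question.
candidates <- c("Austin, TX", "Houston, TX")

# Hypothetical helper: TRUE where the whole candidate string is within
# max_edits insertions, deletions or substitutions of the query.
within_edits <- function(query, candidates, max_edits = 1L) {
  # adist() returns a matrix with one row per query and one column per candidate
  d <- adist(query, candidates, ignore.case = TRUE)
  as.vector(d) <= max_edits
}

within_edits("Au", candidates)         # FALSE FALSE
within_edits("Austn, TX", candidates)  # TRUE FALSE
```

Because the threshold here is an integer number of edits, there is no fraction to tune, which sidesteps the rounding behaviour of max.distance.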
