python - String similarity where the order and difference of ASCII codes matter

Question

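Does anyone know of a string similarity method that gives the correct results for the cases below? I am working with alphanumeric IDs, where:

Changes in the first half of the string matter more than changes in the second half. I suppose I could use n-grams? Although that might break down when one of the strings has a prefix?
The size of a character substitution matters: changing "a" to "b" is a smaller change than changing "a" to "c".

Levenshtein and Jaro-Winkler do not seem to do the right thing.

See the examples below.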

import jellyfish
t1="100"
t21=["100a","a100"] # case 1. expecting: similar, not similar
t22=["101","105","200"] # case 2. expecting: similar, less similar, least similar

fun = jellyfish.levenshtein_distance
print([fun(t1, t) for t in t21]) # all the same
print([fun(t1, t) for t in t22]) # all the same

fun = jellyfish.jaro_winkler_similarity # named jaro_winkler in older jellyfish releases
print([fun(t1, t) for t in t21]) # all the same
print([fun(t1, t) for t in t22]) # all the same
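
As a rough sketch of what a measure meeting both requirements could look like, the block below is a Levenshtein-style dynamic program in which every edit is scaled by a position weight and a substitution is additionally scaled by the code-point gap between the two characters. The function name `weighted_distance` and the parameters `decay` and `max_gap` are hypothetical, not part of jellyfish or any other library. The default `decay=0.3` is an assumption chosen only so that a five-step change in the third character still costs less than a one-step change in the first character.

```python
def weighted_distance(s, t, decay=0.3, max_gap=25):
    """Edit distance where early positions and large code-point gaps cost more.

    `decay` and `max_gap` are illustrative knobs (assumptions), not tuned values.
    Lower results mean more similar strings.
    """
    def weight(i):
        # 1.0 at index 0, shrinking geometrically for later positions.
        return decay ** i

    def gap(a, b):
        # Substitution cost in [0, 1], proportional to the code-point difference.
        return min(abs(ord(a) - ord(b)), max_gap) / max_gap

    n, m = len(s), len(t)
    # d[i][j] holds the cheapest cost of turning s[:i] into t[:j].
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + weight(i - 1)
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + weight(j - 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = weight(min(i, j) - 1) * gap(s[i - 1], t[j - 1])
            d[i][j] = min(
                d[i - 1][j] + weight(i - 1),   # delete s[i-1]
                d[i][j - 1] + weight(j - 1),   # insert t[j-1]
                d[i - 1][j - 1] + substitute,  # substitute (free if equal)
            )
    return d[n][m]


t1 = "100"
print([weighted_distance(t1, t) for t in ["100a", "a100"]])        # first is smaller
print([weighted_distance(t1, t) for t in ["101", "105", "200"]])  # strictly increasing
```

With these defaults the two cases rank in the expected order, but the ranking depends on `decay`: a value close to 1 would let a large change late in the string outweigh a small change early in it.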

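To add to the fun, the first string can carry a prefix that is essentially irrelevant to the string as an ID, but it confuses the string similarity.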

t1="pre-100"
t21=["100a","a100"] # expecting: similar, not similar
t22=["101","105","200"] # expecting: similar, less similar, least similar

fun = jellyfish.levenshtein_distance
print([fun(t1, t) for t in t21]) # picks the wrong one
print([fun(t1, t) for t in t22]) # all the same

fun = jellyfish.jaro_winkler_similarity # named jaro_winkler in older jellyfish releases
print([fun(t1, t) for t in t21]) # picks the wrong one
print([fun(t1, t) for t in t22]) # picks the right one
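
If the irrelevant prefix is always set off by a known separator, one option is to normalize the IDs before comparing them. This is a minimal sketch under that assumption: the helper name `strip_prefix` and the `"-"` separator are hypothetical, and the question does not say that real prefixes are delimited this way.

```python
def strip_prefix(identifier, sep="-"):
    """Return the part after the last `sep`, or the whole string if `sep` is absent.

    Assumes (hypothetically) that any irrelevant prefix ends with `sep`.
    """
    return identifier.rpartition(sep)[2]


print(strip_prefix("pre-100"))  # 100
print(strip_prefix("100"))      # 100
```

After this step "pre-100" compares exactly like "100", so the prefixed example reduces to the first one. It does not help when the prefix has no reliable separator.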
