I am trying to use SequenceMatcher.ratio() to measure the similarity of two strings: "86418648" and "86488648".

>>> from difflib import SequenceMatcher
>>> SequenceMatcher(None, "86418648", "86488648").ratio()
0.5

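The returned ratio is 0.5, which is much lower than I expected, since the two strings differ in only one character.

The ratio appears to be computed from the matching blocks, so I tried running SequenceMatcher.get_matching_blocks():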

>>> SequenceMatcher(None,"86418648","86488648").get_matching_blocks()
[Match(a=4, b=0, size=4), Match(a=8, b=8, size=0)]

But I expected the result to be:

[Match(a=0, b=0, size=3), Match(a=4, b=4, size=4), Match(a=8, b=8, size=0)]

Can anyone explain why it does not match the first 3 digits, "864"?

1 Answer

SequenceMatcher.get_matching_blocks() works by repeatedly applying SequenceMatcher.find_longest_match() to the as yet unmatched blocks of the two sequences.

Quoting from the docstring of find_longest_match():
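To make that repeated application concrete, here is a rough sketch of the idea. It is a simplified re-implementation for illustration only, not the CPython source: the name `matching_blocks_sketch` is made up, the real method works iteratively with a queue, and it also merges adjacent blocks, which this sketch skips.

```python
from difflib import SequenceMatcher


def matching_blocks_sketch(a, b):
    """Hypothetical helper: mimic get_matching_blocks() by recursion."""
    sm = SequenceMatcher(None, a, b)
    blocks = []

    def recurse(alo, ahi, blo, bhi):
        # Longest match inside the current unmatched window.
        i, j, k = sm.find_longest_match(alo, ahi, blo, bhi)
        if k == 0:
            return
        # Only the regions strictly left and strictly right of the match
        # are searched again, which keeps i and j increasing together.
        recurse(alo, i, blo, j)
        blocks.append((i, j, k))
        recurse(i + k, ahi, j + k, bhi)

    recurse(0, len(a), 0, len(b))
    blocks.append((len(a), len(b), 0))  # sentinel, as in the real method
    return blocks


print(matching_blocks_sketch("86418648", "86488648"))
# [(4, 0, 4), (8, 8, 0)]
```

Once a match is fixed, everything to its left in a can only pair with what is to its left in b, and likewise on the right.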

Return (i,j,k) such that a[i:i+k] is equal to b[j:j+k], where
    alo <= i <= i+k <= ahi
    blo <= j <= j+k <= bhi
and for all (i',j',k') meeting those conditions,
    k >= k'
    i <= i'
    and if i == i', j <= j'

In other words, of all maximal matching blocks, return one that
starts earliest in a, and of all those maximal matching blocks that
start earliest in a, return the one that starts earliest in b.

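For the two sequences a = "86418648" and b = "86488648", the longest block in a that matches a block in b is the single 8648 at a[4]. That block occurs twice in b, and the match chosen is the earlier of the two, at b[0].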
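This can be checked directly. The snippet below is a small sketch that calls find_longest_match() on the full ranges and then lists every position where the same block starts in b; the variable names are only illustrative.

```python
from difflib import SequenceMatcher

a, b = "86418648", "86488648"
sm = SequenceMatcher(None, a, b)

# Search all of a against all of b.
longest = sm.find_longest_match(0, len(a), 0, len(b))
print(longest)  # Match(a=4, b=0, size=4)

# The matched block, and every position in b where it starts.
block = a[longest.a:longest.a + longest.size]
starts = [j for j in range(len(b) - len(block) + 1) if b.startswith(block, j)]
print(block, starts)  # 8648 [0, 4]
```

Both b[0] and b[4] are candidates, and the tie-break in the docstring picks the one that starts earliest in b.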

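Once this match has been chosen, no further matches are possible: what remains is a[0:4] against the empty b[0:0] on the left, and the empty a[8:8] against b[4:8] on the right. This follows from the guarantee given by SequenceMatcher.get_matching_blocks() that "the triples are monotonically increasing in i and j".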

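For example, matching the as yet unmatched 864 at a[0] with the as yet unmatched 864 at b[4] would require i to decrease as j increases (and vice versa), which violates the guarantee above.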
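This also accounts for the 0.5. The difflib documentation defines the ratio as 2.0*M / T, where M is the number of matched elements and T is the total number of elements in both sequences. A quick check, with the last line being plain arithmetic for the blocks the question expected rather than anything SequenceMatcher returns:

```python
from difflib import SequenceMatcher

a, b = "86418648", "86488648"
sm = SequenceMatcher(None, a, b)

# M: total size of the matching blocks; T: combined length of both strings.
matched = sum(block.size for block in sm.get_matching_blocks())
total = len(a) + len(b)
print(matched, total)         # 4 16
print(2.0 * matched / total)  # 0.5
print(sm.ratio())             # 0.5

# Arithmetic only: the blocks of size 3 and 4 expected in the question.
expected_matched = 3 + 4
print(2.0 * expected_matched / total)  # 0.875
```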

Answered 2018-01-10T15:59:51.360