python - difflib.SequenceMatcher isjunk argument not considered?

Question

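In the Python difflib library, is the SequenceMatcher class behaving unexpectedly, or am I misreading what the intended behavior is?

Why does the isjunk argument seem to make no difference in this case?

difflib.SequenceMatcher(None, "AA", "A A").ratio() returns 0.8

difflib.SequenceMatcher(lambda x: x in ' ', "AA", "A A").ratio() returns 0.8

My understanding is that if the space is ignored, the ratio should be 1.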

score 2 · Accepted Answer

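This happens because ratio() uses the total length of both sequences when it computes the ratio, and it does not apply isjunk to that length. So as long as the matching blocks add up to the same number of matches with and without isjunk, the ratio is the same.

I assume isjunk does not filter the sequences themselves for performance reasons.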

def ratio(self):   
    """Return a measure of the sequences' similarity (float in [0,1]).

    Where T is the total number of elements in both sequences, and
    M is the number of matches, this is 2.0*M / T.
    """

    matches = sum(triple[-1] for triple in self.get_matching_blocks())
    return _calculate_ratio(matches, len(self.a) + len(self.b))

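self.a and self.b are the strings (sequences) passed to the SequenceMatcher object ("AA" and "A A" in your example). The isjunk function lambda x: x in ' ' is only used when determining the matching blocks. Your example is simple enough that both calls produce the same matching blocks and therefore the same ratio.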
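To make that point concrete, here is a minimal sketch that inspects a matcher built with a junk function. The name is_space is only an illustrative helper, not part of difflib; a, b and bjunk are attributes of SequenceMatcher.

```python
import difflib

def is_space(ch):
    # Illustrative junk predicate: treat a single space as junk.
    return ch == ' '

sm = difflib.SequenceMatcher(is_space, "AA", "A A")

# The stored sequences are untouched: the space is still part of b.
print(repr(sm.a), repr(sm.b))

# isjunk only marks which elements of b count as junk during matching.
print(sm.bjunk)

# ratio() divides by this total, which still counts the junk character.
total = len(sm.a) + len(sm.b)
print(total)
```

So the space is flagged as junk, but it still counts toward T.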

difflib.SequenceMatcher(None, "AA", "A A").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=1), Match(a=2, b=3, size=0)]

difflib.SequenceMatcher(lambda x: x == ' ', "AA", "A A").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=1), Match(a=2, b=3, size=0)]

Same matching blocks, so the ratio is: M = 2, T = 5 => ratio = 2.0 * 2 / 5 = 0.8

Now consider the following example:

difflib.SequenceMatcher(None, "AA ", "A A").get_matching_blocks()
[Match(a=1, b=0, size=2), Match(a=3, b=3, size=0)]

difflib.SequenceMatcher(lambda x: x == ' ', "AA ", "A A").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=1), Match(a=3, b=3, size=0)]

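Now the matching blocks are different, but the ratio is the same because the number of matches is still equal:

When isjunk is None: M = 2, T = 6 => ratio = 2.0 * 2 / 6

When isjunk is lambda x: x == ' ': M = 1 + 1, T = 6 => ratio = 2.0 * 2 / 6

Finally, a case with a different number of matches: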

difflib.SequenceMatcher(None, "AA ", "A A ").get_matching_blocks()
[Match(a=1, b=0, size=2), Match(a=3, b=4, size=0)]

difflib.SequenceMatcher(lambda x: x == ' ', "AA ", "A A ").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=2), Match(a=3, b=4, size=0)]

The number of matches is different:

When isjunk is None: M = 2, T = 7 => ratio = 2.0 * 2 / 7

When isjunk is lambda x: x == ' ': M = 1 + 2, T = 7 => ratio = 2.0 * 3 / 7
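The three cases above can be checked side by side with a small sketch. match_stats and is_space are hypothetical helper names introduced here for illustration; the sketch recomputes M and T by hand and prints them next to what ratio() returns.

```python
import difflib

def is_space(ch):
    # Illustrative junk predicate: treat a single space as junk.
    return ch == ' '

def match_stats(a, b, isjunk=None):
    # Return (M, T, ratio) for the pair, using the given junk predicate.
    sm = difflib.SequenceMatcher(isjunk, a, b)
    m = sum(block.size for block in sm.get_matching_blocks())
    t = len(a) + len(b)  # junk elements are still counted here
    return m, t, sm.ratio()

for a, b in [("AA", "A A"), ("AA ", "A A"), ("AA ", "A A ")]:
    print(repr(a), repr(b),
          "no junk:", match_stats(a, b),
          "space as junk:", match_stats(a, b, is_space))
```

Only the last pair should show a different M between the two calls, which is the only case where isjunk changes the ratio.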

score 0 · Accepted Answer

You can remove the characters from the strings before matching them:

import difflib

def withoutJunk(text, chars):
    # Delete every character listed in chars from text.
    return text.translate(str.maketrans('', '', chars))

a = withoutJunk('AA', ' ')
b = withoutJunk('A A', ' ')
difflib.SequenceMatcher(None, a, b).ratio()
# -> 1.0
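If you would rather reuse an isjunk-style predicate than a string of characters, the same idea can be sketched as below. ratio_without_junk is a hypothetical helper, not a difflib function; it filters both sequences first, so the junk elements no longer count toward the total length.

```python
import difflib

def ratio_without_junk(a, b, isjunk):
    # Hypothetical helper: drop junk elements from both sequences,
    # then compare what is left with no junk function at all.
    a_clean = [x for x in a if not isjunk(x)]
    b_clean = [x for x in b if not isjunk(x)]
    return difflib.SequenceMatcher(None, a_clean, b_clean).ratio()

result = ratio_without_junk("AA", "A A", str.isspace)
print(result)
```

This gives the ratio the question expected, at the cost of an extra pass over each sequence.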
