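I'm fairly new to Python and I'm trying to use fuzzywuzzy for fuzzy matching. I believe the partial_ratio function is giving me an incorrect match score. Here is my exploratory code: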

>>> from fuzzywuzzy import fuzz
>>> fuzz.partial_ratio('Subject: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clearance','Barbil')
50

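I believe this should return a score of 100, since the second string, 'Barbil', is contained in the first. When I remove a few characters from the end or the beginning of the first string, I do get a match score of 100.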

>>> fuzz.partial_ratio('Subject: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clear','Barbil')
100
>>> fuzz.partial_ratio('ect: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clearance','Orissa')
100

The score seems to flip from 50 to 100 once the length of the first string drops to 199. Does anyone know what might be happening?

1 Answer

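This happens because Python's SequenceMatcher enables its automatic junk heuristic when one of the strings is 200 characters or longer: characters that occur very frequently in the long string are then ignored as match starting points. Passing autojunk=False turns the heuristic off. This code should work for you: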

from difflib import SequenceMatcher

def partial_ratio(s1, s2):
    """Return the ratio of the most similar substring
    as a number between 0 and 100."""

    if len(s1) <= len(s2):
        shorter = s1
        longer = s2
    else:
        shorter = s2
        longer = s1

    m = SequenceMatcher(None, shorter, longer, autojunk=False)
    blocks = m.get_matching_blocks()

    # each block represents a sequence of matching characters in a string
    # of the form (idx_1, idx_2, len)
    # the best partial match will block align with at least one of those blocks
    #   e.g. shorter = "abcd", longer = "XXXbcdeEEE"
    #   block = (1,3,3)
    #   best score === ratio("abcd", "Xbcd")
    scores = []
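    for (short_start, long_start, _) in blocks:
        # shift the window left so the matching block lines up with `shorter`
        window_start = max(long_start - short_start, 0)
        long_end = window_start + len(shorter)
        long_substr = longer[window_start:long_end]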

        m2 = SequenceMatcher(None, shorter, long_substr, autojunk=False)
        r = m2.ratio()
        if r > .995:
            return 100
        else:
            scores.append(r)

    return max(scores) * 100.0
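
To see the 200-character threshold in isolation, here is a minimal sketch using only difflib. The helper names (popular_chars, longest_block) and the synthetic "ab" strings are made up for illustration; bpopular is the set of characters SequenceMatcher has set aside under the heuristic. The last call uses the subject line from the question and only checks the autojunk=False case.

```python
from difflib import SequenceMatcher


def popular_chars(longer, autojunk=True):
    """Return the characters of `longer` that the heuristic set aside as 'popular'."""
    # `longer` is passed as the second sequence, the one the heuristic inspects.
    matcher = SequenceMatcher(None, "ab", longer, autojunk=autojunk)
    return matcher.bpopular


def longest_block(needle, haystack, autojunk):
    """Return the size of the longest contiguous block shared by both strings."""
    matcher = SequenceMatcher(None, needle, haystack, autojunk=autojunk)
    return matcher.find_longest_match(0, len(needle), 0, len(haystack)).size


text_199 = "ab" * 99 + "a"  # 199 characters: below the threshold
text_200 = "ab" * 100       # 200 characters: the heuristic applies

print(popular_chars(text_199))                  # set()
print(popular_chars(text_200))                  # {'a', 'b'} (set order may vary)
print(popular_chars(text_200, autojunk=False))  # set()

# The subject line from the question, split across literals for readability.
subject = (
    "Subject: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical "
    "Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., "
    "Tehsil Barbil, Distt, Keonjhar, Orissa environmental clearance"
)

# With the heuristic off, the whole word is found as one matching block.
print(longest_block("Barbil", subject, autojunk=False))  # 6
```

With autojunk=False nothing is set aside, so a substring that really occurs in the longer string is always found in full, which is what partial_ratio needs in order to return 100.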
answered 2016-09-29T02:20:31.197