python - How does the python difflib.get_close_matches() function work?

Question

The following are two arrays:

import difflib
import scipy
import numpy

a1=numpy.array(['198.129.254.73','134.55.221.58','134.55.219.121','134.55.41.41','198.124.252.101'], dtype='|S15')
b1=numpy.array(['198.124.252.102','134.55.41.41','134.55.219.121','134.55.219.137','134.55.220.45', '198.124.252.130'],dtype='|S15')

difflib.get_close_matches(a1[-1],b1,2)

output:

['198.124.252.130', '198.124.252.102']

Shouldn't '198.124.252.102' be the closest match for '198.124.252.101'?

I looked at the documentation, which mentions a floating-point cutoff score but says nothing about the algorithm used.

I need to check whether the last octets of two addresses differ by exactly 1 (provided the first three octets are the same).

So I am finding the closest string first and then checking that closest string for the above condition.

Is there any other function or way to achieve this? Also how does get_close_matches() behave?

ipaddr doesn't seem to offer this kind of manipulation for IPs.

score 7 · Accepted Answer

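Well, there is this part in the docs that explains your issue:

This does not yield minimal edit sequences, but does tend to yield matches that "look right" to people.

To get the results you expect, you could use the Levenshtein distance.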
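As a rough illustration of the Levenshtein suggestion, here is a minimal pure-Python sketch of the classic dynamic-programming edit distance. The function name `levenshtein` and the variable names are made up for this example and are not part of difflib or any library; third-party packages offer faster implementations.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] = distance between the prefix of a seen so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

target = '198.124.252.101'
candidates = ['198.124.252.102', '134.55.41.41', '134.55.219.121',
              '134.55.219.137', '134.55.220.45', '198.124.252.130']

# Pick the candidate with the fewest edits relative to the target.
closest = min(candidates, key=lambda c: levenshtein(target, c))
print(closest)
```

With this metric '198.124.252.102' is one substitution away from the target, while '198.124.252.130' needs two, so the former is picked.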

But for comparing IPs, I would suggest using integer comparison:

>>> parts = [int(s) for s in '198.124.252.130'.split('.')]
>>> parts2 = [int(s) for s in '198.124.252.101'.split('.')]
>>> from operator import sub
>>> diff = sum(d * 10**(3-pos) for pos,d in enumerate(map(sub, parts, parts2)))
>>> diff
29

You can create a comparison function in this style:

from functools import cmp_to_key, partial
from operator import sub

def compare_ips(base, ip1, ip2):
    # Weighted per-octet distance from base; earlier octets weigh more.
    base = [int(s) for s in base.split('.')]
    parts1 = (int(s) for s in ip1.split('.'))
    parts2 = (int(s) for s in ip2.split('.'))
    test1 = sum(abs(d * 10**(3-pos)) for pos,d in enumerate(map(sub, base, parts1)))
    test2 = sum(abs(d * 10**(3-pos)) for pos,d in enumerate(map(sub, base, parts2)))
    # cmp() was removed in Python 3; this expression yields the same -1/0/1.
    return (test1 > test2) - (test1 < test2)

base = '198.124.252.101'
test_list = ['198.124.252.102','134.55.41.41','134.55.219.121',
             '134.55.219.137','134.55.220.45', '198.124.252.130']
# sorted() lost its cmp= argument in Python 3; cmp_to_key adapts the comparator.
sorted(test_list, key=cmp_to_key(partial(compare_ips, base)))
# yields:
# ['198.124.252.102', '198.124.252.130', '134.55.219.121', '134.55.219.137', 
#  '134.55.220.45', '134.55.41.41']

score 2 · Accepted Answer

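Some hints from difflib:

SequenceMatcher is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980's by Ratcliff and Obershelp under the hyperbolic name "gestalt pattern matching." The basic idea is to find the longest contiguous matching subsequence that contains no "junk" elements (R-O doesn't address junk). The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that "look right" to people.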
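To see how that description applies to the addresses in the question, the sketch below scores the two top candidates with SequenceMatcher directly, setting the sequences the same way round as get_close_matches() appears to (the word as the second sequence). The variable names are illustrative only. ratio() is documented as 2.0*M/T, where M is the number of matched characters and T is the combined length of both strings.

```python
import difflib

word = '198.124.252.101'
candidates = ['198.124.252.102', '198.124.252.130']

matcher = difflib.SequenceMatcher()
matcher.set_seq2(word)

scores = {}
for candidate in candidates:
    matcher.set_seq1(candidate)
    # Show the matching blocks that were found and the resulting 2.0*M/T score.
    scores[candidate] = matcher.ratio()
    print(candidate, matcher.get_matching_blocks(), scores[candidate])

# Ask for the two best matches, as in the question.
top = difflib.get_close_matches(word, candidates, 2)
print(top)
```

Both candidates end up with 14 matched characters out of 30 in total: '198.124.252.10' for the first, and '198.124.252.1' plus a stray '0' for the second. They therefore tie at a ratio of 28/30. In CPython's implementation the results seem to be ranked as (score, string) pairs, so a tie falls to the string that compares larger, which would explain why '198.124.252.130' is listed first.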

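Regarding your requirement to compare IPs based on custom logic: you should first validate that the strings are proper IPs. Writing the comparison logic with simple integer arithmetic should then be an easy task that meets your requirement. No library is needed at all.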
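A minimal sketch of that validate-then-compare idea follows. It assumes Python 3 and, for the validation step only, leans on the standard-library `ipaddress` module, which is a choice made for this example rather than something the answer prescribes. The helper name `last_octets_adjacent` is hypothetical.

```python
import ipaddress

def last_octets_adjacent(ip1, ip2):
    """True if both IPv4 addresses share their first three octets and
    their last octets differ by exactly 1."""
    # IPv4Address raises ValueError for malformed input, which validates it.
    a = int(ipaddress.IPv4Address(ip1))
    b = int(ipaddress.IPv4Address(ip2))
    # Dropping the low 8 bits leaves the first three octets.
    same_prefix = (a >> 8) == (b >> 8)
    # With equal prefixes, a - b is just the difference of the last octets.
    return same_prefix and abs(a - b) == 1

base = '198.124.252.101'
candidates = ['198.124.252.102', '134.55.41.41', '134.55.219.121',
              '134.55.219.137', '134.55.220.45', '198.124.252.130']

neighbours = [ip for ip in candidates if last_octets_adjacent(base, ip)]
print(neighbours)
```

Pass plain `str` values here. The arrays in the question hold byte strings (dtype '|S15'), which would need decoding first.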

score 1 · Accepted Answer

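difflib mentions:

The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980's by Ratcliff and Obershelp under the hyperbolic name "gestalt pattern matching."

As for what this might mean, the Wikipedia page on "Gestalt Pattern Matching" may provide some answers. That page also mentions difflib: its "Applications" section says a few things about the Python library and its implementation.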

https://en.wikipedia.org/wiki/Gestalt_Pattern_Matching
