In difflib.get_close_matches(word, possibilities[, n][, cutoff]), what is cutoff for? How does it affect word matching?
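2 Answers
From the documentation:
Optional argument cutoff (default 0.6) is a float in the range [0, 1]. Possibilities that don't score at least that similar to word are ignored.
Try the example from the documentation: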
In [11]: import difflib
In [12]: difflib.get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
Out[12]: ['apple', 'ape']
In [13]: difflib.get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'], cutoff=0.1)
Out[13]: ['apple', 'ape', 'puppy']
In [14]: difflib.get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'], cutoff=0.9)
Out[14]: []
For details of the algorithm, see the article "Pattern Matching: The Gestalt Approach".
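To see why each cutoff gives those results, it can help to print the similarity score of every candidate. This is a small sketch that assumes get_close_matches scores candidates with SequenceMatcher.ratio(), with the candidate as the first sequence and the word as the second; the variable names are only for illustration.

```python
from difflib import SequenceMatcher

word = "appel"
candidates = ["ape", "apple", "peach", "puppy"]

# Score each candidate against the word; ratio() returns a float in [0, 1].
scores = {c: SequenceMatcher(None, c, word).ratio() for c in candidates}

# Print the candidates from most to least similar.
for candidate, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{candidate}: {score:.2f}")
```

Only 'apple' (0.80) and 'ape' (0.75) clear the default cutoff of 0.6, nothing reaches 0.9, and with cutoff=0.1 all four candidates pass but the default n=3 keeps just the top three.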
I came across the same question and found that difflib.get_close_matches builds on the approach known as "Gestalt pattern matching", described by Ratcliff and Obershelp (link below).
The function difflib.get_close_matches is based on the class SequenceMatcher, whose source code says: "SequenceMatcher is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980's by Ratcliff and Obershelp under the hyperbolic name "gestalt pattern matching". The basic idea is to find the longest contiguous matching subsequence that contains no "junk" elements (R-O doesn't address junk). The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that "look right" to people."
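The recursive idea in that quote can be sketched in a few lines. This is a simplified illustration, not the actual difflib implementation: it ignores junk handling, disables the autojunk heuristic, and the helper names gestalt_match_count and gestalt_ratio are made up for this example. It reuses SequenceMatcher.find_longest_match for the "longest contiguous match" step.

```python
from difflib import SequenceMatcher


def gestalt_match_count(a, b):
    """Count matched elements by splitting around the longest common block."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)

    def count(alo, ahi, blo, bhi):
        # Longest contiguous block shared by a[alo:ahi] and b[blo:bhi].
        m = matcher.find_longest_match(alo, ahi, blo, bhi)
        if m.size == 0:
            return 0
        # Recurse on the pieces to the left and to the right of the block.
        left = count(alo, m.a, blo, m.b)
        right = count(m.a + m.size, ahi, m.b + m.size, bhi)
        return left + m.size + right

    return count(0, len(a), 0, len(b))


def gestalt_ratio(a, b):
    """Similarity in [0, 1]: twice the matched elements over the total length."""
    total = len(a) + len(b)
    return 2.0 * gestalt_match_count(a, b) / total if total else 1.0


print(gestalt_ratio("apple", "appel"))
```

For 'apple' and 'appel' the longest block is 'app'; the recursion then finds one more matching letter on the right, giving 4 matched characters out of 10 in total.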
About cutoff: it sets how similar a candidate must be to count as a match. At 1, a candidate must be exactly the same word; as the value goes down, matching becomes more relaxed. At 0, for instance, every candidate passes, so you get back the highest-scoring words (up to n) even when nothing is really similar, which rarely makes sense. The default of 0.6 tends to give meaningful results, but the right value depends on your use case, so test what works for your vocabulary and scenario.
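The two extremes are easy to check directly. A quick sketch, using a made-up vocabulary and a query ('xyz') that shares no letters with any entry:

```python
from difflib import get_close_matches

vocabulary = ["apple", "ape", "peach", "puppy"]

# cutoff=1.0: only an identical word qualifies.
exact_only = get_close_matches("apple", vocabulary, cutoff=1.0)
no_exact = get_close_matches("appel", vocabulary, cutoff=1.0)

# cutoff=0.0: every candidate qualifies, even with a score of 0.
anything = get_close_matches("xyz", vocabulary, n=1, cutoff=0.0)

print(exact_only, no_exact, anything)
```

With cutoff=1.0 the misspelled 'appel' finds nothing, while cutoff=0.0 still returns a word for 'xyz' although no candidate resembles it.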
PATTERN MATCHING: THE GESTALT APPROACH http://collaboration.cmc.ec.gc.ca/science/rpn/biblio/ddj/Website/articles/DDJ/1988/8807/8807c/8807c.htm
Hope this helps you to understand "difflib.get_close_matches" better.