python - Fuzzy string matching with difflib get_matching_blocks does not detect all substrings

Question

I am trying to find all occurrences of a word in a paragraph, and I want it to handle spelling mistakes as well. Code:

from difflib import SequenceMatcher

to_search = "caterpillar"
search_here = "caterpillar are awesome animal catterpillar who like other humans but not other caterpilar"
# search_here has the word caterpillar repeated but with spelling mistakes

s = SequenceMatcher(None, to_search, search_here).get_matching_blocks()
print(s)

#Output  : [Match(a=0, b=0, size=11), Match(a=3, b=69, size=0)] 
#Expected: [Match(a=0, b=0, size=11), Match(a=0, b=32, size=11), Match(a=0, b=81, size=11)]

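difflib's get_matching_blocks only detects the first instance of "caterpillar" in the search_here string. I want it to output all closely matching blocks, i.e. it should identify "caterpillar", "catterpillar" and "caterpilar".

How can I solve this?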

score 0 · Accepted Answer

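You can compute the edit distance between each word and to_search. Then you can select all words with a "low enough" edit distance (a distance of 0 means an exact match).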
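As a minimal sketch of that idea, assuming words are separated by whitespace: the helper `levenshtein` and the threshold `max_distance` below are hypothetical names written for illustration and are not part of any library.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


to_search = "caterpillar"
search_here = "caterpillar are awesome animal catterpillar who like other humans but not other caterpilar"
max_distance = 2  # assumed threshold; tune it for your data

# Keep every whitespace-separated word that is close enough to to_search.
close_words = [w for w in search_here.split()
               if levenshtein(to_search, w) <= max_distance]
print(close_words)
```

A fixed distance threshold is simple, but it treats short and long words the same way, so a normalized score may work better for mixed word lengths.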

Thanks to your question, I found out that there is a pip-installable edit_distance Python module. Here are a few examples from my first time trying it out:

>>> import edit_distance
>>> edit_distance.SequenceMatcher('fabulous', 'fibulous').ratio()
0.875
>>> edit_distance.SequenceMatcher('fabulous', 'wonderful').ratio()
0.11764705882352941
>>> edit_distance.SequenceMatcher('fabulous', 'fabulous').ratio()
1.0
>>> edit_distance.SequenceMatcher('fabulous', '').ratio()
0.0
>>> edit_distance.SequenceMatcher('caterpillar', 'caterpilar').ratio()
0.9523809523809523

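So it looks like the ratio method gives you a number between 0 and 1 (inclusive), where 1 is an exact match and 0 is... not even in the same league XD. So yes, you could select the words whose ratio is greater than 1 - epsilon, where epsilon might be 0.1 or so.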
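Here is a rough sketch of that selection step. To keep it dependency-free it swaps in the standard library's difflib.SequenceMatcher, whose ratio() is also a 0 to 1 similarity score but is not guaranteed to give the same numbers as the edit_distance module. The names `epsilon` and `matches` are only illustrative, and splitting on whitespace via a regex is an assumption about how words are delimited.

```python
import re
from difflib import SequenceMatcher

to_search = "caterpillar"
search_here = "caterpillar are awesome animal catterpillar who like other humans but not other caterpilar"
epsilon = 0.1  # assumed tolerance; tune it for your data

matches = []
# re.finditer yields each whitespace-delimited word together with its offset,
# so every hit can be reported with its position in search_here.
for m in re.finditer(r"\S+", search_here):
    ratio = SequenceMatcher(None, to_search, m.group()).ratio()
    if ratio > 1 - epsilon:
        matches.append((m.start(), m.group(), ratio))

for start, word, ratio in matches:
    print(start, word, round(ratio, 3))
```

Comparing word by word avoids the original problem: get_matching_blocks aligns the two sequences once, so the short string can only be matched against one region of the long one.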
