python - How to match the text contained in one variable against another variable

Question

So, suppose I have this code:

x = 'My name is James Bond'
y = 'My name is James Bond and I am an MI-6 agent stationed in London, UK'
from difflib import SequenceMatcher as sm
sm(None, x, y).ratio()

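Now, the ratio returned is 0.47191011235955055, which is fair.

My problem is that x is contained entirely in y, so I would like a high match score in that case. Put another way, I am basically looking for a kind of plagiarism detection.

Update: to be more specific, in the example above I expect a 100% match because x is contained entirely in y. However, not every case will be this clear-cut.

Another example: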

x = "My name is James Herbert Bond"

Here x has one extra word, so a matching method should give me a less-than-perfect match percentage (say 90%), because only one word in x, "Herbert", does not appear in y.

score 0 · Accepted Answer

Sum the lengths of the non-overlapping matching subsequences and divide by the length of the first sequence.

from difflib import SequenceMatcher
x = 'My name is James Bond'
y = 'My name is James Bond and I am an MI-6 agent stationed in London, UK'
# Total matched characters divided by len(x): the share of x that is found in y
ratio = sum(block.size for block in SequenceMatcher(None, x, y).get_matching_blocks()) / len(x)
print(ratio)

This gives an output of 1.0.
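The same calculation can be wrapped in a small helper so it can be reused for other inputs, such as the "Herbert" example from the question. This is only a sketch: the name `containment_ratio` and the choice to return 0.0 for an empty string are assumptions, not part of the original answer.

```python
from difflib import SequenceMatcher


def containment_ratio(needle, haystack):
    """Fraction of `needle` characters covered by matching blocks in `haystack`."""
    if not needle:
        return 0.0  # assumption: an empty needle counts as no match
    matcher = SequenceMatcher(None, needle, haystack)
    # The final block returned is a zero-size sentinel, so it adds nothing to the sum.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(needle)


y = 'My name is James Bond and I am an MI-6 agent stationed in London, UK'
full = containment_ratio('My name is James Bond', y)
partial = containment_ratio('My name is James Herbert Bond', y)
print(full, partial)
```

Note that the matching blocks may be scattered across y, so a high score means most of x's characters appear in y in the same order, not necessarily as one contiguous run.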

score 0 · Accepted Answer

I suggest you take a look at the partial_ratio method in the fuzzywuzzy module.

>>> x = 'My name is James Bond'
>>> y = 'My name is James Bond and I am an MI-6 agent stationed in London, UK'
>>> 
>>> from fuzzywuzzy import fuzz
>>> fuzz.partial_ratio(x, y)
100
>>> 
>>> x = "My name is James Herbert Bond"
>>> fuzz.partial_ratio(x, y)
72
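If a character-level score for the "Herbert" case is lower than you would like, one option is to compare word by word instead. The following is a minimal sketch using only difflib, not fuzzywuzzy; the helper name `word_containment` and the plain whitespace tokenization are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def word_containment(needle, haystack):
    """Fraction of the words in `needle` that are matched, in order, in `haystack`."""
    needle_words = needle.split()      # assumption: simple whitespace tokenization
    haystack_words = haystack.split()
    if not needle_words:
        return 0.0  # assumption: an empty needle counts as no match
    # SequenceMatcher accepts any sequences of hashable items, including lists of words.
    matcher = SequenceMatcher(None, needle_words, haystack_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(needle_words)


y = 'My name is James Bond and I am an MI-6 agent stationed in London, UK'
exact = word_containment('My name is James Bond', y)
herbert = word_containment('My name is James Herbert Bond', y)
print(exact, herbert)
```

With these inputs, five of the six words in the "Herbert" sentence are matched, so the second score is about 0.83. Punctuation stays attached to words here (for example "London,"), so real text would likely need some normalization first.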
