python - Difflib SequenceMatcher with sentences

Question

I have the following dataframe:

Column1         Column2
tomato fruit    tomatoes are not a fruit
potato la best  potatoe are some sort of fruit
apple           there are great benefits to appel
pear            peer

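I want to look up the word/phrase on the left in the sentence on the right. If at most the first two words match (e.g. 'potato la', leaving out 'best'), it should give a score.

I have tried two different approaches: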

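from difflib import SequenceMatcher as SM
for i in range(len(Column1)):
    # matching blocks between the i-th entries of the two columns
    store_it = SM(None, Column1[i], Column2[i]).get_matching_blocks()
    print(store_it)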

and

df['diff'] = df.apply(lambda x: diff.SequenceMatcher(None, x[0].strip(), x[1].strip()).ratio(), axis=1)

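which I found on the internet.

The second one works fine, except that it tries to match the whole phrase. How can I match the words in the first column against the sentence in the second column, so that in the end I get a "yes", they are in the sentence (or partly), or a "no", they are not?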

score 1 · Accepted Answer

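I have had the most success on this with FuzzyWuzzy's partial ratio. It gives you the partial-match percentage between Column1 'tomato fruit' and Column2 'tomatoes are not a fruit', and likewise for the rest of the rows. See the results: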

from fuzzywuzzy import fuzz
import difflib

df['fuzz_partial_ratio'] = df.apply(lambda x: fuzz.partial_ratio(x['Column1'], x['Column2']), axis=1)

df['sequence_ratio'] = df.apply(lambda x: difflib.SequenceMatcher(None, x['Column1'], x['Column2']).ratio(), axis=1)

You can treat any FuzzyWuzzy score > 60 as a good partial match, i.e. yes, the words in Column1 most likely appear in the sentence in Column2.

Row 1 - 67, row 2 - 71, row 3 - 80, row 4 - 75
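The idea behind a partial ratio can be sketched with the standard library alone: slide a window the length of the shorter string across the longer one and keep the best SequenceMatcher ratio. This is only an illustration under that assumption, not FuzzyWuzzy's actual implementation, so its scores may differ from fuzz.partial_ratio. The name sliding_partial_ratio is a hypothetical helper, not part of any library.

```python
from difflib import SequenceMatcher


def sliding_partial_ratio(a, b):
    """Best 0-100 similarity between the shorter string and any
    same-length window of the longer one (hypothetical helper)."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    if not short:
        # nothing to compare against
        return 0
    best = 0.0
    for start in range(len(long_) - len(short) + 1):
        window = long_[start:start + len(short)]
        best = max(best, SequenceMatcher(None, short, window).ratio())
    return round(100 * best)


print(sliding_partial_ratio("pear", "peer"))
print(sliding_partial_ratio("fruit", "tomatoes are not a fruit"))
```

A phrase that occurs verbatim inside the sentence scores 100, which is why this kind of score suits "is it in the sentence" questions better than a ratio over the two whole strings.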

score 0 · Accepted Answer

Use set():

Python » Documentation
issubset(other)
set <= other
Test whether every element in the set is in other.

For example:

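# split into words; set() on a raw string would give a set of characters
c_set1 = set(Column1[i].split())
c_set2 = set(Column2[i].split())
if c_set1.issubset(c_set2):
    # every word in c_set1 is in c_set2
    print("yes")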
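Exact set membership will miss near-misses such as 'appel' for 'apple'. A minimal sketch of a fuzzy, word-level variant that looks only at the first two words of Column1, assuming whitespace tokenization and an arbitrarily chosen cutoff of 0.7: it uses difflib.get_close_matches from the standard library, and first_words_score is a hypothetical helper name.

```python
import difflib


def first_words_score(phrase, sentence, max_words=2, cutoff=0.7):
    """Fraction of the first `max_words` words of `phrase` that have a
    close match among the words of `sentence` (hypothetical helper)."""
    words = phrase.lower().split()[:max_words]
    candidates = sentence.lower().split()
    if not words:
        return 0.0
    # a word counts as found if any sentence word is similar enough
    hits = sum(
        1 for w in words
        if difflib.get_close_matches(w, candidates, n=1, cutoff=cutoff)
    )
    return hits / len(words)


rows = [
    ("tomato fruit", "tomatoes are not a fruit"),
    ("potato la best", "potatoe are some sort of fruit"),
    ("apple", "there are great benefits to appel"),
    ("pear", "peer"),
]
for left, right in rows:
    score = first_words_score(left, right)
    print(left, "->", score, "yes" if score > 0 else "no")
```

Whether a partial hit such as 'potato' without 'la' should count as "yes" is a design choice; the `score > 0` test above is one option, and requiring `score == 1.0` is the stricter one.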
