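I have 2 datasets. Using the data from df1, I want to identify duplicate rows in df2 based on 4 conditions.
- Conditions:
A row of the 'Name' column in df1 matches any row of the 'Name' column in df2 by more than 80%
(AND)
(df1['Class'] == df2['Class'] (OR) df1['Amt $'] == df2['Amt $'])
(AND)
A row of the 'Category' column in df1 matches any row of the 'Category' column in df2 by more than 80%
- Result:
If all conditions are met, keep only the new data from df2 and delete the other rows.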
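One way to make "matches by more than 80%" concrete is the ratio returned by `difflib.SequenceMatcher`. The snippet below is only a minimal sketch: the helper name `match_ratio` is hypothetical, and lower-casing both strings is an assumption, since the conditions do not say whether case should matter.

```python
from difflib import SequenceMatcher


def match_ratio(a, b):
    """Similarity in [0, 1]: 2 * matched characters / total characters."""
    # Lower-casing is an assumption; the conditions do not say whether case matters.
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


# Pairs taken from the sample data in this question.
for left, right in [("Cat", "Cats"), ("Fruit", "fruits"), ("Apple", "Apple is Red")]:
    print(f"{left!r} vs {right!r}: {match_ratio(left, right):.2f}")
```

Under this measure "Cat" vs "Cats" and "Fruit" vs "fruits" score above 0.8, while "Apple" vs "Apple is Red" scores about 0.59, so a whole-string ratio may be stricter than intended when one name merely contains the other.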
df1
Name    Class  Amt $  Category
Apple   1      5      Fruit
Banana  2      8      Fruit
Cat     3      4      Animal
df2
Index  Name              Class  Amt $  Category
1      Apple is Red      1      5      Fruit
2      Banana            2      8      fruits
3      Cat is cute       3      4      animals
4      Green Apple       1      5      fruis
5      Banana is Yellow  2      8      fruet
6      Cat               3      4      anemal
7      Apple             1      5      anemal
8      Ripe Banana       2      8      frut
9      Royal Gala Apple  1      5      Fruit
10     Cats              3      4      animol
11     Green Banana      2      8      Fruit
12     Green Apple       1      5      fruits
13     White Cat         3      4      Animal
14     Banana is sweet   2      8      appel
15     Apple is Red      1      5      fruits
16     Ginger Cat        3      4      fruits
17     Cat house         3      4      animals
18     Royal Gala Apple  1      5      fret
19     Banana is Yellow  2      8      fruit market
20     Cat is cute       3      4      anemal
- Code I tried:
from difflib import SequenceMatcher

# Each loop walks one column on its own, so every value is compared
# with every other value rather than row by row.
for i in df1['Name']:
    for u in df2['Name']:
        for k in df1['Class']:
            for l in df2['Class']:
                for m in df1['Amt $']:
                    for n in df2['Amt $']:
                        for o in df1['Category']:
                            for p in df2['Category']:
                                if SequenceMatcher(None, i, u).ratio() > 0.8 and k == l and m == n and SequenceMatcher(None, o, p).ratio() > 0.8:
                                    print(i, u)
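The nested loops above iterate over each column independently, so a name from one row can be paired with a class or category from a different row. A row-wise comparison might look like the sketch below. It assumes pandas is available, that a df2 row is kept when it meets all conditions against at least one df1 row, and that comparison is case-insensitive; the names `match_ratio` and `matching_rows` are hypothetical. With a whole-string `SequenceMatcher` ratio and a 0.8 threshold it does not reproduce the desired output in this question (for example, "Apple is Red" scores about 0.59 against "Apple"), so treat it as a starting point only.

```python
from difflib import SequenceMatcher

import pandas as pd


def match_ratio(a, b):
    # Lower-casing is an assumption; the question does not say whether case matters.
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def matching_rows(df1, df2, threshold=0.8):
    """Return the df2 rows that meet all conditions against at least one df1 row."""
    keep = []
    for _, r2 in df2.iterrows():
        keep.append(any(
            match_ratio(r1["Name"], r2["Name"]) > threshold
            and (r1["Class"] == r2["Class"] or r1["Amt $"] == r2["Amt $"])
            and match_ratio(r1["Category"], r2["Category"]) > threshold
            for _, r1 in df1.iterrows()
        ))
    return df2.loc[keep]


# df1 as given in the question; df2 is a four-row excerpt (rows 2, 6, 10, 16).
df1 = pd.DataFrame({
    "Name": ["Apple", "Banana", "Cat"],
    "Class": [1, 2, 3],
    "Amt $": [5, 8, 4],
    "Category": ["Fruit", "Fruit", "Animal"],
})
df2 = pd.DataFrame({
    "Name": ["Banana", "Cat", "Cats", "Ginger Cat"],
    "Class": [2, 3, 3, 3],
    "Amt $": [8, 4, 4, 4],
    "Category": ["fruits", "anemal", "animol", "fruits"],
})

result = matching_rows(df1, df2)
print(result["Name"].tolist())  # ['Banana', 'Cat', 'Cats']
```

This compares every df2 row with every df1 row, which is fine for data of this size but grows with the product of the two row counts.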
The desired output DataFrame should look like this:
Name              Class  Amt $  Category
Apple is Red      1      5      Fruit
Banana            2      8      fruits
Cat is cute       3      4      animals
Green Apple       1      5      fruis
Banana is Yellow  2      8      fruet
Cat               3      4      anemal
Ripe Banana       2      8      frut
Royal Gala Apple  1      5      Fruit
Cats              3      4      animol
Green Banana      2      8      Fruit
Green Apple       1      5      fruits
White Cat         3      4      Animal
Apple is Red      1      5      fruits
Cat house         3      4      animals
Banana is Yellow  2      8      fruit market
Cat is cute       3      4      anemal
Please help me find the best solution! :)