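I want to check whether the values in a column of one dataframe occur in a column of a second dataframe. If a value is present, it should be added to a new column in the same row of the second dataframe. All values are strings. The two dataframes differ in size, and the second one has about 700,000 records. The dataframes I have: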

DF1

THINGS
book+pen
CAR 
chair
laptop

DF2

Description
I want a new book.
I will pen down this things 
A quick ride in my new car.
Cars are awesome.
My laptop's memory is bad.
Maybe try sitting on that CHAIR.

The output I want adds an "Updated" column:

Description                        Updated
I want a new book.                 book
I will pen down this things        pen
A quick ride in my new car.        car
Cars are awesome.                  car
My laptop's memory is bad.         laptop
Maybe try sitting on that CHAIR.   chair
Search for that book in my laptop. book+laptop

I have already tried a brute-force approach, but it takes too long to process. Thanks in advance!

1 Answer

Please try this.

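  1. Use str.split and explode to first get a tidy list of the strings to match from df1.
  2. Create two all-lowercase columns for case-insensitive matching.
  3. Use str.findall to retrieve the strings that match between the two dataframes.
  4. Strip the brackets and quotes using str.strip.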

Code:

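import re  # needed for re.escape below

# Split 'book+pen' into one row per term
df1 = df1.assign(THINGS=df1['THINGS'].str.split('+')).explode('THINGS')
# Trim stray whitespace (e.g. 'CAR ') and lowercase for case-insensitive matching
df1['THINGS2'] = df1.THINGS.str.strip().str.lower()
df2['Description2'] = df2.Description.str.lower()
# Escape each term so it is matched literally, then OR the terms together
df2['Updated'] = df2.Description2.str.findall('|'.join(map(re.escape, df1.THINGS2)))
# Turn "['book']" into "book" by stripping brackets and quotes
df2['Updated'] = df2.Updated.astype(str).str.strip("[]'")
del df2['Description2']
print(df2)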

Prints:

                        Description Updated
0                I want a new book.    book
1      I will pen down this things      pen
2       A quick ride in my new car.     car
3                 Cars are awesome.     car
4           My laptops hangs a lot.  laptop
5  Maybe try sitting on that CHAIR.   chair
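
For rows that match more than one term, the question asks for a value such as book+laptop. The following is a minimal, self-contained sketch of one way to get that, assuming the sample data from the question plus the extra "Search for that book in my laptop." row from the desired output. The names terms, pattern and join_matches are illustrative only and do not come from the original post.

```python
import re
import pandas as pd

# Sample data copied from the question, plus the multi-match row
# that appears only in the question's desired output.
df1 = pd.DataFrame({"THINGS": ["book+pen", "CAR ", "chair", "laptop"]})
df2 = pd.DataFrame({"Description": [
    "I want a new book.",
    "I will pen down this things ",
    "A quick ride in my new car.",
    "Cars are awesome.",
    "My laptop's memory is bad.",
    "Maybe try sitting on that CHAIR.",
    "Search for that book in my laptop.",
]})

# One search term per element, trimmed and lowercased; empty strings are dropped.
terms = df1["THINGS"].str.split("+").explode().str.strip().str.lower()
terms = [t for t in terms.dropna().unique() if t]

# Longer terms first so a shorter term never shadows a longer one in the alternation.
pattern = "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))


def join_matches(found):
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return "+".join(dict.fromkeys(found))


matches = df2["Description"].str.lower().str.findall(pattern)
df2["Updated"] = matches.apply(join_matches)
print(df2)
```

Like the code above, this matches substrings, so "pen" would also be found inside "open". If whole-word matching is wanted, wrapping the pattern in \b word boundaries is one option, although "Cars" would then no longer match "car".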
Answered 2020-12-08T14:40:22.187