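Colleagues,

Perhaps you can help me with a seemingly simple task that I don't yet have enough experience to solve.

Suppose we have two DataFrames:

  1. df1 contains substrings;
  2. df2 contains longer blocks of text, some of which contain substrings from df1.
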
import pandas as pd

df1 = {'subst': ['LONDON BRIDGE', 'TRUE GRIT', 'FIVE TIMES FIVE', 'THREE TIME DEAD', 'TRUE IS NOT', 'OH NO', 'LEBRON JAMES']}

df2 = {'strng': ['LEBRON JAMES SCORED 20', 'THREE TIMES DEAD JOHNY WAS HELL OF THE COOK', 'TRUE IS NOT WHAT YOU THINK', 'FIVE TIMES FIVE IS NOT WHAT LEBRON SCORED']}

df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)

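Here is what I need:

  1. Iterate over the rows to check whether a substring from df1['subst'] occurs anywhere in df2['strng'].
  2. If it does, add a new column ['match_df1'] to df2 containing the matching substring value from df1.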

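The final output in df2 should look like this:

strng                                          match_df1
LEBRON JAMES SCORED 20                         LEBRON JAMES
THREE TIMES DEAD JOHNY WAS HELL OF THE COOK    THREE TIME DEAD
TRUE IS NOT WHAT YOU THINK                     TRUE IS NOT
FIVE TIMES FIVE IS NOT WHAT LEBRON SCORED      FIVE TIMES FIVE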
1 Answer

As @Chris noted, this answer may do the job.
Then filter out the empty strings like this:

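>>> for ind1 in df1.index:
...     # Collect every df2 text that contains this substring literally (regex=False)
...     matches = df2.loc[df2['strng'].str.contains(df1.loc[ind1, 'subst'], regex=False), 'strng']
...     df1.loc[ind1, 'strng'] = ', '.join(matches)
>>> df1[df1['strng'].str.len() > 0]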
    subst                strng
2   FIVE TIMES FIVE      FIVE TIMES FIVE IS NOT WHAT LEBRON SCORED
4   TRUE IS NOT          TRUE IS NOT WHAT YOU THINK
6   LEBRON JAMES         LEBRON JAMES SCORED 20
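
Only three rows come back because 'THREE TIME DEAD', as spelled in df1, is not a literal substring of the df2 text that starts with 'THREE TIMES DEAD'. A quick check in plain Python, using only the strings from the question (the variable names here are just for illustration):

```python
# Strings copied from the question's df1 and df2
subst = 'THREE TIME DEAD'
text = 'THREE TIMES DEAD JOHNY WAS HELL OF THE COOK'

# Plain substring test: 'TIME DEAD' never occurs because the text has 'TIMES DEAD'
is_match = subst in text
print(is_match)
```

If that row is expected to match, the spelling in df1 probably needs to be corrected first; exact substring matching will not bridge the difference.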

Full code:

import pandas as pd

df1 = {'subst': ['LONDON BRIDGE', 'TRUE GRIT', 'FIVE TIMES FIVE', 'THREE TIME DEAD', 'TRUE IS NOT', 'OH NO', 'LEBRON JAMES']}
df2 = {'strng': ['LEBRON JAMES SCORED 20', 'THREE TIMES DEAD JOHNY WAS HELL OF THE COOK', 'TRUE IS NOT WHAT YOU THINK', 'FIVE TIMES FIVE IS NOT WHAT LEBRON SCORED']}

df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)

for ind1 in df1.index:
    # Collect every df2 text that contains this substring literally (regex=False)
    matches = df2.loc[df2['strng'].str.contains(df1.loc[ind1, 'subst'], regex=False), 'strng']
    df1.loc[ind1, 'strng'] = ', '.join(matches)

# Keep only the substrings that matched at least one text
print(df1[df1['strng'].str.len() > 0])
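
The loop above attaches the matching texts to df1, whereas the question asks for a match_df1 column on df2. A minimal sketch of that direction, assuming each text should receive only the first df1 substring found in it and that matching is literal (this swaps in str.extract with an escaped alternation pattern; it is not the approach used above):

```python
import re
import pandas as pd

df1 = pd.DataFrame({'subst': ['LONDON BRIDGE', 'TRUE GRIT', 'FIVE TIMES FIVE', 'THREE TIME DEAD',
                              'TRUE IS NOT', 'OH NO', 'LEBRON JAMES']})
df2 = pd.DataFrame({'strng': ['LEBRON JAMES SCORED 20',
                              'THREE TIMES DEAD JOHNY WAS HELL OF THE COOK',
                              'TRUE IS NOT WHAT YOU THINK',
                              'FIVE TIMES FIVE IS NOT WHAT LEBRON SCORED']})

# One capture group with all substrings as alternatives;
# re.escape keeps any regex metacharacters in the substrings literal.
pattern = '(' + '|'.join(re.escape(s) for s in df1['subst']) + ')'

# str.extract returns the first match per row, or NaN when nothing matches.
df2['match_df1'] = df2['strng'].str.extract(pattern, expand=False)
print(df2)
```

With the data as given, the 'THREE TIMES DEAD ...' row gets NaN, because df1 spells its entry 'THREE TIME DEAD'.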
Answered 2021-09-21T13:50:54.917