
My df looks like this:

sprint   sprint_created
------   -----------
S100     2020-01-01    
S101     2020-01-10
NULL     2020-01-20
NULL     2020-01-31
S101     2020-01-10
...

In the df above, you can see that some sprint values are NULL.

I have another DataFrame, df2, with the date range of each sprint:

sprint   sprint_start   sprint_end
------   -----------    ----------
S100     2020-01-01     2020-01-09    
S101     2020-01-10     2020-01-19  
S102     2020-01-20     2020-01-29  
S103     2020-01-30     2020-02-09  
S104     2020-02-10     2020-02-19  
...

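How can I map these two frames and fill in the NULL values in df by comparing its dates against the ranges in df2?

Note that df and df2 have different shapes.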


2 Answers


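I am assuming the repeated sprint rows in df (the first dataframe) are duplicates that can be dropped. If that is not the case, please let me know. From comparing the two dataframes you provided, I used merge_asof with a tolerance of one day. If your dates can be further apart than that, please let me know.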

# df2 is the sprint range frame from the question; both date columns must be datetimes and sorted
df.assign(sprint=pd.merge_asof(df.drop_duplicates(keep='first'), df2,
                               left_on="sprint_created", right_on="sprint_start",
                               tolerance=pd.Timedelta("1D"))['sprint_y']).dropna()

  sprint sprint_created
0   S100     2020-01-01
1   S101     2020-01-10
2   S102     2020-01-20
3   S103     2020-01-31
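
For anyone who wants to reproduce this, here is a minimal, self-contained sketch of the merge_asof step. The frames are rebuilt from the sample rows in the question (with the duplicate row left out, which is an assumption of this sketch), the range frame is called df2 as in the question, and the names `matched` and `result` are only illustrative. It assumes both date columns are datetimes and already sorted, which merge_asof requires.

```python
import pandas as pd

# Sample rows from the question; the duplicate row is omitted (assumption).
df = pd.DataFrame({
    "sprint": ["S100", "S101", None, None],
    "sprint_created": pd.to_datetime(
        ["2020-01-01", "2020-01-10", "2020-01-20", "2020-01-31"]),
})
df2 = pd.DataFrame({
    "sprint": ["S100", "S101", "S102", "S103", "S104"],
    "sprint_start": pd.to_datetime(
        ["2020-01-01", "2020-01-10", "2020-01-20", "2020-01-30", "2020-02-10"]),
    "sprint_end": pd.to_datetime(
        ["2020-01-09", "2020-01-19", "2020-01-29", "2020-02-09", "2020-02-19"]),
})

# For each sprint_created, pick the latest sprint_start that is not after it
# and at most one day earlier.
matched = pd.merge_asof(
    df, df2,
    left_on="sprint_created", right_on="sprint_start",
    tolerance=pd.Timedelta("1D"),
)

# The shared column name gets the default suffixes: sprint_x (df), sprint_y (df2).
# fillna keeps the values df already had and fills only the gaps.
result = df.assign(sprint=df["sprint"].fillna(matched["sprint_y"]))
print(result)
```

Using fillna instead of overwriting the column means sprint values that were already present in df are left untouched.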

If your frame legitimately contains the same sprint more than once, as mentioned in the comments, please try:

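# df2 is the sprint range frame from the question
g = df.assign(sprint=pd.merge_asof(df.drop_duplicates(keep='first'), df2,
                                   left_on="sprint_created", right_on="sprint_start",
                                   tolerance=pd.Timedelta("1D"))['sprint_y'])
# rows that did not get a match reuse the sprint of an earlier row with the same date
g.loc[g.sprint.isna(), 'sprint'] = g.groupby('sprint_created').sprint.ffill()
print(g)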



  sprint sprint_created
0   S100     2020-01-01
1   S101     2020-01-10
2   S102     2020-01-20
3   S103     2020-01-31
4   S101     2020-01-10
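
The index-alignment trick above depends on the row order of df. A more explicit variant, sketched here with the sample rows from the question and the range frame named df2, sorts a copy by sprint_created, runs merge_asof on every row, and carries the original index along so the matches can be aligned back. The helper column name `orig_index` and the names `left`, `matched` and `filled` are made up for this sketch.

```python
import pandas as pd

# Sample rows from the question, including the repeated S101 row.
df = pd.DataFrame({
    "sprint": ["S100", "S101", None, None, "S101"],
    "sprint_created": pd.to_datetime(
        ["2020-01-01", "2020-01-10", "2020-01-20", "2020-01-31", "2020-01-10"]),
})
df2 = pd.DataFrame({
    "sprint": ["S100", "S101", "S102", "S103", "S104"],
    "sprint_start": pd.to_datetime(
        ["2020-01-01", "2020-01-10", "2020-01-20", "2020-01-30", "2020-02-10"]),
})

# merge_asof needs sorted keys, so sort a copy and remember each row's
# original index label in a regular column.
left = (df.rename_axis("orig_index").reset_index()
          .sort_values("sprint_created", kind="stable"))

matched = pd.merge_asof(
    left, df2,
    left_on="sprint_created", right_on="sprint_start",
    tolerance=pd.Timedelta("1D"),
).set_index("orig_index")

# fillna aligns on the index labels, so the sorted order does not matter here.
filled = df.assign(sprint=df["sprint"].fillna(matched["sprint_y"]))
print(filled)
```

One thing to keep in mind with the one-day tolerance: a date such as 2020-01-25 is five days after the S102 start, so it would stay unmatched even though it lies inside that sprint's range. If that matters for your data, dropping the tolerance and checking the match against sprint_end may be a better fit.
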
answered 2021-01-08T00:09:57.343

One approach is to melt and resample your df2, then create a dictionary to map back to df1 (df1 here is the frame the question calls df):

#make sure columns are in datetime format
df1['sprint_created'] = pd.to_datetime(df1['sprint_created'])
df2['sprint_start'] = pd.to_datetime(df2['sprint_start'])
df2['sprint_end'] = pd.to_datetime(df2['sprint_end'])

#melt dataframe of the two date columns and resample by group
new = (df2.melt(id_vars='sprint').drop('variable', axis=1).set_index('value')
          .groupby('sprint', group_keys=False).resample('D').ffill().reset_index())

#create dictionary of date and the sprint and map back to df1
dct = dict(zip(new['value'], new['sprint']))
df1['sprint'] = df1['sprint_created'].map(dct)
#or df1['sprint'] = df1['sprint'].fillna(df1['sprint_created'].map(dct))
df1
Out[1]: 
  sprint sprint_created
0   S100     2020-01-01
1   S101     2020-01-10
2   S102     2020-01-20
3   S103     2020-01-31
4   S101     2020-01-10
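
The same date-to-sprint dictionary can also be built without melt and resample, by expanding each row of df2 into its calendar days with pd.date_range. This is a sketch of that alternative rather than the code above; the frames are rebuilt from the question's sample rows, and it assumes the sprint ranges do not overlap (with overlaps, later rows of df2 would overwrite earlier ones in the dictionary).

```python
import pandas as pd

# Sample rows from the question; df1 is the frame the question calls df.
df1 = pd.DataFrame({
    "sprint": ["S100", "S101", None, None, "S101"],
    "sprint_created": pd.to_datetime(
        ["2020-01-01", "2020-01-10", "2020-01-20", "2020-01-31", "2020-01-10"]),
})
df2 = pd.DataFrame({
    "sprint": ["S100", "S101", "S102", "S103", "S104"],
    "sprint_start": pd.to_datetime(
        ["2020-01-01", "2020-01-10", "2020-01-20", "2020-01-30", "2020-02-10"]),
    "sprint_end": pd.to_datetime(
        ["2020-01-09", "2020-01-19", "2020-01-29", "2020-02-09", "2020-02-19"]),
})

# One dictionary entry per calendar day covered by each sprint
# (both sprint_start and sprint_end are included).
dct = {
    day: row.sprint
    for row in df2.itertuples(index=False)
    for day in pd.date_range(row.sprint_start, row.sprint_end, freq="D")
}

# Keep existing sprint values and fill only the missing ones.
df1["sprint"] = df1["sprint"].fillna(df1["sprint_created"].map(dct))
print(df1)
```

Because the dictionary holds one key per day, this stays cheap for a handful of sprints but grows with the total number of days covered.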
answered 2021-01-08T00:25:43.457