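I have two data frames, one with news and one with stock prices. Both have a date column. I want to merge them within a 5-day window.

Say my news data frame is df1 and my price data frame is df2.

My df1 looks like this: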

News_Dates             News
2018-09-29     Huge blow to ABC Corp. as they lost the 2012 tax case
2018-09-30     ABC Corp. suffers a loss
2018-10-01     ABC Corp to Sell stakes
2018-12-20     We are going to comeback strong said ABC CEO
2018-12-22     Shares are down massively for ABC Corp.

My df2 looks like this:

  Dates             Price
2018-10-04           120
2018-12-24           131

The first merge I tried was:

pd.merge_asof(df1_zscore.sort_values(by=['Dates']), df_n.sort_values(by=['News_Dates']), left_on=['Dates'], right_on=['News_Dates'],
              tolerance=pd.Timedelta('5d'), direction='backward')

The resulting df is:

  Dates       News_Dates   News                                     Price
2018-10-04    2018-10-01  ABC Corp to Sell stakes                    120
2018-12-24    2018-12-22  Shares are down massively for ABC Corp.    131

The second merge I tried was:

pd.merge_asof(df_n.sort_values(by=['News_Dates']), df1_zscore.sort_values(by=['Dates']), left_on=['News_Dates'], right_on=['Dates'],
              tolerance=pd.Timedelta('5d'), direction='forward').dropna()

The resulting df is:

News_Dates            News                                                Dates      Price
2018-09-29     Huge blow to ABC Corp. as they lost the 2012 tax case    2018-10-04    120
2018-09-30     ABC Corp. suffers a loss                                 2018-10-04    120 
2018-10-01     ABC Corp to Sell stakes                                  2018-10-04    120
2018-12-22     Shares are down massively for ABC Corp.                  2018-12-24    131

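Each merge produces its own df, but values are missing in both cases. In the first result, the October 4 price should also have been merged with the news from September 29 and September 30. In the second result, the December 24 price should also have been merged with the December 20 news.

So I can't quite figure out where I am going wrong.

PS: My goal is to merge the price df with the news df so that each price is paired with the news that appeared in the 5 days up to the price date.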

2 Answers

You can swap the left and right data frames:

df = pd.merge_asof(
        df1,
        df2,
        left_on='News_Dates',
        right_on='Dates',
        tolerance=pd.Timedelta('5D'),
        direction='nearest'
    )

df = df[['Dates', 'News_Dates', 'News', 'Price']]
print(df)

        Dates News_Dates                                               News Price
0 2018-10-04 2018-09-29  Huge blow to ABC Corp. as they lost the 2012 t... 120
1 2018-10-04 2018-09-30                           ABC Corp. suffers a loss 120
2 2018-10-04 2018-10-01                            ABC Corp to Sell stakes 120
3 2018-12-24 2018-12-20       We are going to comeback strong said ABC CEO 131
4 2018-12-24 2018-12-22            Shares are down massively for ABC Corp. 131
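
A self-contained variant of the merge above, as a sketch only. It assumes both date columns are parsed with pd.to_datetime and sorted, and it swaps direction='nearest' for direction='forward' so that a news item can only attach to a price dated on or after it. The frame names news and prices are illustrative, and whether 'forward' or 'nearest' is the right choice depends on whether news published after the price date should count.

```python
import pandas as pd

# Sample data from the question; Price is assumed to be numeric.
news = pd.DataFrame({
    "News_Dates": pd.to_datetime(
        ["2018-09-29", "2018-09-30", "2018-10-01", "2018-12-20", "2018-12-22"]
    ),
    "News": [
        "Huge blow to ABC Corp. as they lost the 2012 tax case",
        "ABC Corp. suffers a loss",
        "ABC Corp to Sell stakes",
        "We are going to comeback strong said ABC CEO",
        "Shares are down massively for ABC Corp.",
    ],
})
prices = pd.DataFrame({
    "Dates": pd.to_datetime(["2018-10-04", "2018-12-24"]),
    "Price": [120, 131],
})

# News on the left: every news row looks forward for the first price date
# that is at most 5 days later. Both frames must be sorted on their keys.
merged = pd.merge_asof(
    news.sort_values("News_Dates"),
    prices.sort_values("Dates"),
    left_on="News_Dates",
    right_on="Dates",
    tolerance=pd.Timedelta("5D"),
    direction="forward",
).dropna(subset=["Price"])  # drop news with no price inside the window

print(merged[["Dates", "News_Dates", "News", "Price"]])
```

Each news row gets at most one price row here, so a price date is repeated once for every news item that falls in its window.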
answered 2019-09-30T12:25:29.243

Here is my solution using numpy:

import numpy as np
import pandas as pd

df_n = pd.DataFrame([('2018-09-29', 'Huge blow to ABC Corp. as they lost the 2012 tax case'),
                     ('2018-09-30', 'ABC Corp. suffers a loss'),
                     ('2018-10-01', 'ABC Corp to Sell stakes'),
                     ('2018-12-20', 'We are going to comeback strong said ABC CEO'),
                     ('2018-12-22', 'Shares are down massively for ABC Corp.')],
                    columns=('News_Dates', 'News'))
df1_zscore = pd.DataFrame([('2018-10-04', '120'), ('2018-12-24', '131')], columns=('Dates', 'Price'))

df_n["News_Dates"] = pd.to_datetime(df_n["News_Dates"])
df1_zscore["Dates"] = pd.to_datetime(df1_zscore["Dates"])
n_dates = df_n["News_Dates"].values
p_dates = df1_zscore[["Dates"]].values

## subtract every news date from every price date; broadcasting gives a (prices x news) matrix
mat_date_compare = (p_dates - n_dates).astype('timedelta64[D]')

## boolean matrix: True where the news date is 0 to 5 days before the price date
## to be used as index for original array
comparison = (mat_date_compare <= pd.Timedelta("5d")) & (mat_date_compare >= pd.Timedelta("0d"))

## flat cell numbers (0 to matrix size - 1) of the cells that meet the condition
ind = np.arange(len(n_dates)*len(p_dates))[comparison.ravel()]


## recover row (price) and column (news) positions from the cell numbers to index the dfs
pd.concat([df1_zscore.iloc[ind//len(n_dates)].reset_index(drop=True),
           df_n.iloc[ind%len(n_dates)].reset_index(drop=True)], sort=False, axis=1)

Result:

       Dates Price News_Dates                                               News
0 2018-10-04   120 2018-09-29  Huge blow to ABC Corp. as they lost the 2012 t...
1 2018-10-04   120 2018-09-30                           ABC Corp. suffers a loss
2 2018-10-04   120 2018-10-01                            ABC Corp to Sell stakes
3 2018-12-24   131 2018-12-20       We are going to comeback strong said ABC CEO
4 2018-12-24   131 2018-12-22            Shares are down massively for ABC Corp.
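
The flat-index arithmetic above can also be expressed with np.nonzero, which returns the row and column positions of the True cells directly. This is a minimal sketch under the same sample data, not a replacement for the code above; the names news, prices, p_idx and n_idx are illustrative, and Price is assumed to be numeric here.

```python
import numpy as np
import pandas as pd

news = pd.DataFrame({
    "News_Dates": pd.to_datetime(
        ["2018-09-29", "2018-09-30", "2018-10-01", "2018-12-20", "2018-12-22"]
    ),
    "News": [
        "Huge blow to ABC Corp. as they lost the 2012 tax case",
        "ABC Corp. suffers a loss",
        "ABC Corp to Sell stakes",
        "We are going to comeback strong said ABC CEO",
        "Shares are down massively for ABC Corp.",
    ],
})
prices = pd.DataFrame({
    "Dates": pd.to_datetime(["2018-10-04", "2018-12-24"]),
    "Price": [120, 131],
})

n_dates = news["News_Dates"].to_numpy()          # shape (5,)
p_dates = prices["Dates"].to_numpy()[:, None]    # shape (2, 1)

# Broadcasting yields a (prices x news) matrix of "price date - news date".
diff = p_dates - n_dates

# True where the news item is dated 0 to 5 days before the price date.
mask = (diff >= np.timedelta64(0, "D")) & (diff <= np.timedelta64(5, "D"))

# Row positions index prices, column positions index news.
p_idx, n_idx = np.nonzero(mask)

pairs = pd.concat(
    [prices.iloc[p_idx].reset_index(drop=True),
     news.iloc[n_idx].reset_index(drop=True)],
    axis=1,
)
print(pairs)
```

Unlike merge_asof, this keeps every (price, news) pair inside the window, so one news item could match several price dates if their windows overlapped. The full matrix costs memory proportional to len(prices) * len(news), which may matter for large frames.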
answered 2019-09-30T10:19:09.947