0

我们如何通过对分组的行执行 OHLC 操作来聚合 DataFrame 中一天下午 4 点之后到第二天上午 10 点之前的所有行?

这会将原始 DataFrame 从

   symbol          datetime  open  high  low  close   date  toCombine
0    AAPL  2020-01-01 15:00     3     5    2      4  2020-01-01      False
1    AAPL  2020-01-01 15:30     4    10    4      8  2020-01-01      False
2    AAPL  2020-01-01 16:00     8    15    6     12  2020-01-01      False
3    AAPL  2020-01-01 18:00    12    20    8     16  2020-01-01       True
4    AAPL  2020-01-01 20:00    12    20    8     16  2020-01-01       True
5    AAPL  2020-01-02 07:00    15    24    9     19  2020-01-02       True
6    AAPL  2020-01-02 10:00    16    25   10     20  2020-01-02       True
7    AAPL  2020-01-02 12:00    20    30   12     24  2020-01-02      False
8    AAPL  2020-01-02 14:00    24    70   14     26  2020-01-02      False
9    AAPL  2020-01-02 16:00   103   105  102    104  2020-01-02      False
10   AAPL  2020-01-02 18:00   104   100  104    196  2020-01-02       True
11   AAPL  2020-01-03 08:00   108   110  106    112  2020-01-03       True
12   AAPL  2020-01-03 10:30   112   120  108    116  2020-01-03      False
13   AAPL  2020-01-03 13:00   115   124  109    119  2020-01-03      False

(忽略此处的索引值)

   symbol          datetime  open  high  low  close   date  
0    AAPL  2020-01-01 15:00     3     5    2      4  2020-01-01      
1    AAPL  2020-01-01 15:30     4    10    4      8  2020-01-01      
2    AAPL  2020-01-01 16:00     8    15    6     12  2020-01-01      
6    AAPL  2020-01-02 10:00    12    25    8     20  2020-01-02  <---- aggregated row    
7    AAPL  2020-01-02 12:00    20    30   12     24  2020-01-02      
8    AAPL  2020-01-02 14:00    24    70   14     26  2020-01-02
9    AAPL  2020-01-02 16:00   103   105  102    104  2020-01-02 
11   AAPL  2020-01-03 10:00   103   110  100    112  2020-01-03  <---- aggregated 10:00 row created if not exist
12   AAPL  2020-01-03 10:30   112   120  108    116  2020-01-03
13   AAPL  2020-01-03 13:00   115   124  109    119  2020-01-03     

笔记:

  1. toCombine已经创建了 column 来标记将被聚合到单个行中的行,其datetime值的时间为10:00.

  2. 如果datetime时间值为的行10:00不存在,则应创建该行。但是,如果也没有toCombine == True要从中聚合的行,则10:00不需要创建该行。

谢谢!


设置问题的 Python 代码

import pandas as pd

data = [
    ('AAPL', '2020-01-01 15:00', 3, 5, 2, 4, '2020-01-01', False),
    ('AAPL', '2020-01-01 15:30', 4, 10, 4, 8, '2020-01-01', False),
    ('AAPL', '2020-01-01 16:00', 8, 15, 6, 12, '2020-01-01', False),
    ('AAPL', '2020-01-01 18:00', 12, 20, 8, 16, '2020-01-01', True),        
    ('AAPL', '2020-01-01 20:00', 12, 20, 8, 16, '2020-01-01', True),        
    ('AAPL', '2020-01-02 07:00', 15, 24, 9, 19, '2020-01-02', True),       
    ('AAPL', '2020-01-02 10:00', 16, 25, 10, 20, '2020-01-02', True),
    ('AAPL', '2020-01-02 12:00', 20, 30, 12, 24, '2020-01-02', False),
    ('AAPL', '2020-01-02 14:00', 24, 70, 14, 26, '2020-01-02', False),
    ('AAPL', '2020-01-02 16:00', 103, 105, 102, 104, '2020-01-02', False),
    ('AAPL', '2020-01-02 18:00', 104, 100, 104, 196, '2020-01-02', True),     
    ('AAPL', '2020-01-03 08:00', 108, 110, 106, 112, '2020-01-03', True),    
    ('AAPL', '2020-01-03 10:30', 112, 120, 108, 116, '2020-01-03', False),     
    ('AAPL', '2020-01-03 13:00', 115, 124, 109, 119, '2020-01-03', False),     
]

df = pd.DataFrame(data, columns=['symbol', 'datetime', 'open', 'high', 'low', 'close', 'date', 'toCombine'])
print(df)
4

1 回答 1

1

我的方法

#Group and agg
m = df['toCombine']
agg_dict = {'datetime' : 'last',
            'open' : 'first',
            'high' : 'max',
            'low' : 'min',
            'close' : 'last'}

reduce_df = (df.loc[m].groupby(['symbol',(~m).cumsum()],
                                             as_index=False).agg(agg_dict))
print(reduce_df)

  symbol          datetime  open  high  low  close
0   AAPL  2020-01-02 10:00    12    25    8     20
1   AAPL  2020-01-03 08:00   104   110  104    112

如果排序正确,可能不需要按符号分组


#get correct datetime, append and sort
datetime = pd.to_datetime(reduce_df['datetime'])
hours = datetime.dt.hour
df = (reduce_df.assign(datetime = (datetime.mask(hours.gt(16),
                                        datetime.add(pd.to_timedelta(10 + 24 - hours,
                                        unit='h')))
                                 .mask(hours.lt(10),
                                       datetime.add(pd.to_timedelta(10 - hours,
                                                                    unit='h')))),            
                 date = lambda x: x['datetime'].dt.date)
          .assign(datetime = lambda x: x['datetime'].dt.strftime('%Y-%m-%d %H:%M'))
          .append(df.loc[~m])
          .drop(columns='toCombine')
          .sort_values(['symbol','datetime','date'])
)

print(df)



   symbol          datetime  open  high  low  close        date
0    AAPL  2020-01-01 15:00     3     5    2      4  2020-01-01
1    AAPL  2020-01-01 15:30     4    10    4      8  2020-01-01
2    AAPL  2020-01-01 16:00     8    15    6     12  2020-01-01
0    AAPL  2020-01-02 10:00    12    25    8     20  2020-01-02
7    AAPL  2020-01-02 12:00    20    30   12     24  2020-01-02
8    AAPL  2020-01-02 14:00    24    70   14     26  2020-01-02
9    AAPL  2020-01-02 16:00   103   105  102    104  2020-01-02
1    AAPL  2020-01-03 10:00   104   110  104    112  2020-01-03
12   AAPL  2020-01-03 10:30   112   120  108    116  2020-01-03

我们本可以使用dataframe.set_indexanddataframe.asfreqffill之前的 groupby,所以我们以后不必更改日期时间,但我认为性能会相似

如果你想包括 hour=16 你需要改变datetime.mask(hours.ge(16).......

当我们使用.loc包含它时,因为它不包含在toCombine第 9 行 False

于 2020-03-22T23:12:53.677 回答