我有一个 pandas 多级数据框df
,其中包含从 2006 年到 2012 年大约 2000 多只股票的季度财务报告数据。我正在尝试找出一种方法来快速计算每个数据点的“平均值” 。
demo_data()
是生成demo数据的函数(df = demo_data(stk_qty=2000, col_num=200)
可以用来模拟财报数据):
def demo_data(stk_qty, col_num):
''' generate demo data, return multilevel dataframe '''
import random
import pandas as pd
rpt_date_template = [(yr+qt) for yr in map(str, range(2006, 2013)) for qt in ['0331','0630','0930','1231']]
stk_id_list = ['STK'+str(x).zfill(3) for x in range(0, stk_qty)]
stk_id_column, rpt_date_column = [], []
for i in range(stk_qty):
stk_rpt_date_list = rpt_date_template[random.randint(0,8):] # rpt dates with random start
stk_id_column.extend([stk_id_list[i]] * len(stk_rpt_date_list))
rpt_date_column.extend(stk_rpt_date_list)
index_name = ['STK_ID', 'RPT_Date']
col_name = ['COL'+str(x).zfill(3) for x in range(col_num)]
first_level_dt = stk_id_column
second_level_dt = rpt_date_column
dt = pd.DataFrame(np.random.randn(len(stk_id_column), col_num), columns=col_name)
dt[index_name[0]] = first_level_dt
dt[index_name[1]] = second_level_dt
multilevel_df = dt.set_index(index_name, drop=True, inplace=False)
return multilevel_df
这是一个示例数据。(注:sw() 是显示大数据框四个角数据的方法,源码在:如何预览大熊猫数据框的一部分?)
>>> df = demo_data(5,3)
>>> df.sw()
COL000 COL001 COL002
STK_ID RPT_Date
STK000 20060630 1.8196 0.9519 -1.0526
20060930 -0.4074 -0.9025 1.3562
20061231 -1.1750 0.4190 -1.2976
20070331 -0.5609 1.5190 0.4893
20070630 0.4580 -0.3804 0.3705
20070930 -0.4711 -1.1953 -0.0609
20071231 0.3363 1.1949 1.2802
20080331 1.6359 0.8355 -0.2763
20080630 0.2697 -0.8236 -1.7095
20080930 0.6178 -0.3742 -1.1646
.......................................
STK004 20111231 -0.3198 1.6972 -1.3281
20120331 -1.1905 -0.4597 0.3695
20120630 -0.8253 -0.0502 -0.2862
20120930 0.0059 -1.8535 -1.2107
20121231 0.5762 -0.2872 0.0993
Index : ['STK_ID', 'RPT_Date']
Column: COL000,COL001,COL002
row: 117 col: 3
我想要的自定义平均函数命名为my_avg()
并定义如下规则:
1. Q1's average value is (Q4_of_previous_yr + Q1)/2
2. Q2's average value is (Q4_of_previous_yr + Q1 + Q2)/3
3. Q3's average value is (Q4_of_previous_yr + Q1 + Q2 + Q3)/4
4. Q4's average value is (Q4_of_previous_yr + Q1 + Q2 + Q3 + Q4)/5
5. if some of the data points are not provided, just calculate the normal average of available data points
所以my_avg(df)
每个 STK_ID 都会有以下输出:
STK_ID RPT_Date COL000 COL001 COL002
STK000 20060630 1.819619705 0.951918984 -1.052639309
20060930 0.706112476 0.024688028 0.151757352
20061231 0.079077767 0.156125083 -0.331359614
20070331 -0.867930112 0.969000466 -0.404129827
20070630 -0.425943376 0.519205768 -0.145929753
20070930 -0.437234418 0.090579744 -0.124681449
20071231 -0.282524858 0.3114374 0.156297097
20080331 0.986121631 1.015202552 0.501971496
.......................................
STK004 20111231 xxxxx xxxxxxx xxxxxxx
如何编写代码my_avg()
?
参考:我尝试写一个temp_solution_avg()
函数。但它存在三个问题:
1. the average calculation not include 'Q4_of_previous_yr' data point, so the result is not what I want.
2. data's 'RPT_Date' must start with Q1 ('xxxx0331'), otherwise first yr's data is wrong
3. the calculation speed is very very slow.
In [3]: df = demo_data(500,100)
In [4]: timeit temp_solution_avg(df)
1 loops, best of 3: 66.3 s per loop
def temp_solution_avg(df):
''' return the average , Q1: not change, Q2 : (df.Q1 + df.Q2)/2 ,
Q3: (df.Q1 + df.Q2 + df.Q3)/3, Q4 : (df.Q1 + df.Q2 + df.Q3 + df.Q4)/4
data's 'RPT_Date' must start with Q1 ('xxxx0331'), otherwise first yr's
data is wrong .
'''
dt = df.reset_index()
dt['yr'] = dt['RPT_Date'].str[0:4]
dt['temp_stk_id'] = dt['STK_ID']
dt = dt.set_index(['STK_ID','RPT_Date'], drop=True, inplace=False)
rst = dt.groupby(['temp_stk_id','yr']).transform(pd.expanding_mean)
return rst