1

我有一个包含月度值的数据框,我想获得平均 3 个月组的季度值 我的数据就是这样(例如,只有前 9 个月)

month        01     02     03     04     05     06     07     08     09  \
year                                                                  
2000       90.26  90.95  91.04  90.87  90.78  91.13  90.87  90.95  91.30   
2000       87.89  89.68  90.10  90.27  90.53  90.87  89.93  91.30  91.98   
2000       74.17  74.98  74.74  73.97  74.07  74.26  74.71  76.93  78.67   
2000        NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
2000       86.74  85.48  87.45  88.31  88.71  88.23  88.08  87.76  88.94 

我想获得 Q1 作为月 01,02,03 的平均值。我可以:

 df['Q1']=(df['01']+df['02']+df['03'])/3

但是我会遇到楠的问题。

我可以以三个月为一组计算均值吗?

4

3 回答 3

2

您可以使用 loc 手动执行此操作并表示:

In [11]: df.loc[:, ['01', '02', '03']]
Out[11]: 
         01     02     03
year                    
2000  90.26  90.95  91.04
2000  87.89  89.68  90.10
2000  74.17  74.98  74.74
2000    NaN    NaN    NaN
2000  86.74  85.48  87.45

In [12]: df.loc[:, ['01', '02', '03']].mean(axis=1)
Out[12]: 
year
2000    90.750000
2000    89.223333
2000    74.630000
2000          NaN
2000    86.556667
dtype: float64

但使用 pandas 的rolling_mean可能更有意义:

In [21]: pd.rolling_mean(df.T, 3)
Out[21]: 
year        2000       2000       2000  2000       2000
month                                                  
01           NaN        NaN        NaN   NaN        NaN
02           NaN        NaN        NaN   NaN        NaN
03     90.750000  89.223333  74.630000   NaN  86.556667
04     90.953333  90.016667  74.563333   NaN  87.080000
05     90.896667  90.300000  74.260000   NaN  88.156667
06     90.926667  90.556667  74.100000   NaN  88.416667
07     90.926667  90.443333  74.346667   NaN  88.340000
08     90.983333  90.700000  75.300000   NaN  88.023333
09     91.040000  91.070000  76.770000   NaN  88.260000

默认情况下,这会向后看 3 个周期,因此我们必须将其向上移动两个:

In [22]: pd.rolling_mean(df.T, 3).shift(-2)
Out[22]: 
year        2000       2000       2000  2000       2000
month                                                  
01     90.750000  89.223333  74.630000   NaN  86.556667
02     90.953333  90.016667  74.563333   NaN  87.080000
03     90.896667  90.300000  74.260000   NaN  88.156667
04     90.926667  90.556667  74.100000   NaN  88.416667
05     90.926667  90.443333  74.346667   NaN  88.340000
06     90.983333  90.700000  75.300000   NaN  88.023333
07     91.040000  91.070000  76.770000   NaN  88.260000
08           NaN        NaN        NaN   NaN        NaN
09           NaN        NaN        NaN   NaN        NaN

并转置以获取正确的形式:

In [23]: pd.rolling_mean(df.T, 3).shift(-2).T
Out[23]: 
month         01         02         03         04         05         06      07   08   09
year                                                                      
2000   90.750000  90.953333  90.896667  90.926667  90.926667  90.983333   91.04  NaN  NaN 
2000   89.223333  90.016667  90.300000  90.556667  90.443333  90.700000   91.07  NaN  NaN 
2000   74.630000  74.563333  74.260000  74.100000  74.346667  75.300000   76.77  NaN  NaN 
2000         NaN        NaN        NaN        NaN        NaN        NaN     NaN  NaN  NaN  
2000   86.556667  87.080000  88.156667  88.416667  88.340000  88.023333   88.26  NaN  NaN 
于 2013-09-11T16:33:35.110 回答
1

使用重采样。

In [89]: x
Out[89]: 
           1      2      3      4      5      6      7      8      9
month                                                               
2000   90.26  90.95  91.04  90.87  90.78  91.13  90.87  90.95  91.30
2000   87.89  89.68  90.10  90.27  90.53  90.87  89.93  91.30  91.98
2000   74.17  74.98  74.74  73.97  74.07  74.26  74.71  76.93  78.67
2000     NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN
2000   86.74  85.48  87.45  88.31  88.71  88.23  88.08  87.76  88.94

In [90]: x.columns = pd.PeriodIndex([pd.Period(year=2000, month=m, freq='M')
                                         for m in x.columns])

In [92]: x.index = ['%s_%s' % (y,i) for i, y in enumerate(x.index)]

In [93]: x
Out[93]: 
        2000-01  2000-02  2000-03  2000-04  2000-05  2000-06  2000-07  2000-08  2000-09
2000_0    90.26    90.95    91.04    90.87    90.78    91.13    90.87    90.95    91.30
2000_1    87.89    89.68    90.10    90.27    90.53    90.87    89.93    91.30    91.98
2000_2    74.17    74.98    74.74    73.97    74.07    74.26    74.71    76.93    78.67
2000_3      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
2000_4    86.74    85.48    87.45    88.31    88.71    88.23    88.08    87.76    88.94

In [94]: x.resample('Q', axis=1)
Out[94]: 
           2000Q1     2000Q2  2000Q3
2000_0  90.750000  90.926667   91.04
2000_1  89.223333  90.556667   91.07
2000_2  74.630000  74.100000   76.77
2000_3        NaN        NaN     NaN
2000_4  86.556667  88.416667   88.26

对重复索引进行重新采样时存在一个错误,这就是我在此处重命名索引的原因,它已修复为 0.13(但此解决方案使用的是 0.12)。

这最终是最灵活的,因为您现在可以按不同的频率重新采样。

In [95]: x.resample('Q-JAN', axis=1)
Out[95]: 
        2000Q4     2001Q1     2001Q2  2001Q3
2000_0   90.26  90.953333  90.926667  91.125
2000_1   87.89  90.016667  90.443333  91.640
2000_2   74.17  74.563333  74.346667  77.800
2000_3     NaN        NaN        NaN     NaN
2000_4   86.74  87.080000  88.340000  88.350
于 2013-09-11T16:49:45.487 回答
0
import pandas as pd
import io

content = io.BytesIO('''\
year        01     02     03     04     05     06     07     08     09  
2000       90.26  90.95  91.04  90.87  90.78  91.13  90.87  90.95  91.30   
2000       87.89  89.68  90.10  90.27  90.53  90.87  89.93  91.30  91.98   
2000       74.17  74.98  74.74  73.97  74.07  74.26  74.71  76.93  78.67   
2000        NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
2000       86.74  85.48  87.45  88.31  88.71  88.23  88.08  87.76  88.94''')

df = pd.read_table(content, sep='\s+', index_col=0)
df.columns.name='month'
df2 = df.groupby(by=lambda x: (int(x)-1)//3, axis=1).mean()
df2.columns='Q1 Q2 Q3'.split()
print(df2)

             Q1         Q2     Q3
year                             
2000  90.750000  90.926667  91.04
2000  89.223333  90.556667  91.07
2000  74.630000  74.100000  76.77
2000        NaN        NaN    NaN
2000  86.556667  88.416667  88.26

You can join these columns to the original dataframe with

df = pd.concat([df2, df], axis=1)
于 2013-09-11T16:43:03.630 回答