python - 具有分层索引和滚动/移位的 Python pandas 时间序列

Question

与熊猫的滚动和移动概念作斗争。在这个论坛中有很多很好的建议，但我很遗憾地将这些建议应用到我的场景中。

现在我在时间序列上使用传统的循环，但是，迭代超过 150,000 行大约需要 8 个小时，这对于所有代码来说大约是 3 天的数据。有 2 个月的数据来处理它可能不会在我休假回来后完成，更不用说断电的风险，之后我必须重新开始，这次没有休假，等待。

我有以下 15 分钟的股票价格时间序列（日期时间（时间戳）和股票代码的层次索引，唯一的原始列是 closePrice）：

                                     closePrice
    datetime               ticker
    2014-02-04 09:15:00    AAPL      xxx
                           EQIX      xxx
                           FB        xxx
                           GOOG      xxx
                           MSFT      xxx
    2014-02-04 09:30:00    AAPL      xxx
                           EQIX      xxx
                           FB        xxx
                           GOOG      xxx
                           MSFT      xxx
    2014-02-04 09:45:00    AAPL      xxx
                           EQIX      xxx
                           FB        xxx
                           GOOG      xxx
                           MSFT      xxx

我需要添加两列：

12sma，12天移动平均线。搜索了几个小时后，最好的建议是使用 rolling_mean，所以我尝试了。但考虑到我的 TS 结构，它不起作用，即它自上而下地工作，第一个 MA 是基于前 12 行计算的，而不管不同的股票代码值。我如何根据索引（即第一个日期时间然后股票代码）使其平均值，以便我获得 MA 表示 AAPL？目前确实如此（AAPL+EQIX+FB+GOOG+MSFT+AAPL...直到第 12 行）/12
一旦我得到 12sma 柱，我需要 12ema 柱，12 天指数 MA。对于计算，每个代码的时间序列中的第一个值只会从同一行复制 12sma 值。随后，我需要同一行的 closePrice 和前一行的 12ema，即过去 15 分钟。我做了长时间的研究，似乎解决方案是滚动和移位的组合，但我不知道如何将它们组合在一起。

任何帮助我将不胜感激。

谢谢。

编辑：

感谢 Jeff 的提示，在交换和排序 ix 级别之后，我能够使用 rolling_mean() 正确地获得 12sma，并努力在同一时间戳插入从 12sma 复制的第一个 12ema 值：

                                 close  12sma  12ema
    sec_code datetime
    AAPL     2014-02-05 11:45:00 113.0  NaN    NaN
             2014-02-05 12:00:00 113.2  NaN    NaN
             2014-02-05 13:15:00 112.9  NaN    NaN
             2014-02-05 13:30:00 113.2  NaN    NaN
             2014-02-05 13:45:00 113.0  NaN    NaN
             2014-02-05 14:00:00 113.1  NaN    NaN
             2014-02-05 14:15:00 113.3  NaN    NaN
             2014-02-05 14:30:00 113.3  NaN    NaN
             2014-02-05 14:45:00 113.3  NaN    NaN
             2014-02-05 15:00:00 113.2  NaN    NaN
             2014-02-05 15:15:00 113.2  NaN    NaN
             2014-02-05 15:30:00 113.3  113.16 113.16
             2014-02-05 15:45:00 113.3  113.19 NaN
             2014-02-05 16:00:00 113.2  113.19 NaN
             2014-02-06 09:45:00 112.6  113.16 NaN
             2014-02-06 10:00:00 113.5  113.19 NaN
             2014-02-06 10:15:00 113.8  113.25 NaN
             2014-02-06 10:30:00 113.5  113.29 NaN
             2014-02-06 10:45:00 113.7  113.32 NaN
             2014-02-06 11:00:00 113.5  113.34 Nan

我知道 pandas 有 pandas.stats.moments.ewma 但我更喜欢使用我从一本书中得到的公式，该公式需要“目前”的收盘价和前一行的 12ema。

因此，我尝试从 2 月 5 日 15:45 开始填充 12ema 列。我用一个函数尝试了 apply() 但 shift 给出了一个错误：

    def f12ema(x):
        K = 2 / (12 + 1)
        return x['price_nom'] * K + x['12ema'].shift(-1) * (1-K)

    df1.apply(f12ema, axis=1)

    AttributeError: ("'numpy.float64' object has no attribute 'shift'", u'occurred at index 2014-02-05 11:45:00')

我想到的另一种可能性是 rolling_appy() ，但它超出了我的知识范围。

score 1 · Accepted Answer

创建一个包含您想要的时间的日期范围

In [47]: rng = date_range('20130102 09:30:00','20130105 16:00:00',freq='15T')

In [48]: rng = rng.take(rng.indexer_between_time('09:30:00','16:00:00'))

In [49]: rng
Out[49]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-02 09:30:00, ..., 2013-01-05 16:00:00]
Length: 108, Freq: None, Timezone: None

创建一个与您类似的框架（2000 个代码 x 日期）

In [50]: df = DataFrame(np.random.randn(len(rng)*2000,1),columns=['close'],index=MultiIndex.from_product([rng,range(2000)],names=['date','ticker']))

重新排序级别，使其股票代码 x 日期为索引，排序！！！！

In [51]: df = df.swaplevel('ticker','date').sortlevel()

In [52]: df
Out[52]: 
                               close
ticker date                         
0      2013-01-02 09:30:00  0.840767
       2013-01-02 09:45:00  1.808101
       2013-01-02 10:00:00 -0.718354
       2013-01-02 10:15:00 -0.484502
       2013-01-02 10:30:00  0.563363
       2013-01-02 10:45:00  0.553920
       2013-01-02 11:00:00  1.266992
       2013-01-02 11:15:00 -0.641117
       2013-01-02 11:30:00 -0.574673
       2013-01-02 11:45:00  0.861825
       2013-01-02 12:00:00 -1.562111
       2013-01-02 12:15:00 -0.072966
       2013-01-02 12:30:00  0.673079
       2013-01-02 12:45:00  0.766105
       2013-01-02 13:00:00  0.086202
       2013-01-02 13:15:00  0.949205
       2013-01-02 13:30:00 -0.381194
       2013-01-02 13:45:00  0.316813
       2013-01-02 14:00:00 -0.620176
       2013-01-02 14:15:00 -0.193126
       2013-01-02 14:30:00 -1.552111
       2013-01-02 14:45:00  1.724429
       2013-01-02 15:00:00 -0.092393
       2013-01-02 15:15:00  0.197763
       2013-01-02 15:30:00  0.064541
       2013-01-02 15:45:00 -1.574853
       2013-01-02 16:00:00 -1.023979
       2013-01-03 09:30:00 -0.079349
       2013-01-03 09:45:00 -0.749285
       2013-01-03 10:00:00  0.652721
       2013-01-03 10:15:00 -0.818152
       2013-01-03 10:30:00  0.674068
       2013-01-03 10:45:00  2.302714
       2013-01-03 11:00:00  0.614686

                                 ...

[216000 rows x 1 columns]

按股票代码分组。为应用 rolling_mean 和 ewma 的每个股票返回一个 DataFrame。请注意，有很多选项可以控制这个，例如窗口，你可以让它不环绕几天等。

In [53]: df.groupby(level='ticker')['close'].apply(lambda x: concat({ 'spma' : pd.rolling_mean(x,3), 'ema' : pd.ewma(x,3) }, axis=1))
Out[53]: 
                                 ema      spma
ticker date                                   
0      2013-01-02 09:30:00  0.840767       NaN
       2013-01-02 09:45:00  1.393529       NaN
       2013-01-02 10:00:00  0.480282  0.643504
       2013-01-02 10:15:00  0.127447  0.201748
       2013-01-02 10:30:00  0.270334 -0.213164
       2013-01-02 10:45:00  0.356580  0.210927
       2013-01-02 11:00:00  0.619245  0.794758
       2013-01-02 11:15:00  0.269100  0.393265
       2013-01-02 11:30:00  0.041032  0.017067
       2013-01-02 11:45:00  0.258476 -0.117988
       2013-01-02 12:00:00 -0.216742 -0.424986
       2013-01-02 12:15:00 -0.179622 -0.257750
       2013-01-02 12:30:00  0.038741 -0.320666
       2013-01-02 12:45:00  0.223881  0.455406
       2013-01-02 13:00:00  0.188995  0.508462
       2013-01-02 13:15:00  0.380972  0.600504
       2013-01-02 13:30:00  0.188987  0.218071
       2013-01-02 13:45:00  0.221125  0.294942
       2013-01-02 14:00:00  0.009907 -0.228185
       2013-01-02 14:15:00 -0.041013 -0.165496
       2013-01-02 14:30:00 -0.419688 -0.788471
       2013-01-02 14:45:00  0.117299 -0.006936

       2013-01-04 10:00:00 -0.060415  0.341013
       2013-01-04 10:15:00  0.074068  0.604611
       2013-01-04 10:30:00 -0.108502  0.440256
       2013-01-04 10:45:00 -0.514229 -0.636702
                                 ...       ...

[216000 rows x 2 columns]

相当不错的性能，因为它本质上是在代码上循环的。

In [54]: %timeit df.groupby(level='ticker')['close'].apply(lambda x: concat({ 'spma' : pd.rolling_mean(x,3), 'ema' : pd.ewma(x,3) }, axis=1))
1 loops, best of 3: 2.1 s per loop

python - 具有分层索引和滚动/移位的 Python pandas 时间序列

1 回答 1

Related

Reference