Suppose you have some unevenly spaced time series data:
import pandas as pd
import random as randy
ts = pd.Series(range(1000),index=randy.sample(pd.date_range('2013-02-01 09:00:00.000000',periods=1e6,freq='U'),1000)).sort_index()
print ts.head()
2013-02-01 09:00:00.002895 995
2013-02-01 09:00:00.003765 499
2013-02-01 09:00:00.003838 797
2013-02-01 09:00:00.004727 295
2013-02-01 09:00:00.006287 253
Say I want a rolling sum over a 1 ms window, to get this:
2013-02-01 09:00:00.002895 995
2013-02-01 09:00:00.003765 499 + 995
2013-02-01 09:00:00.003838 797 + 499 + 995
2013-02-01 09:00:00.004727 295 + 797 + 499
2013-02-01 09:00:00.006287 253
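To pin down the window definition, here is a minimal pure-Python sketch that reproduces the sums above for the five sample rows. It assumes the window is closed on both ends, i.e. row i includes every earlier row j with times[i] - 1ms <= times[j] <= times[i]. The function name rolling_sum_bruteforce is made up for illustration, and this is only a slow reference, not something meant to scale.

```python
from datetime import datetime, timedelta

def rolling_sum_bruteforce(times, values, window):
    """For each row i, sum values[j] over all j <= i with times[j] >= times[i] - window."""
    out = []
    for i, t in enumerate(times):
        total = 0
        j = i
        # Walk backwards while the earlier sample still falls inside the window.
        while j >= 0 and times[j] >= t - window:
            total += values[j]
            j -= 1
        out.append(total)
    return out

# The five rows printed by ts.head() above, as microsecond offsets from 09:00:00.
base = datetime(2013, 2, 1, 9, 0, 0)
offsets_us = [2895, 3765, 3838, 4727, 6287]
times = [base + timedelta(microseconds=us) for us in offsets_us]
values = [995, 499, 797, 295, 253]

sums = rolling_sum_bruteforce(times, values, timedelta(milliseconds=1))
print(sums)
```

For the sample rows this gives 995, 499 + 995, 797 + 499 + 995, 295 + 797 + 499 and 253, matching the table above.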
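Currently I cast everything to longs and do this in Cython, but is it possible in pure pandas? I know you can do something like .asfreq('U'), then fill and use the traditional functions, but that doesn't scale once you have more than a toy number of rows.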
For reference, here is a hackish, not fast Cython version:
%%cython
import numpy as np
cimport cython
cimport numpy as np
ctypedef np.double_t DTYPE_t
def rolling_sum_cython(np.ndarray[long,ndim=1] times, np.ndarray[double,ndim=1] to_add, long window_size):
    cdef long t_len = times.shape[0], s_len = to_add.shape[0], i = 0, win_size = window_size, t_diff, j, window_start
    cdef np.ndarray[DTYPE_t, ndim=1] res = np.zeros(t_len, dtype=np.double)
    assert(t_len==s_len)
    for i in range(0,t_len):
        # earliest timestamp that still counts towards row i
        window_start = times[i] - win_size
        j = i
        # test j >= 0 first so times[j] is never read with a negative index
        while j >= 0 and times[j] >= window_start:
            res[i] += to_add[j]
            j -= 1
    return res
To demonstrate this on a slightly larger series:
ts = pd.Series(range(100000),index=randy.sample(pd.date_range('2013-02-01 09:00:00.000000',periods=1e8,freq='U'),100000)).sort_index()
%%timeit
res2 = rolling_sum_cython(ts.index.astype(int64),ts.values.astype(double),long(1e6))
1000 loops, best of 3: 1.56 ms per loop
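For comparison, the same windowed sum can probably be written without an explicit loop by combining a cumulative sum with a binary search for each window start. The following is only a sketch under that assumption, not a benchmarked replacement: it uses NumPy rather than pandas itself, the name rolling_sum_searchsorted is made up for illustration, and it expects the timestamps as sorted integer nanoseconds (as in the astype(int64) call above).

```python
import numpy as np

def rolling_sum_searchsorted(times_ns, values, window_ns):
    """Sum values over [times[i] - window_ns, times[i]] for each i; times_ns must be sorted."""
    times_ns = np.asarray(times_ns, dtype=np.int64)
    values = np.asarray(values, dtype=np.float64)
    # csum[k] is the sum of values[:k], so the sum of values[lo:hi] is csum[hi] - csum[lo].
    csum = np.concatenate(([0.0], np.cumsum(values)))
    # Index of the first timestamp that is >= times[i] - window_ns (window closed on the left).
    lo = np.searchsorted(times_ns, times_ns - window_ns, side="left")
    # The window for row i ends at row i itself, so the exclusive upper bound is i + 1.
    hi = np.arange(1, len(times_ns) + 1)
    return csum[hi] - csum[lo]

# The five sample rows from the top of the post, as nanosecond offsets; window is 1 ms = 1e6 ns.
sample_times = np.array([2895, 3765, 3838, 4727, 6287], dtype=np.int64) * 1000
sample_values = [995, 499, 797, 295, 253]
result = rolling_sum_searchsorted(sample_times, sample_values, 1000000)
print(result)
```

One caveat with this approach: differences of a running cumulative sum can pick up floating-point rounding error on long series of non-integer values, which the per-window loop in the Cython version avoids.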