python - Faster way to group by time of day in pandas

Question

I have a time series of several days of 1-minute data, and would like to average it across all days by time of day.

This is very slow:

from datetime import datetime
from numpy.random import randn
from pandas import date_range, Series

time_ind = date_range(datetime(2013, 1, 1), datetime(2013, 1, 10), freq='1min')
all_data = Series(randn(len(time_ind)), time_ind)
time_mean = all_data.groupby(lambda x: x.time()).mean()

It takes almost a minute to run!

While something like:

time_mean = all_data.groupby(lambda x: x.minute).mean()

takes only a fraction of a second.

Is there a faster way to group by time of day?

Any idea why this is so slow?

score 3 · Accepted Answer

Both your "lambda version" and the time property introduced in version 0.11 seem to be slow in version 0.11.0:

In [4]: %timeit all_data.groupby(all_data.index.time).mean()
1 loops, best of 3: 11.8 s per loop

In [5]: %timeit all_data.groupby(lambda x: x.time()).mean()
Exception RuntimeError: 'maximum recursion depth exceeded while calling a Python object' in <type 'exceptions.RuntimeError'> ignored
Exception RuntimeError: 'maximum recursion depth exceeded while calling a Python object' in <type 'exceptions.RuntimeError'> ignored
Exception RuntimeError: 'maximum recursion depth exceeded while calling a Python object' in <type 'exceptions.RuntimeError'> ignored
1 loops, best of 3: 11.8 s per loop

With current master, both methods are considerably faster:

In [1]: pd.version.version
Out[1]: '0.11.1.dev-06cd915'

In [5]: %timeit all_data.groupby(lambda x: x.time()).mean()
1 loops, best of 3: 215 ms per loop

In [6]: %timeit all_data.groupby(all_data.index.time).mean()
10 loops, best of 3: 113 ms per loop

So you can either update to master or wait for 0.11.1, which should be released this month.
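
For reference, the faster of the two timed variants can be written as a standalone script. This is only a minimal sketch: it uses the `pd`/`np` import aliases and builds the index with `periods=` (so the data covers exactly ten full days), neither of which comes from the question.

```python
import numpy as np
import pandas as pd

# Ten full days of 1-minute data (assumption: built with periods= rather than
# the start/end datetimes used in the question).
time_ind = pd.date_range("2013-01-01", periods=10 * 24 * 60, freq="min")
all_data = pd.Series(np.random.randn(len(time_ind)), index=time_ind)

# Group by the precomputed array of datetime.time objects instead of
# calling a Python lambda once per index label.
time_mean = all_data.groupby(all_data.index.time).mean()

print(time_mean.head())
```

The result has one row per minute of the day, indexed by `datetime.time` objects.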

score 2 · Accepted Answer

It's faster to group by the hour/minute/... attributes rather than .time. Here is Jeff's baseline:

In [11]: %timeit all_data.groupby(all_data.index.time).mean()
1 loops, best of 3: 202 ms per loop

And without .time it's much faster (the fewer attributes, the faster it is):

In [12]: %timeit all_data.groupby(all_data.index.hour).mean()
100 loops, best of 3: 5.53 ms per loop

In [13]: %timeit all_data.groupby([all_data.index.hour, all_data.index.minute, all_data.index.second, all_data.index.microsecond]).mean()
10 loops, best of 3: 20.8 ms per loop

Note: time objects don't accept nanoseconds (but that is the resolution of a DatetimeIndex).
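
A small illustration of that resolution gap, as a sketch only; the timestamp value is made up for the example:

```python
import pandas as pd

# Hypothetical timestamp with a nonzero nanosecond component.
ts = pd.Timestamp("2013-01-01 12:34:56.123456789")

# datetime.time stores at most microseconds, so the trailing
# nanoseconds are not carried over into the time object.
t = ts.time()

print(ts.nanosecond)   # nanosecond part kept by the Timestamp
print(t.microsecond)   # finest field available on datetime.time
```

So grouping by `.time` silently merges values that differ only at the nanosecond level, which does not matter for 1-minute data like the question's.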

We should probably convert the index to time objects to make the comparison fair:

In [21]: res = all_data.groupby([all_data.index.hour, all_data.index.minute, all_data.index.second, all_data.index.microsecond]).mean()

In [22]: %timeit res.index.map(lambda t: datetime.time(*t))
1000 loops, best of 3: 1.39 ms per loop

In [23]: res.index = res.index.map(lambda t: datetime.time(*t))

So at maximum resolution it's about 10x faster, and you can easily make it coarser (and faster), e.g. by grouping on just hour and minute.
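
The coarser hour-and-minute variant might look like the following. It is a sketch under stated assumptions: the data is a deterministic `arange` over three days rather than the question's random series, so the averages are easy to check by hand.

```python
import datetime

import numpy as np
import pandas as pd

# Three full days of 1-minute data with known values (assumption: not the
# random data from the question).
time_ind = pd.date_range("2013-01-01", periods=3 * 24 * 60, freq="min")
all_data = pd.Series(np.arange(len(time_ind), dtype=float), index=time_ind)

# Group on two cheap integer attributes instead of datetime.time objects.
idx = all_data.index
res = all_data.groupby([idx.hour, idx.minute]).mean()

# The result is indexed by (hour, minute) pairs; rebuild time objects
# once, on the 1440 group labels rather than on every row.
res.index = [datetime.time(int(h), int(m)) for h, m in res.index]

print(res.head())
```

The conversion to `datetime.time` runs on the grouped result only, which is why it adds little to the total time.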
