python - How to group by one or more dimensions in a time series using Pandas?

Question

I have the following data:

timestamp, country_code,  request_type,   latency
2013-10-10-13:40:01,  1,    get_account,    134
2013-10-10-13:40:63,  34,   get_account,    256
2013-10-10-13:41:09,  230,  modify_account, 589
2013-10-10-13:41:12,  230,  get_account,    43
2013-10-10-13:53:12,  1,    modify_account, 1003

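The timestamps have one-second resolution and are not regularly spaced.

How can I express queries like the following in pandas:

The number of requests per country_code at 10-minute resolution?
The 99th percentile latency per request_type at 1-minute resolution?
The number of requests per country_code and request_type at 10-minute resolution?

And then plot all the groups on the same graph, with each group as its own line over time.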

Update:

Based on the suggestions in 1, I have:

bycc = df.groupby('country_code').request_type.resample('10T', how='count')
bycc.plot() # BAD: uses (country_code, timestamp) on the x axis
bycc[1].plot() # properly graphs the time-series for country_code=1

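But I can't seem to find a simple way to plot each country_code as a separate line, with proper timestamps on the x axis and values on the y axis. I think there are 2 problems: (1) the timestamps for each country_code are different, so they need to be aligned to the same start/end, and (2) I need to find the right API/method to produce, from the multi-index TimeSeries object, a single plot with 1 line for each first-level value of the multi-index. Working my way through it...

Update 2

The following seems to do it: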

import pylab

max_lines = 3  # plot only the first few groups (avoids shadowing the built-in max)
pylab.rcParams['figure.figsize'] = (20.0, 10.0)  # get bigger graph
for i, cc in enumerate(bycc.index.levels[0]):
    if i < max_lines:
        cclabel = "cc=%d" % cc
        bycc[cc].plot(legend=True, label=cclabel)

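Only the first few groups are plotted, because the graph gets noisy otherwise. Now to figure out how to better display a graph with a large number of time series.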

score 6 · Accepted Answer

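Note: pandas cannot parse the datetime string "2013-10-10-13:40:63" because of the 4 extra seconds past the end of the minute (dateutil cannot parse it, and pandas uses dateutil to parse dates). For illustration purposes I have changed it to "2013-10-10-13:40:59".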
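
If many rows might carry invalid timestamps like this, one option is to parse with an explicit format and let pandas mark the failures instead of raising. This is only a sketch: the `raw` list is a hypothetical stand-in for the timestamp column, and the format string is inferred from the sample data in the question.

```python
import pandas as pd

# Hypothetical stand-in for the timestamp column of the question's data.
raw = [
    "2013-10-10-13:40:01",
    "2013-10-10-13:40:63",  # invalid: 63 is not a valid seconds value
    "2013-10-10-13:41:09",
]

# errors="coerce" turns values that do not match the format into NaT
# instead of raising, so the bad rows can be inspected or dropped.
parsed = pd.to_datetime(pd.Series(raw), format="%Y-%m-%d-%H:%M:%S", errors="coerce")
bad_rows = parsed.isna()

print(parsed)
print("unparseable rows:", int(bad_rows.sum()))
```

Whether to drop, repair, or reject such rows depends on the data source; here the answer simply edits the one offending value by hand.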

1. Number of requests per `country_code` at 10-minute resolution:

In [83]: df
Out[83]:
                     country_code    request_type  latency
timestamp
2013-10-10 13:40:01             1     get_account      134
2013-10-10 13:40:59            34     get_account      256
2013-10-10 13:41:09           230  modify_account      589
2013-10-10 13:41:12           230     get_account       43
2013-10-10 13:53:12             1  modify_account     1003

In [100]: df.groupby('country_code').request_type.resample('10T', how='count')
Out[100]:
country_code  timestamp
1             2013-10-10 13:40:00    1
              2013-10-10 13:50:00    1
34            2013-10-10 13:40:00    1
230           2013-10-10 13:40:00    2
dtype: int64
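
Later pandas releases removed the `how=` argument of `resample`, and newer ones prefer the `"min"` alias over `"T"`. A sketch of what the same query might look like there, rebuilding the sample frame from the question (with the corrected timestamp) so the snippet is self-contained:

```python
import io
import pandas as pd

csv = """timestamp,country_code,request_type,latency
2013-10-10-13:40:01,1,get_account,134
2013-10-10-13:40:59,34,get_account,256
2013-10-10-13:41:09,230,modify_account,589
2013-10-10-13:41:12,230,get_account,43
2013-10-10-13:53:12,1,modify_account,1003
"""
df = pd.read_csv(io.StringIO(csv))
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%Y-%m-%d-%H:%M:%S")
df = df.set_index("timestamp")  # resample needs a DatetimeIndex

# Group by country, then count rows in each 10-minute bin of every group.
counts = df.groupby("country_code")["request_type"].resample("10min").count()
print(counts)
```

Each group is resampled over its own time range, which is why the groups do not share the same start and end timestamps.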

2. 99th percentile of `latency` per `request_type` at 1-minute resolution

A very similar approach works here too:

In [107]: df.groupby('request_type').latency.resample('T', how=lambda x: x.quantile(0.99))
Out[107]:
request_type    timestamp
get_account     2013-10-10 13:40:00     254.78
                2013-10-10 13:41:00      43.00
modify_account  2013-10-10 13:41:00     589.00
                2013-10-10 13:42:00        NaN
                2013-10-10 13:43:00        NaN
                2013-10-10 13:44:00        NaN
                2013-10-10 13:45:00        NaN
                2013-10-10 13:46:00        NaN
                2013-10-10 13:47:00        NaN
                2013-10-10 13:48:00        NaN
                2013-10-10 13:49:00        NaN
                2013-10-10 13:50:00        NaN
                2013-10-10 13:51:00        NaN
                2013-10-10 13:52:00        NaN
                2013-10-10 13:53:00    1003.00
dtype: float64
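
The NaN rows above are 1-minute bins in which `modify_account` had no requests. On pandas releases without `how=`, the query could be written roughly as follows; this is a sketch on the same sample data, and dropping the empty bins at the end is an optional step that the original answer does not take.

```python
import io
import pandas as pd

csv = """timestamp,country_code,request_type,latency
2013-10-10-13:40:01,1,get_account,134
2013-10-10-13:40:59,34,get_account,256
2013-10-10-13:41:09,230,modify_account,589
2013-10-10-13:41:12,230,get_account,43
2013-10-10-13:53:12,1,modify_account,1003
"""
df = pd.read_csv(io.StringIO(csv))
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%Y-%m-%d-%H:%M:%S")
df = df.set_index("timestamp")

# 99th percentile of latency per request type in 1-minute bins.
q99 = df.groupby("request_type")["latency"].resample("1min").quantile(0.99)

# Optional: keep only the bins that actually contained requests.
q99_nonempty = q99.dropna()
print(q99_nonempty)
```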

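3. Number of requests per `country_code` and `request_type` at 10-minute resolution

This is basically the same as #1, except that you add an additional group in the call to `DataFrame.groupby`: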

In [108]: df.groupby(['country_code', 'request_type']).request_type.resample('10T', how='count')
Out[108]:
country_code  request_type    timestamp
1             get_account     2013-10-10 13:40:00    1
              modify_account  2013-10-10 13:50:00    1
34            get_account     2013-10-10 13:40:00    1
230           get_account     2013-10-10 13:40:00    1
              modify_account  2013-10-10 13:40:00    1
dtype: int64
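
Another way to express the same grouping, which may be easier to extend, is to pass the time bin as one more grouping key through `pd.Grouper`. This is a sketch of an alternative, not the approach used in the transcript above, and it assumes the timestamps are the index of `df`:

```python
import io
import pandas as pd

csv = """timestamp,country_code,request_type,latency
2013-10-10-13:40:01,1,get_account,134
2013-10-10-13:40:59,34,get_account,256
2013-10-10-13:41:09,230,modify_account,589
2013-10-10-13:41:12,230,get_account,43
2013-10-10-13:53:12,1,modify_account,1003
"""
df = pd.read_csv(io.StringIO(csv))
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%Y-%m-%d-%H:%M:%S")
df = df.set_index("timestamp")

# pd.Grouper(freq=...) bins the DatetimeIndex; the other keys are plain columns.
keys = ["country_code", "request_type", pd.Grouper(freq="10min")]
counts = df.groupby(keys).size()
print(counts)
```

Only combinations that actually occur in the data appear in the result, so there are no empty bins to filter out.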

It's not clear exactly what you are asking for in the plotting part; please elaborate.
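
If the goal is one line per `country_code` on a shared time axis, one possibility is to pivot the group level into columns so that every group is aligned on the same timestamps. The following is a sketch under that assumption; filling missing bins with 0 and the `cc=` labels are illustrative choices.

```python
import io
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import pandas as pd

csv = """timestamp,country_code,request_type,latency
2013-10-10-13:40:01,1,get_account,134
2013-10-10-13:40:59,34,get_account,256
2013-10-10-13:41:09,230,modify_account,589
2013-10-10-13:41:12,230,get_account,43
2013-10-10-13:53:12,1,modify_account,1003
"""
df = pd.read_csv(io.StringIO(csv))
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%Y-%m-%d-%H:%M:%S")
df = df.set_index("timestamp")

# Count requests per 10-minute bin and country.
counts = df.groupby([pd.Grouper(freq="10min"), "country_code"]).size()

# One column per country; bins with no requests for a country become 0.
wide = counts.unstack("country_code", fill_value=0)
wide = wide.asfreq("10min", fill_value=0)  # regular grid, shared start/end
wide.columns = ["cc=%d" % cc for cc in wide.columns]

# DataFrame.plot draws one line per column against the time index.
ax = wide.plot(figsize=(20, 10), legend=True)
ax.set_ylabel("requests per 10 min")
```

With many countries the legend gets crowded; selecting a subset of columns before plotting, for example `wide.iloc[:, :3].plot()`, keeps the graph readable.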
