0

我有这样的数据框:

   STOREID VARIANT_ARTICLE   PO_DATE  UNITSUM
0    st123           12345  20200427      9.0
1    st123           12345  20200428      3.0
2    st123           12345  20200429     13.0
3    st123           12345  20200430      7.0
4    st123           12345  20200501     16.0
5    st123           12345  20200502      3.0
6    st123           12345  20200503      5.0
7    st123           12345  20200504     10.0
8    st123           12345  20200505      3.0
9    st123           12345  20200506      7.0
10   st123           12345  20200507     29.0
11   st123           12345  20200508      4.0
12   st123           12345  20200509      9.0
13   st123           12345  20200510      8.0
14   st123           12345  20200511      5.0
15   st123           12345  20200513      8.0
16   st123           12345  20200514      2.0
17   st123           12345  20200515      2.0
18   st123           12345  20200516      2.0

我想计算列的rolling sum和。这里的问题是我需要计算过去 4 天(例如)而不是前 4 条记录,这实际上意味着,对于我的示例中的行,要聚合的日期范围是. 由于没有 的条目,我们对 3 个可用行进行聚合,并且不包括在我们的计算中(就像在滚动函数中所做的那样)。avgUNITSUM15th20200510 - 202005132020051220200509pandas

有没有办法做到这一点?

编辑:我必须使用dask-dataframeAPI 来实现这一点。

4

1 回答 1

1

dask 数据帧与 pandas API 具有相同的语法:

In [38]: ddf = dask.datasets.timeseries()

In [39]: ddf.head()
Out[39]:
                       id    name         x         y
timestamp
2000-01-01 00:00:00  1003  George -0.287285  0.773949
2000-01-01 00:00:01   992  Oliver -0.738190  0.893916
2000-01-01 00:00:02   972   Jerry  0.080410 -0.972037
2000-01-01 00:00:03   970  George -0.402327  0.034718
2000-01-01 00:00:04  1034   Alice -0.694517  0.646178

In [40]: ddf.x.rolling(4).agg({'sum': 'sum', 'mean': 'mean'}).head()
Out[40]:
                          sum      mean
timestamp
2000-01-01 00:00:00       NaN       NaN
2000-01-01 00:00:01       NaN       NaN
2000-01-01 00:00:02       NaN       NaN
2000-01-01 00:00:03 -1.347393 -0.336848
2000-01-01 00:00:04 -1.754625 -0.438656

于 2020-06-19T18:45:35.547 回答