I have a puzzle. This is easy in Excel, but in pandas, given the DataFrame df:

   |  EventID  |  PictureID  |  Date
0  |  1        |  A          |  2010-01-01
1  |  2        |  A          |  2010-02-01
2  |  3        |  A          |  2010-02-15
3  |  4        |  B          |  2010-01-01
4  |  5        |  C          |  2010-02-01
5  |  6        |  C          |  2010-02-15

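Is there a way to add a new column that counts how many events were recorded for the same PictureID in the past six months? In other words, for each row, the number of rows in the DataFrame that have the same PictureID and a Date within the six months before that row's Date.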

df['PastSix'] = ???

So the output would look like:

   |  EventID  |  PictureID  |  Date        |  PastSix
0  |  1        |  A          |  2010-01-01  |  0
1  |  2        |  A          |  2010-02-01  |  1
2  |  3        |  A          |  2010-02-15  |  2
3  |  4        |  B          |  2010-01-01  |  0
4  |  5        |  C          |  2010-02-01  |  0
5  |  6        |  C          |  2010-02-15  |  1

1 Answer

I wasn't sure how to define six months, so I used the previous 183 days instead. The basic idea is to use the asof() method:

import pandas as pd
import numpy as np
import io

txt = u"""EventID  |  PictureID  |  Date
0        |  A          |  2009-07-01
1        |  A          |  2010-01-01
2        |  A          |  2010-02-01
3        |  A          |  2010-02-15
4        |  B          |  2010-01-01
5        |  C          |  2010-02-01
6        |  C          |  2010-02-15
7        |  A          |  2010-08-01
"""

# A regex separator requires the Python parsing engine.
df = pd.read_csv(io.StringIO(txt), sep=r"\s*\|\s*", engine="python", parse_dates=["Date"])

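def f(df):
    # asof() needs a sorted index, so order this group's events by date.
    df = df.sort_values("Date")
    # Running event count (1, 2, 3, ...) indexed by event date.
    count = pd.Series(np.arange(1, len(df)+1), index=df["Date"])
    prev1day = count.index.shift(-1, freq="D")
    prev6month = count.index.shift(-183, freq="D")
    # Events up to the previous day minus events up to 183 days earlier.
    result = count.asof(prev1day).fillna(0).values - count.asof(prev6month).fillna(0).values
    return pd.Series(result.astype(int), df.index)

# group_keys=False keeps the original row index so the result aligns with df.
df["PastSix"] = df.groupby("PictureID", group_keys=False).apply(f)
print(df)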

Output:

   EventID PictureID                Date  PastSix
0        0         A 2009-07-01 00:00:00        0
1        1         A 2010-01-01 00:00:00        0
2        2         A 2010-02-01 00:00:00        1
3        3         A 2010-02-15 00:00:00        2
4        4         B 2010-01-01 00:00:00        0
5        5         C 2010-02-01 00:00:00        0
6        6         C 2010-02-15 00:00:00        1
7        7         A 2010-08-01 00:00:00        2
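
The same "count up to yesterday minus count up to 183 days ago" idea can also be sketched with numpy.searchsorted on each group's sorted dates, which avoids asof(). This is only an illustration under some assumptions: the helper name past_window_counts and its days parameter are made up for this example, the row index is assumed to be unique, and events that share the exact same date are not counted as being "before" each other.

```python
import numpy as np
import pandas as pd

# Same sample data as above, built directly instead of parsed from text.
df = pd.DataFrame({
    "EventID": range(8),
    "PictureID": ["A", "A", "A", "A", "B", "C", "C", "A"],
    "Date": pd.to_datetime([
        "2009-07-01", "2010-01-01", "2010-02-01", "2010-02-15",
        "2010-01-01", "2010-02-01", "2010-02-15", "2010-08-01",
    ]),
})

def past_window_counts(dates, days=183):
    """Hypothetical helper: for each date, count the earlier dates in the
    same Series that fall less than `days` days before it."""
    d = dates.sort_values()
    vals = d.to_numpy()
    # Number of events strictly before each date.
    before = np.searchsorted(vals, vals, side="left")
    # Number of events at or before (date - days), i.e. outside the window.
    too_old = np.searchsorted(vals, vals - np.timedelta64(days, "D"), side="right")
    return pd.Series(before - too_old, index=d.index)

# Compute per PictureID, then let pandas align the pieces on the row index.
parts = [past_window_counts(g) for _, g in df.groupby("PictureID")["Date"]]
df["PastSix"] = pd.concat(parts)
print(df)
```

For this sample it yields the same PastSix values as the asof() version. Note that 183 days is only an approximation of six months: for 2010-08-01 the 183-day cutoff falls on 2010-01-30, whereas six calendar months back would be 2010-02-01.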
answered 2013-09-15T12:32:25.710