6

我有一个数据框:

ID  DATE       WIN
A   2015/6/5   Yes
A   2015/6/7   Yes
A   2015/6/7   Yes
A   2015/6/7   Yes
B   2015/6/8   No
B   2015/8/7   Yes
C   2015/5/15  Yes
C   2015/5/30  No
C   2015/7/30  No
C   2015/8/03  Yes

我想添加一个列来计算ID过去 1 个月内每个人的获胜次数,因此结果将如下所示:

ID  DATE       WIN  NumOfDaysSinceLastWin NumOfWinsInThePast30days
A   2015/6/5   Yes           0               0       
A   2015/6/7   Yes           2               1 
A   2015/6/7   Yes           2               1 or (A 2015/6/7 Yes 0 2)
A   2015/6/8   No            1               3 
B   2015/8/7   No            0               0
B   2015/8/7   Yes           0               0
C   2015/5/15  Yes           0               0
C   2015/5/30  No            15              1
C   2015/7/30  No            76              0
C   2015/8/03  Yes           80              0

我如何使用groupby功能并timegrouper获得它?

4

1 回答 1

1

输入数据必须按日期在每个组中排序,在这个数据中就可以了。
输入数据不能很好地映射情况,因此将添加接下来的 4 行。

WIN1是从WIN-和1for的值创建的。我需要它用于两个输出列。'Yes'0'No'

df['WIN1'] = df['WIN'].map(lambda x: 1 if x == 'Yes' else 0)

柱子NumOfDaysSinceLastWin

首先cumsum创建列(累积和)。

df['cumsum'] = df['WIN1'].cumsum()

如果WIN都是'Yes',那就容易了。数据将被分组,日期和前一个日期(-1)值之间的差异在列中diffs

#df['diffs'] = df.groupby(['ID', 'cumsum'])['DATE'].apply(lambda d: (d-d.shift()).fillna(0)) 

但是情况很复杂,因为'No'column 的值WIN。所以如果 value 是'Yes',你需要与previous 'Yes'不同,如果'No'需要与last previous 'WIN'不同。可以通过多种方式计算差异,但是通过减去两列来选择 -DATE和 column date1

date1
行必须分组特殊方式值'No'和最后一个'Yes'。可以通过列中的累积总和cumsum。然后该组的最小值是'Yes'列的值,然后该值重复到具有'No'值的行。列是特殊的 -列count没有重复值。重复的按组递增。cumsum1

df['min'] = df.groupby(['ID','cumsum'])['DATE'].transform('min')
df['count'] = df.groupby(['cumsum'])['cumsum'].transform('count')

前几行中值'YES'的日期对于差异是必要的。Dataframedf1仅过滤 df 的值'Yes',然后按 column 对其进行分组ID。索引不变,因此输出可以映射到 dataframe 的新列df

df1 = df[~df['WIN'].isin(['No'])]
df['date1'] = df1.groupby(['ID'])['DATE'].apply(lambda d: d.shift()) 
print df

   ID       DATE  WIN  WIN1  cumsum        min  count      date1
0   A 2015-06-05  Yes     1       1 2015-06-05      1        NaT
1   A 2015-06-05  Yes     1       2 2015-06-05      1 2015-06-05
2   A 2015-06-07  Yes     1       3 2015-06-07      1 2015-06-05
3   A 2015-06-07  Yes     1       4 2015-06-07      1 2015-06-07
4   A 2015-06-07  Yes     1       5 2015-06-07      4 2015-06-07
5   A 2015-06-08   No     0       5 2015-06-07      4        NaT
6   B 2015-06-07   No     0       5 2015-06-07      4        NaT
7   B 2015-06-07   No     0       5 2015-06-07      4        NaT
8   B 2015-08-07  Yes     1       6 2015-08-07      1        NaT
9   C 2015-05-15  Yes     1       7 2015-05-15      3        NaT
10  C 2015-05-30   No     0       7 2015-05-15      3        NaT
11  C 2015-07-30   No     0       7 2015-05-15      3        NaT
12  C 2015-08-03  Yes     1       8 2015-08-03      1 2015-05-15
13  C 2015-08-03  Yes     1       9 2015-08-03      1 2015-08-03

然后日期列min(值'No'和上一个上一个'Yes')和列date1(其他值'Yes')可以按列连接count
添加了新条件 - column 的值date1将为 null - ( NaT),因为这些值将被 column 覆盖min

df.loc[(df['count'] > 1) & (df['date1'].isnull()), 'date1'] = df['min']
print df

   ID       DATE  WIN  WIN1  cumsum        min  count      date1
0   A 2015-06-05  Yes     1       1 2015-06-05      1 2015-06-05
1   A 2015-06-05  Yes     1       2 2015-06-05      1 2015-06-05
2   A 2015-06-07  Yes     1       3 2015-06-07      1 2015-06-05
3   A 2015-06-07  Yes     1       4 2015-06-07      1 2015-06-07
4   A 2015-06-07  Yes     1       5 2015-06-07      4 2015-06-07
5   A 2015-06-08   No     0       5 2015-06-07      4 2015-06-07
6   B 2015-06-07   No     0       5 2015-06-07      4 2015-06-07
7   B 2015-06-07   No     0       5 2015-06-07      4 2015-06-07
8   B 2015-08-07  Yes     1       6 2015-08-07      1 2015-08-07
9   C 2015-05-15  Yes     1       7 2015-05-15      3 2015-05-15
10  C 2015-05-30   No     0       7 2015-05-15      3 2015-05-15
11  C 2015-07-30   No     0       7 2015-05-15      3 2015-05-15
12  C 2015-08-03  Yes     1       8 2015-08-03      1 2015-05-15
13  C 2015-08-03  Yes     1       9 2015-08-03      1 2015-08-03

重复日期时间 - 子
解决方案 对不起,如果这样做很复杂,也许有人会找到更好的。
我的解决方案是找到重复值,用最后一个填充它们'Yes'并添加到date1差异列中。
这些值在列中标识count。其他(值1)重置为NaN。然后值 fromdate1被复制到date2column count

df['count'] = df1.groupby(['ID', 'DATE', 'WIN1'])['WIN1'].transform('count')
df.loc[df['count'] == 1 , 'count'] = np.nan
df.loc[df['count'].notnull() , 'date2'] = df['date1']
print df    

   ID       DATE  WIN  WIN1  cumsum        min  count      date1      date2
0   A 2015-06-05  Yes     1       1 2015-06-05      2 2015-06-05 2015-06-05
1   A 2015-06-05  Yes     1       2 2015-06-05      2 2015-06-05 2015-06-05
2   A 2015-06-07  Yes     1       3 2015-06-07      3 2015-06-05 2015-06-05
3   A 2015-06-07  Yes     1       4 2015-06-07      3 2015-06-07 2015-06-07
4   A 2015-06-07  Yes     1       5 2015-06-07      3 2015-06-07 2015-06-07
5   A 2015-06-08   No     0       5 2015-06-07    NaN 2015-06-07        NaT
6   B 2015-06-07   No     0       5 2015-06-07    NaN 2015-06-07        NaT
7   B 2015-06-07   No     0       5 2015-06-07    NaN 2015-06-07        NaT
8   B 2015-08-07  Yes     1       6 2015-08-07    NaN 2015-08-07        NaT
9   C 2015-05-15  Yes     1       7 2015-05-15    NaN 2015-05-15        NaT
10  C 2015-05-30   No     0       7 2015-05-15    NaN 2015-05-15        NaT
11  C 2015-07-30   No     0       7 2015-05-15    NaN 2015-05-15        NaT
12  C 2015-08-03  Yes     1       8 2015-08-03      2 2015-05-15 2015-05-15
13  C 2015-08-03  Yes     1       9 2015-08-03      2 2015-08-03 2015-08-03 

然后这些值按组的最小值重复并添加到date1列中。

def repeat_value(grp):
    grp['date2'] = grp['date2'].min()
    return grp

df = df.groupby(['ID', 'DATE']).apply(repeat_value)
df.loc[df1['date2'].notnull() , 'date1'] = df['date2']
print df

   ID       DATE  WIN  WIN1  cumsum        min  count      date1      date2
0   A 2015-06-05  Yes     1       1 2015-06-05      2 2015-06-05 2015-06-05
1   A 2015-06-05  Yes     1       2 2015-06-05      2 2015-06-05 2015-06-05
2   A 2015-06-07  Yes     1       3 2015-06-07      3 2015-06-05 2015-06-05
3   A 2015-06-07  Yes     1       4 2015-06-07      3 2015-06-05 2015-06-05
4   A 2015-06-07  Yes     1       5 2015-06-07      3 2015-06-05 2015-06-05
5   A 2015-06-08   No     0       5 2015-06-07    NaN 2015-06-07        NaT
6   B 2015-06-07   No     0       5 2015-06-07    NaN 2015-06-07        NaT
7   B 2015-06-07   No     0       5 2015-06-07    NaN 2015-06-07        NaT
8   B 2015-08-07  Yes     1       6 2015-08-07    NaN 2015-08-07        NaT
9   C 2015-05-15  Yes     1       7 2015-05-15    NaN 2015-05-15        NaT
10  C 2015-05-30   No     0       7 2015-05-15    NaN 2015-05-15        NaT
11  C 2015-07-30   No     0       7 2015-05-15    NaN 2015-05-15        NaT
12  C 2015-08-03  Yes     1       8 2015-08-03      2 2015-05-15 2015-05-15
13  C 2015-08-03  Yes     1       9 2015-08-03      2 2015-05-15 2015-05-15 

NumOfDaysSinceLastWin由 coldate1和的差异填充DATE。数据类型为Timedelta,因此将转换为整数。最后不必要的列将被删除。(只有列WIN1,并且count是下一个输出列所必需的,因此不会被删除。)

df['NumOfDaysSinceLastWin'] = ((df['DATE'] - df['date1']).fillna(0)).astype('timedelta64[D]')
df = df.drop(['cumsum','min', 'date1'], axis=1 )
print df
   ID       DATE  WIN  WIN1  count  NumOfDaysSinceLastWin
0   A 2015-06-05  Yes     1      2                      0
1   A 2015-06-05  Yes     1      2                      0
2   A 2015-06-07  Yes     1      3                      2
3   A 2015-06-07  Yes     1      3                      2
4   A 2015-06-07  Yes     1      3                      2
5   A 2015-06-08   No     0    NaN                      1
6   B 2015-06-07   No     0    NaN                      0
7   B 2015-06-07   No     0    NaN                      0
8   B 2015-08-07  Yes     1    NaN                      0
9   C 2015-05-15  Yes     1    NaN                      0
10  C 2015-05-30   No     0    NaN                     15
11  C 2015-07-30   No     0    NaN                     76
12  C 2015-08-03  Yes     1      2                     80
13  C 2015-08-03  Yes     1      2                     80

柱子NumOfWinsInThePast30days

滚动总和是你的朋友。列yes(重采样所必需的)由1for'Yes'NaNfor映射'No'

df['yes'] = df['WIN'].map(lambda x: 1 if x == 'Yes' else np.nan)

数据框df2是副本,dfDATE设置为索引(用于重采样)。不必要的列将被删除。

df2 = df.set_index('DATE')
df2 = df2.drop(['NumOfDaysSinceLastWin','WIN', 'WIN1'], axis=1)

然后df2将按天重新采样,如果行是'Yes'1,如果'No'0。(最好参见下面的解释。)

df2 = df2.groupby('ID').resample("D", how='count')
df2 = df2.reset_index()

数据框df2将按这些组进行分组,ID并且函数rolling_sum正在使用这些组。

df2['rollsum'] = df2.groupby('ID')['yes'].transform(pd.rolling_sum, window=30, min_periods=1)

df2为了更好地理解,将显示所有行。

with pd.option_context('display.max_rows', 999, 'display.max_columns', 5):
    print df2

    ID       DATE  yes  rollsum
0    A 2015-06-05    2        2
1    A 2015-06-06    0        2
2    A 2015-06-07    3        5
3    A 2015-06-08    0        5
4    B 2015-06-07    0        0
5    B 2015-06-08    0        0
6    B 2015-06-09    0        0
7    B 2015-06-10    0        0
8    B 2015-06-11    0        0
9    B 2015-06-12    0        0
10   B 2015-06-13    0        0
11   B 2015-06-14    0        0
12   B 2015-06-15    0        0
13   B 2015-06-16    0        0
14   B 2015-06-17    0        0
15   B 2015-06-18    0        0
16   B 2015-06-19    0        0
17   B 2015-06-20    0        0
18   B 2015-06-21    0        0
19   B 2015-06-22    0        0
20   B 2015-06-23    0        0
21   B 2015-06-24    0        0
22   B 2015-06-25    0        0
23   B 2015-06-26    0        0
24   B 2015-06-27    0        0
25   B 2015-06-28    0        0
26   B 2015-06-29    0        0
27   B 2015-06-30    0        0
28   B 2015-07-01    0        0
29   B 2015-07-02    0        0
30   B 2015-07-03    0        0
31   B 2015-07-04    0        0
32   B 2015-07-05    0        0
33   B 2015-07-06    0        0
34   B 2015-07-07    0        0
35   B 2015-07-08    0        0
36   B 2015-07-09    0        0
37   B 2015-07-10    0        0
38   B 2015-07-11    0        0
39   B 2015-07-12    0        0
40   B 2015-07-13    0        0
41   B 2015-07-14    0        0
42   B 2015-07-15    0        0
43   B 2015-07-16    0        0
44   B 2015-07-17    0        0
45   B 2015-07-18    0        0
46   B 2015-07-19    0        0
47   B 2015-07-20    0        0
48   B 2015-07-21    0        0
49   B 2015-07-22    0        0
50   B 2015-07-23    0        0
51   B 2015-07-24    0        0
52   B 2015-07-25    0        0
53   B 2015-07-26    0        0
54   B 2015-07-27    0        0
55   B 2015-07-28    0        0
56   B 2015-07-29    0        0
57   B 2015-07-30    0        0
58   B 2015-07-31    0        0
59   B 2015-08-01    0        0
60   B 2015-08-02    0        0
61   B 2015-08-03    0        0
62   B 2015-08-04    0        0
63   B 2015-08-05    0        0
64   B 2015-08-06    0        0
65   B 2015-08-07    1        1
66   C 2015-05-15    1        1
67   C 2015-05-16    0        1
68   C 2015-05-17    0        1
69   C 2015-05-18    0        1
70   C 2015-05-19    0        1
71   C 2015-05-20    0        1
72   C 2015-05-21    0        1
73   C 2015-05-22    0        1
74   C 2015-05-23    0        1
75   C 2015-05-24    0        1
76   C 2015-05-25    0        1
77   C 2015-05-26    0        1
78   C 2015-05-27    0        1
79   C 2015-05-28    0        1
80   C 2015-05-29    0        1
81   C 2015-05-30    0        1
82   C 2015-05-31    0        1
83   C 2015-06-01    0        1
84   C 2015-06-02    0        1
85   C 2015-06-03    0        1
86   C 2015-06-04    0        1
87   C 2015-06-05    0        1
88   C 2015-06-06    0        1
89   C 2015-06-07    0        1
90   C 2015-06-08    0        1
91   C 2015-06-09    0        1
92   C 2015-06-10    0        1
93   C 2015-06-11    0        1
94   C 2015-06-12    0        1
95   C 2015-06-13    0        1
96   C 2015-06-14    0        0
97   C 2015-06-15    0        0
98   C 2015-06-16    0        0
99   C 2015-06-17    0        0
100  C 2015-06-18    0        0
101  C 2015-06-19    0        0
102  C 2015-06-20    0        0
103  C 2015-06-21    0        0
104  C 2015-06-22    0        0
105  C 2015-06-23    0        0
106  C 2015-06-24    0        0
107  C 2015-06-25    0        0
108  C 2015-06-26    0        0
109  C 2015-06-27    0        0
110  C 2015-06-28    0        0
111  C 2015-06-29    0        0
112  C 2015-06-30    0        0
113  C 2015-07-01    0        0
114  C 2015-07-02    0        0
115  C 2015-07-03    0        0
116  C 2015-07-04    0        0
117  C 2015-07-05    0        0
118  C 2015-07-06    0        0
119  C 2015-07-07    0        0
120  C 2015-07-08    0        0
121  C 2015-07-09    0        0
122  C 2015-07-10    0        0
123  C 2015-07-11    0        0
124  C 2015-07-12    0        0
125  C 2015-07-13    0        0
126  C 2015-07-14    0        0
127  C 2015-07-15    0        0
128  C 2015-07-16    0        0
129  C 2015-07-17    0        0
130  C 2015-07-18    0        0
131  C 2015-07-19    0        0
132  C 2015-07-20    0        0
133  C 2015-07-21    0        0
134  C 2015-07-22    0        0
135  C 2015-07-23    0        0
136  C 2015-07-24    0        0
137  C 2015-07-25    0        0
138  C 2015-07-26    0        0
139  C 2015-07-27    0        0
140  C 2015-07-28    0        0
141  C 2015-07-29    0        0
142  C 2015-07-30    0        0
143  C 2015-07-31    0        0
144  C 2015-08-01    0        0
145  C 2015-08-02    0        0
146  C 2015-08-03    2        2

不必要的列yes将被删除。

df2 = df2.drop(['yes'], axis=1 )

输出正在与第一个 dataframe 合并df

df2 = pd.merge(df,df2,on=['DATE', 'ID'], how='inner')
print df2

  ID       DATE  WIN  WIN1  NumOfDaysSinceLastWin  yes  rollsum
0  A 2015-06-07  Yes     1                      0    1        2
1  A 2015-06-07  Yes     1                      0    1        2
2  B 2015-08-07   No     0                      0  NaN        1
3  B 2015-08-07  Yes     1                      0    1        1
4  C 2015-05-15  Yes     1                      0    1        1
5  C 2015-05-30   No     0                     15  NaN        1
6  C 2015-07-30   No     0                     76  NaN        0
7  C 2015-08-03  Yes     1                     80    1        1

如果 columncount中的值不是null,则将它们添加到 column count。函数是用 valuesrolling_sum计算原始行数,因此必须减去。并且这个值 ( ) 在列中。df'YES'1WIN1

df2.loc[df['count'].notnull() , 'WIN1'] = df2['count']
df2['NumOfWinsInThePast30days'] = df2['rollsum'] - df2['WIN1']

删除不必要的列。

df2 = df2.drop(['yes','WIN1', 'rollsum', 'count'], axis=1 )
print df2
   ID       DATE  WIN  NumOfDaysSinceLastWin  NumOfWinsInThePast30days
0   A 2015-06-05  Yes                      0                         0
1   A 2015-06-05  Yes                      0                         0
2   A 2015-06-07  Yes                      2                         2
3   A 2015-06-07  Yes                      2                         2
4   A 2015-06-07  Yes                      2                         2
5   A 2015-06-08   No                      1                         5
6   B 2015-06-07   No                      0                         0
7   B 2015-06-07   No                      0                         0
8   B 2015-08-07  Yes                      0                         0
9   C 2015-05-15  Yes                      0                         0
10  C 2015-05-30   No                     15                         1
11  C 2015-07-30   No                     76                         0
12  C 2015-08-03  Yes                     80                         0
13  C 2015-08-03  Yes                     80                         0

最后一起:

import pandas as pd
import numpy as np
import io

#original data
temp=u"""ID,DATE,WIN
A,2015/6/5,Yes
A,2015/6/7,Yes
A,2015/6/7,Yes
A,2015/6/8,No
B,2015/6/7,No
B,2015/8/7,Yes
C,2015/5/15,Yes
C,2015/5/30,No
C,2015/7/30,No
C,2015/8/03,Yes"""

#changed repeating data
temp2=u"""ID,DATE,WIN
A,2015/6/5,Yes
A,2015/6/5,Yes
A,2015/6/7,Yes
A,2015/6/7,Yes
A,2015/6/7,Yes
A,2015/6/8,No
B,2015/6/7,No
B,2015/6/7,No
B,2015/8/7,Yes
C,2015/5/15,Yes
C,2015/5/30,No
C,2015/7/30,No
C,2015/8/03,Yes
C,2015/8/03,Yes"""

df = pd.read_csv(io.StringIO(temp2), parse_dates = [1])

df['WIN1'] = df['WIN'].map(lambda x: 1 if x == 'Yes' else 0)
df['cumsum'] = df['WIN1'].cumsum()

#df['diffs'] = df.groupby(['ID', 'cumsum'])['DATE'].apply(lambda d: (d-d.shift()).fillna(0)) 

df['min'] = df.groupby(['ID','cumsum'])['DATE'].transform('min')
df['count'] = df.groupby(['cumsum'])['cumsum'].transform('count')

df1 = df[~df['WIN'].isin(['No'])]

df['date1'] = df1.groupby(['ID'])['DATE'].apply(lambda d: d.shift()) 
print df
df.loc[(df['count'] >= 1) & (df['date1'].isnull()), 'date1'] = df['min']
print df

#resolve repeating datetimes
df['count'] = df1.groupby(['ID', 'DATE', 'WIN1'])['WIN1'].transform('count')
df.loc[df['count'] == 1 , 'count'] = np.nan
df.loc[df['count'].notnull() , 'date2'] = df['date1']
print df

def repeat_value(grp):
    grp['date2'] = grp['date2'].min()
    return grp

df = df.groupby(['ID', 'DATE']).apply(repeat_value)
df.loc[df['date2'].notnull() , 'date1'] = df['date2']
print df

df['NumOfDaysSinceLastWin'] = (df['DATE'] - df['date1']).astype('timedelta64[D]')
df = df.drop(['cumsum','min','date1', 'date2'], axis=1 )
print df

#NumOfWinsInThePast30days
df['yes'] = df['WIN'].map(lambda x: 1 if x == 'Yes' else np.nan)
df2 = df.set_index('DATE')
df2 = df2.drop(['NumOfDaysSinceLastWin','WIN', 'WIN1','count'], axis=1)

df2 = df2.groupby('ID').resample("D", how='count')
df2 = df2.reset_index()

df2['rollsum'] = df2.groupby('ID')['yes'].transform(pd.rolling_sum, window=30, min_periods=1)

#with pd.option_context('display.max_rows', 999, 'display.max_columns', 5):
    #print df2

df2 = df2.drop(['yes'], axis=1 )
df2 = pd.merge(df,df2,on=['DATE', 'ID'], how='inner')
print df2

df2.loc[df['count'].notnull() , 'WIN1'] = df2['count']
df2['NumOfWinsInThePast30days'] = df2['rollsum'] - df2['WIN1']

df2 = df2.drop(['yes','WIN1', 'rollsum', 'count'], axis=1 )
print df2
于 2015-09-19T18:43:32.007 回答