输入数据必须按日期在每个组中排序,在这个数据中就可以了。
输入数据不能很好地映射情况,因此将添加接下来的 4 行。
列WIN1
是从WIN
-和1
for的值创建的。我需要它用于两个输出列。'Yes'
0
'No'
df['WIN1'] = df['WIN'].map(lambda x: 1 if x == 'Yes' else 0)
柱子NumOfDaysSinceLastWin
首先cumsum
创建列(累积和)。
df['cumsum'] = df['WIN1'].cumsum()
如果WIN
都是'Yes'
,那就容易了。数据将被分组,日期和前一个日期(-1)值之间的差异在列中diffs
。
#df['diffs'] = df.groupby(['ID', 'cumsum'])['DATE'].apply(lambda d: (d-d.shift()).fillna(0))
但是情况很复杂,因为'No'
column 的值WIN
。所以如果 value 是'Yes'
,你需要与previous 'Yes'
不同,如果'No'
需要与last previous 'WIN'
不同。可以通过多种方式计算差异,但是通过减去两列来选择 -DATE
和 column date1
。
列date1
行必须分组特殊方式值'No'
和最后一个'Yes'
。可以通过列中的累积总和cumsum
。然后该组的最小值是'Yes'
列的值,然后该值重复到具有'No'
值的行。列是特殊的 -列count
没有重复值。重复的按组递增。cumsum
1
df['min'] = df.groupby(['ID','cumsum'])['DATE'].transform('min')
df['count'] = df.groupby(['cumsum'])['cumsum'].transform('count')
前几行中值'YES'
的日期对于差异是必要的。Dataframedf1
仅过滤 df 的值'Yes'
,然后按 column 对其进行分组ID
。索引不变,因此输出可以映射到 dataframe 的新列df
。
df1 = df[~df['WIN'].isin(['No'])]
df['date1'] = df1.groupby(['ID'])['DATE'].apply(lambda d: d.shift())
print df
ID DATE WIN WIN1 cumsum min count date1
0 A 2015-06-05 Yes 1 1 2015-06-05 1 NaT
1 A 2015-06-05 Yes 1 2 2015-06-05 1 2015-06-05
2 A 2015-06-07 Yes 1 3 2015-06-07 1 2015-06-05
3 A 2015-06-07 Yes 1 4 2015-06-07 1 2015-06-07
4 A 2015-06-07 Yes 1 5 2015-06-07 4 2015-06-07
5 A 2015-06-08 No 0 5 2015-06-07 4 NaT
6 B 2015-06-07 No 0 5 2015-06-07 4 NaT
7 B 2015-06-07 No 0 5 2015-06-07 4 NaT
8 B 2015-08-07 Yes 1 6 2015-08-07 1 NaT
9 C 2015-05-15 Yes 1 7 2015-05-15 3 NaT
10 C 2015-05-30 No 0 7 2015-05-15 3 NaT
11 C 2015-07-30 No 0 7 2015-05-15 3 NaT
12 C 2015-08-03 Yes 1 8 2015-08-03 1 2015-05-15
13 C 2015-08-03 Yes 1 9 2015-08-03 1 2015-08-03
然后日期列min
(值'No'
和上一个上一个'Yes'
)和列date1
(其他值'Yes'
)可以按列连接count
。
添加了新条件 - column 的值date1
将为 null - ( NaT
),因为这些值将被 column 覆盖min
。
df.loc[(df['count'] > 1) & (df['date1'].isnull()), 'date1'] = df['min']
print df
ID DATE WIN WIN1 cumsum min count date1
0 A 2015-06-05 Yes 1 1 2015-06-05 1 2015-06-05
1 A 2015-06-05 Yes 1 2 2015-06-05 1 2015-06-05
2 A 2015-06-07 Yes 1 3 2015-06-07 1 2015-06-05
3 A 2015-06-07 Yes 1 4 2015-06-07 1 2015-06-07
4 A 2015-06-07 Yes 1 5 2015-06-07 4 2015-06-07
5 A 2015-06-08 No 0 5 2015-06-07 4 2015-06-07
6 B 2015-06-07 No 0 5 2015-06-07 4 2015-06-07
7 B 2015-06-07 No 0 5 2015-06-07 4 2015-06-07
8 B 2015-08-07 Yes 1 6 2015-08-07 1 2015-08-07
9 C 2015-05-15 Yes 1 7 2015-05-15 3 2015-05-15
10 C 2015-05-30 No 0 7 2015-05-15 3 2015-05-15
11 C 2015-07-30 No 0 7 2015-05-15 3 2015-05-15
12 C 2015-08-03 Yes 1 8 2015-08-03 1 2015-05-15
13 C 2015-08-03 Yes 1 9 2015-08-03 1 2015-08-03
重复日期时间 - 子
解决方案
对不起,如果这样做很复杂,也许有人会找到更好的。
我的解决方案是找到重复值,用最后一个填充它们'Yes'
并添加到date1
差异列中。
这些值在列中标识count
。其他(值1
)重置为NaN
。然后值 fromdate1
被复制到date2
column count
。
df['count'] = df1.groupby(['ID', 'DATE', 'WIN1'])['WIN1'].transform('count')
df.loc[df['count'] == 1 , 'count'] = np.nan
df.loc[df['count'].notnull() , 'date2'] = df['date1']
print df
ID DATE WIN WIN1 cumsum min count date1 date2
0 A 2015-06-05 Yes 1 1 2015-06-05 2 2015-06-05 2015-06-05
1 A 2015-06-05 Yes 1 2 2015-06-05 2 2015-06-05 2015-06-05
2 A 2015-06-07 Yes 1 3 2015-06-07 3 2015-06-05 2015-06-05
3 A 2015-06-07 Yes 1 4 2015-06-07 3 2015-06-07 2015-06-07
4 A 2015-06-07 Yes 1 5 2015-06-07 3 2015-06-07 2015-06-07
5 A 2015-06-08 No 0 5 2015-06-07 NaN 2015-06-07 NaT
6 B 2015-06-07 No 0 5 2015-06-07 NaN 2015-06-07 NaT
7 B 2015-06-07 No 0 5 2015-06-07 NaN 2015-06-07 NaT
8 B 2015-08-07 Yes 1 6 2015-08-07 NaN 2015-08-07 NaT
9 C 2015-05-15 Yes 1 7 2015-05-15 NaN 2015-05-15 NaT
10 C 2015-05-30 No 0 7 2015-05-15 NaN 2015-05-15 NaT
11 C 2015-07-30 No 0 7 2015-05-15 NaN 2015-05-15 NaT
12 C 2015-08-03 Yes 1 8 2015-08-03 2 2015-05-15 2015-05-15
13 C 2015-08-03 Yes 1 9 2015-08-03 2 2015-08-03 2015-08-03
然后这些值按组的最小值重复并添加到date1
列中。
def repeat_value(grp):
grp['date2'] = grp['date2'].min()
return grp
df = df.groupby(['ID', 'DATE']).apply(repeat_value)
df.loc[df1['date2'].notnull() , 'date1'] = df['date2']
print df
ID DATE WIN WIN1 cumsum min count date1 date2
0 A 2015-06-05 Yes 1 1 2015-06-05 2 2015-06-05 2015-06-05
1 A 2015-06-05 Yes 1 2 2015-06-05 2 2015-06-05 2015-06-05
2 A 2015-06-07 Yes 1 3 2015-06-07 3 2015-06-05 2015-06-05
3 A 2015-06-07 Yes 1 4 2015-06-07 3 2015-06-05 2015-06-05
4 A 2015-06-07 Yes 1 5 2015-06-07 3 2015-06-05 2015-06-05
5 A 2015-06-08 No 0 5 2015-06-07 NaN 2015-06-07 NaT
6 B 2015-06-07 No 0 5 2015-06-07 NaN 2015-06-07 NaT
7 B 2015-06-07 No 0 5 2015-06-07 NaN 2015-06-07 NaT
8 B 2015-08-07 Yes 1 6 2015-08-07 NaN 2015-08-07 NaT
9 C 2015-05-15 Yes 1 7 2015-05-15 NaN 2015-05-15 NaT
10 C 2015-05-30 No 0 7 2015-05-15 NaN 2015-05-15 NaT
11 C 2015-07-30 No 0 7 2015-05-15 NaN 2015-05-15 NaT
12 C 2015-08-03 Yes 1 8 2015-08-03 2 2015-05-15 2015-05-15
13 C 2015-08-03 Yes 1 9 2015-08-03 2 2015-05-15 2015-05-15
列NumOfDaysSinceLastWin
由 coldate1
和的差异填充DATE
。数据类型为Timedelta
,因此将转换为整数。最后不必要的列将被删除。(只有列WIN1
,并且count
是下一个输出列所必需的,因此不会被删除。)
df['NumOfDaysSinceLastWin'] = ((df['DATE'] - df['date1']).fillna(0)).astype('timedelta64[D]')
df = df.drop(['cumsum','min', 'date1'], axis=1 )
print df
ID DATE WIN WIN1 count NumOfDaysSinceLastWin
0 A 2015-06-05 Yes 1 2 0
1 A 2015-06-05 Yes 1 2 0
2 A 2015-06-07 Yes 1 3 2
3 A 2015-06-07 Yes 1 3 2
4 A 2015-06-07 Yes 1 3 2
5 A 2015-06-08 No 0 NaN 1
6 B 2015-06-07 No 0 NaN 0
7 B 2015-06-07 No 0 NaN 0
8 B 2015-08-07 Yes 1 NaN 0
9 C 2015-05-15 Yes 1 NaN 0
10 C 2015-05-30 No 0 NaN 15
11 C 2015-07-30 No 0 NaN 76
12 C 2015-08-03 Yes 1 2 80
13 C 2015-08-03 Yes 1 2 80
柱子NumOfWinsInThePast30days
滚动总和是你的朋友。列yes
(重采样所必需的)由1
for'Yes'
和NaN
for映射'No'
。
df['yes'] = df['WIN'].map(lambda x: 1 if x == 'Yes' else np.nan)
数据框df2
是副本,df
列DATE
设置为索引(用于重采样)。不必要的列将被删除。
df2 = df.set_index('DATE')
df2 = df2.drop(['NumOfDaysSinceLastWin','WIN', 'WIN1'], axis=1)
然后df2
将按天重新采样,如果行是'Yes'
值1
,如果'No'
是0
。(最好参见下面的解释。)
df2 = df2.groupby('ID').resample("D", how='count')
df2 = df2.reset_index()
数据框df2
将按这些组进行分组,ID
并且函数rolling_sum
正在使用这些组。
df2['rollsum'] = df2.groupby('ID')['yes'].transform(pd.rolling_sum, window=30, min_periods=1)
df2
为了更好地理解,将显示所有行。
with pd.option_context('display.max_rows', 999, 'display.max_columns', 5):
print df2
ID DATE yes rollsum
0 A 2015-06-05 2 2
1 A 2015-06-06 0 2
2 A 2015-06-07 3 5
3 A 2015-06-08 0 5
4 B 2015-06-07 0 0
5 B 2015-06-08 0 0
6 B 2015-06-09 0 0
7 B 2015-06-10 0 0
8 B 2015-06-11 0 0
9 B 2015-06-12 0 0
10 B 2015-06-13 0 0
11 B 2015-06-14 0 0
12 B 2015-06-15 0 0
13 B 2015-06-16 0 0
14 B 2015-06-17 0 0
15 B 2015-06-18 0 0
16 B 2015-06-19 0 0
17 B 2015-06-20 0 0
18 B 2015-06-21 0 0
19 B 2015-06-22 0 0
20 B 2015-06-23 0 0
21 B 2015-06-24 0 0
22 B 2015-06-25 0 0
23 B 2015-06-26 0 0
24 B 2015-06-27 0 0
25 B 2015-06-28 0 0
26 B 2015-06-29 0 0
27 B 2015-06-30 0 0
28 B 2015-07-01 0 0
29 B 2015-07-02 0 0
30 B 2015-07-03 0 0
31 B 2015-07-04 0 0
32 B 2015-07-05 0 0
33 B 2015-07-06 0 0
34 B 2015-07-07 0 0
35 B 2015-07-08 0 0
36 B 2015-07-09 0 0
37 B 2015-07-10 0 0
38 B 2015-07-11 0 0
39 B 2015-07-12 0 0
40 B 2015-07-13 0 0
41 B 2015-07-14 0 0
42 B 2015-07-15 0 0
43 B 2015-07-16 0 0
44 B 2015-07-17 0 0
45 B 2015-07-18 0 0
46 B 2015-07-19 0 0
47 B 2015-07-20 0 0
48 B 2015-07-21 0 0
49 B 2015-07-22 0 0
50 B 2015-07-23 0 0
51 B 2015-07-24 0 0
52 B 2015-07-25 0 0
53 B 2015-07-26 0 0
54 B 2015-07-27 0 0
55 B 2015-07-28 0 0
56 B 2015-07-29 0 0
57 B 2015-07-30 0 0
58 B 2015-07-31 0 0
59 B 2015-08-01 0 0
60 B 2015-08-02 0 0
61 B 2015-08-03 0 0
62 B 2015-08-04 0 0
63 B 2015-08-05 0 0
64 B 2015-08-06 0 0
65 B 2015-08-07 1 1
66 C 2015-05-15 1 1
67 C 2015-05-16 0 1
68 C 2015-05-17 0 1
69 C 2015-05-18 0 1
70 C 2015-05-19 0 1
71 C 2015-05-20 0 1
72 C 2015-05-21 0 1
73 C 2015-05-22 0 1
74 C 2015-05-23 0 1
75 C 2015-05-24 0 1
76 C 2015-05-25 0 1
77 C 2015-05-26 0 1
78 C 2015-05-27 0 1
79 C 2015-05-28 0 1
80 C 2015-05-29 0 1
81 C 2015-05-30 0 1
82 C 2015-05-31 0 1
83 C 2015-06-01 0 1
84 C 2015-06-02 0 1
85 C 2015-06-03 0 1
86 C 2015-06-04 0 1
87 C 2015-06-05 0 1
88 C 2015-06-06 0 1
89 C 2015-06-07 0 1
90 C 2015-06-08 0 1
91 C 2015-06-09 0 1
92 C 2015-06-10 0 1
93 C 2015-06-11 0 1
94 C 2015-06-12 0 1
95 C 2015-06-13 0 1
96 C 2015-06-14 0 0
97 C 2015-06-15 0 0
98 C 2015-06-16 0 0
99 C 2015-06-17 0 0
100 C 2015-06-18 0 0
101 C 2015-06-19 0 0
102 C 2015-06-20 0 0
103 C 2015-06-21 0 0
104 C 2015-06-22 0 0
105 C 2015-06-23 0 0
106 C 2015-06-24 0 0
107 C 2015-06-25 0 0
108 C 2015-06-26 0 0
109 C 2015-06-27 0 0
110 C 2015-06-28 0 0
111 C 2015-06-29 0 0
112 C 2015-06-30 0 0
113 C 2015-07-01 0 0
114 C 2015-07-02 0 0
115 C 2015-07-03 0 0
116 C 2015-07-04 0 0
117 C 2015-07-05 0 0
118 C 2015-07-06 0 0
119 C 2015-07-07 0 0
120 C 2015-07-08 0 0
121 C 2015-07-09 0 0
122 C 2015-07-10 0 0
123 C 2015-07-11 0 0
124 C 2015-07-12 0 0
125 C 2015-07-13 0 0
126 C 2015-07-14 0 0
127 C 2015-07-15 0 0
128 C 2015-07-16 0 0
129 C 2015-07-17 0 0
130 C 2015-07-18 0 0
131 C 2015-07-19 0 0
132 C 2015-07-20 0 0
133 C 2015-07-21 0 0
134 C 2015-07-22 0 0
135 C 2015-07-23 0 0
136 C 2015-07-24 0 0
137 C 2015-07-25 0 0
138 C 2015-07-26 0 0
139 C 2015-07-27 0 0
140 C 2015-07-28 0 0
141 C 2015-07-29 0 0
142 C 2015-07-30 0 0
143 C 2015-07-31 0 0
144 C 2015-08-01 0 0
145 C 2015-08-02 0 0
146 C 2015-08-03 2 2
不必要的列yes
将被删除。
df2 = df2.drop(['yes'], axis=1 )
输出正在与第一个 dataframe 合并df
。
df2 = pd.merge(df,df2,on=['DATE', 'ID'], how='inner')
print df2
ID DATE WIN WIN1 NumOfDaysSinceLastWin yes rollsum
0 A 2015-06-07 Yes 1 0 1 2
1 A 2015-06-07 Yes 1 0 1 2
2 B 2015-08-07 No 0 0 NaN 1
3 B 2015-08-07 Yes 1 0 1 1
4 C 2015-05-15 Yes 1 0 1 1
5 C 2015-05-30 No 0 15 NaN 1
6 C 2015-07-30 No 0 76 NaN 0
7 C 2015-08-03 Yes 1 80 1 1
如果 columncount
中的值不是null
,则将它们添加到 column count
。函数是用 valuesrolling_sum
计算原始行数,因此必须减去。并且这个值 ( ) 在列中。df
'YES'
1
WIN1
df2.loc[df['count'].notnull() , 'WIN1'] = df2['count']
df2['NumOfWinsInThePast30days'] = df2['rollsum'] - df2['WIN1']
删除不必要的列。
df2 = df2.drop(['yes','WIN1', 'rollsum', 'count'], axis=1 )
print df2
ID DATE WIN NumOfDaysSinceLastWin NumOfWinsInThePast30days
0 A 2015-06-05 Yes 0 0
1 A 2015-06-05 Yes 0 0
2 A 2015-06-07 Yes 2 2
3 A 2015-06-07 Yes 2 2
4 A 2015-06-07 Yes 2 2
5 A 2015-06-08 No 1 5
6 B 2015-06-07 No 0 0
7 B 2015-06-07 No 0 0
8 B 2015-08-07 Yes 0 0
9 C 2015-05-15 Yes 0 0
10 C 2015-05-30 No 15 1
11 C 2015-07-30 No 76 0
12 C 2015-08-03 Yes 80 0
13 C 2015-08-03 Yes 80 0
最后一起:
import pandas as pd
import numpy as np
import io
#original data
temp=u"""ID,DATE,WIN
A,2015/6/5,Yes
A,2015/6/7,Yes
A,2015/6/7,Yes
A,2015/6/8,No
B,2015/6/7,No
B,2015/8/7,Yes
C,2015/5/15,Yes
C,2015/5/30,No
C,2015/7/30,No
C,2015/8/03,Yes"""
#changed repeating data
temp2=u"""ID,DATE,WIN
A,2015/6/5,Yes
A,2015/6/5,Yes
A,2015/6/7,Yes
A,2015/6/7,Yes
A,2015/6/7,Yes
A,2015/6/8,No
B,2015/6/7,No
B,2015/6/7,No
B,2015/8/7,Yes
C,2015/5/15,Yes
C,2015/5/30,No
C,2015/7/30,No
C,2015/8/03,Yes
C,2015/8/03,Yes"""
df = pd.read_csv(io.StringIO(temp2), parse_dates = [1])
df['WIN1'] = df['WIN'].map(lambda x: 1 if x == 'Yes' else 0)
df['cumsum'] = df['WIN1'].cumsum()
#df['diffs'] = df.groupby(['ID', 'cumsum'])['DATE'].apply(lambda d: (d-d.shift()).fillna(0))
df['min'] = df.groupby(['ID','cumsum'])['DATE'].transform('min')
df['count'] = df.groupby(['cumsum'])['cumsum'].transform('count')
df1 = df[~df['WIN'].isin(['No'])]
df['date1'] = df1.groupby(['ID'])['DATE'].apply(lambda d: d.shift())
print df
df.loc[(df['count'] >= 1) & (df['date1'].isnull()), 'date1'] = df['min']
print df
#resolve repeating datetimes
df['count'] = df1.groupby(['ID', 'DATE', 'WIN1'])['WIN1'].transform('count')
df.loc[df['count'] == 1 , 'count'] = np.nan
df.loc[df['count'].notnull() , 'date2'] = df['date1']
print df
def repeat_value(grp):
grp['date2'] = grp['date2'].min()
return grp
df = df.groupby(['ID', 'DATE']).apply(repeat_value)
df.loc[df['date2'].notnull() , 'date1'] = df['date2']
print df
df['NumOfDaysSinceLastWin'] = (df['DATE'] - df['date1']).astype('timedelta64[D]')
df = df.drop(['cumsum','min','date1', 'date2'], axis=1 )
print df
#NumOfWinsInThePast30days
df['yes'] = df['WIN'].map(lambda x: 1 if x == 'Yes' else np.nan)
df2 = df.set_index('DATE')
df2 = df2.drop(['NumOfDaysSinceLastWin','WIN', 'WIN1','count'], axis=1)
df2 = df2.groupby('ID').resample("D", how='count')
df2 = df2.reset_index()
df2['rollsum'] = df2.groupby('ID')['yes'].transform(pd.rolling_sum, window=30, min_periods=1)
#with pd.option_context('display.max_rows', 999, 'display.max_columns', 5):
#print df2
df2 = df2.drop(['yes'], axis=1 )
df2 = pd.merge(df,df2,on=['DATE', 'ID'], how='inner')
print df2
df2.loc[df['count'].notnull() , 'WIN1'] = df2['count']
df2['NumOfWinsInThePast30days'] = df2['rollsum'] - df2['WIN1']
df2 = df2.drop(['yes','WIN1', 'rollsum', 'count'], axis=1 )
print df2