This is non-trivial. I'll explain why below.
Preliminaries: read in the original data frame and make sure the ts column has dtype datetime64[ns].
# you may need to do this to get the correct dtype
df['ts'] = pd.to_datetime(df['ts'])
In [107]: df
Out[107]:
uuid site ts visit
0 +CW99 1124 2013-06-24 00:00:00 2
1 +CW99 1124 2013-06-26 00:00:00 1
2 +CW99 1124 2013-06-27 00:00:00 1
3 +CW99 1124 2013-06-20 00:00:00 1
4 +CW99 1124 2013-06-21 00:00:00 1
5 +CW99 1124 2013-06-24 00:00:00 2
6 +CW9W 956 2013-06-21 00:00:00 4
7 +CW9W 956 2013-06-22 00:00:00 2
8 +CW9W 956 2013-06-23 00:00:00 3
9 +CW9W 956 2013-06-24 00:00:00 4
In [106]: df.dtypes
Out[106]:
uuid object
site int64
ts datetime64[ns]
visit int64
dtype: object
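For anyone who wants to reproduce the frame above, here is a minimal sketch. The values are copied from the output shown; starting from string timestamps is my assumption about what the raw data looks like before the conversion.

```python
import pandas as pd

# Rows copied from the frame shown above; ts starts out as plain strings
# (an assumption about the raw input).
df = pd.DataFrame({
    "uuid": ["+CW99"] * 6 + ["+CW9W"] * 4,
    "site": [1124] * 6 + [956] * 4,
    "ts": ["2013-06-24", "2013-06-26", "2013-06-27", "2013-06-20", "2013-06-21",
           "2013-06-24", "2013-06-21", "2013-06-22", "2013-06-23", "2013-06-24"],
    "visit": [2, 1, 1, 1, 1, 2, 4, 2, 3, 4],
})

# pd.to_datetime is a top-level function; it returns a datetime Series here.
df["ts"] = pd.to_datetime(df["ts"])
print(df.dtypes)
```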
Create the master time series between the min and max timestamps:
In [110]: all_ts = pd.date_range(df['ts'].min(),df['ts'].max())
In [111]: all_ts
Out[111]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-06-20 00:00:00, ..., 2013-06-27 00:00:00]
Length: 8, Freq: D, Timezone: None
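To make the idea of a master series concrete, here is a small sketch of how it can be compared against a group's own dates. The three sample dates are made up for illustration, and Index.difference is my choice for the set difference; in recent pandas versions the `-` operator on an Index no longer means set difference.

```python
import pandas as pd

# Three illustrative dates (hypothetical, not the full frame).
ts = pd.to_datetime(pd.Series(["2013-06-24", "2013-06-20", "2013-06-27"]))

# Daily range from the earliest to the latest timestamp (daily is the default frequency).
all_ts = pd.date_range(ts.min(), ts.max(), freq="D")

# Dates in the master range that are absent from ts.
missing = all_ts.difference(pd.DatetimeIndex(ts))
print(missing)
```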
Define a function like this:
In [103]: def f(x):
   .....:     # we want all of the master time series dates that are not in this group's ``ts`` column
   .....:     missing = all_ts.difference(pd.Index(x['ts']))
   .....:     adf = pd.DataFrame(dict(ts=missing), columns=df.columns)
   .....:     # they should have a visit of 0
   .....:     adf['visit'] = 0
   .....:     # first add them to the frame (x), ignoring the index,
   .....:     # then sort by the ts column,
   .....:     # then forward-fill the missing values
   .....:     return pd.concat([x, adf], ignore_index=True).sort_values('ts').ffill()
   .....:
Apply the function (you could also group by uuid and site if you want):
In [116]: df.groupby('uuid').apply(f)
Out[116]:
uuid site ts visit
uuid
+CW99 3 +CW99 1124 2013-06-20 00:00:00 1
4 +CW99 1124 2013-06-21 00:00:00 1
0 +CW99 1124 2013-06-24 00:00:00 2
5 +CW99 1124 2013-06-24 00:00:00 2
6 +CW99 1124 2013-06-25 00:00:00 0
1 +CW99 1124 2013-06-26 00:00:00 1
2 +CW99 1124 2013-06-27 00:00:00 1
+CW9W 0 +CW9W 956 2013-06-21 00:00:00 4
1 +CW9W 956 2013-06-22 00:00:00 2
2 +CW9W 956 2013-06-23 00:00:00 3
3 +CW9W 956 2013-06-24 00:00:00 4
4 +CW9W 956 2013-06-25 00:00:00 0
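The variant that groups by both uuid and site can be sketched as a self-contained script. Treat it as a sketch under a few assumptions: the helper name fill_missing is made up for illustration, and it loops over the groups explicitly instead of calling apply, so it does not depend on how apply treats the grouping columns in a given pandas version. It adds a row for every date of all_ts that a group lacks.

```python
import pandas as pd

df = pd.DataFrame({
    "uuid": ["+CW99"] * 6 + ["+CW9W"] * 4,
    "site": [1124] * 6 + [956] * 4,
    "ts": pd.to_datetime([
        "2013-06-24", "2013-06-26", "2013-06-27", "2013-06-20", "2013-06-21",
        "2013-06-24", "2013-06-21", "2013-06-22", "2013-06-23", "2013-06-24",
    ]),
    "visit": [2, 1, 1, 1, 1, 2, 4, 2, 3, 4],
})
all_ts = pd.date_range(df["ts"].min(), df["ts"].max())


def fill_missing(x):
    """Hypothetical helper: add a visit=0 row for each master date the group lacks."""
    missing = all_ts.difference(pd.DatetimeIndex(x["ts"]))
    adf = pd.DataFrame({"ts": missing, "visit": 0})
    # Append the new rows, sort by ts (stable, so duplicates keep their order),
    # then forward-fill uuid and site into the added rows.
    out = pd.concat([x, adf], ignore_index=True)
    return out.sort_values("ts", kind="mergesort").ffill()


# Group by both keys and stitch the filled pieces back together.
pieces = [fill_missing(g) for _, g in df.groupby(["uuid", "site"])]
result = pd.concat(pieces, ignore_index=True)
print(result)
```

A date added before a group's first observation has nothing to forward-fill from, so uuid and site stay NaN on that row.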
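Note: you have a duplicate in the posted frame. I'm not sure whether that was intended, so I preserved it. If you don't have duplicates (in the ts column), this is a much easier problem.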
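If you want to check for such duplicates, or drop them, something like the following should work. It is only a sketch: treating a (uuid, ts) pair as the duplicate key and keeping the first occurrence are both my assumptions, so pick whatever rule fits your data.

```python
import pandas as pd

df = pd.DataFrame({
    "uuid": ["+CW99"] * 6 + ["+CW9W"] * 4,
    "site": [1124] * 6 + [956] * 4,
    "ts": pd.to_datetime([
        "2013-06-24", "2013-06-26", "2013-06-27", "2013-06-20", "2013-06-21",
        "2013-06-24", "2013-06-21", "2013-06-22", "2013-06-23", "2013-06-24",
    ]),
    "visit": [2, 1, 1, 1, 1, 2, 4, 2, 3, 4],
})

# Show every row whose (uuid, ts) pair occurs more than once.
dups = df[df.duplicated(subset=["uuid", "ts"], keep=False)]
print(dups)

# Keep the first row of each (uuid, ts) pair (assumed rule).
df_no_dups = df.drop_duplicates(subset=["uuid", "ts"])
```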
Here is the no-duplicates way:
In [207]: def f(x):
   .....:     x = x.set_index('ts').reindex(all_ts).reset_index()
   .....:     x['visit'] = x['visit'].fillna(0)
   .....:     return x.ffill()
   .....:
In [208]: df_no_dups.groupby('uuid').apply(f)
Out[208]:
index uuid site visit
uuid
+CW99 0 2013-06-20 00:00:00 +CW99 1124 1
1 2013-06-21 00:00:00 +CW99 1124 1
2 2013-06-22 00:00:00 +CW99 1124 0
3 2013-06-23 00:00:00 +CW99 1124 0
4 2013-06-24 00:00:00 +CW99 1124 2
5 2013-06-25 00:00:00 +CW99 1124 0
6 2013-06-26 00:00:00 +CW99 1124 1
7 2013-06-27 00:00:00 +CW99 1124 1
+CW9W 0 2013-06-20 00:00:00 NaN NaN 0
1 2013-06-21 00:00:00 +CW9W 956 4
2 2013-06-22 00:00:00 +CW9W 956 2
3 2013-06-23 00:00:00 +CW9W 956 3
4 2013-06-24 00:00:00 +CW9W 956 4
5 2013-06-25 00:00:00 +CW9W 956 0
6 2013-06-26 00:00:00 +CW9W 956 0
7 2013-06-27 00:00:00 +CW9W 956 0
This forces all of the elements to be there (note the NaN, since there is no way to ffill onto the first element). You can drop these if you like.
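Dropping those leading rows could look like this. Again a sketch with assumptions: reindex_group is a made-up name for the same reindexing idea, the frame is the posted data with the duplicate row removed, and the groups are looped over explicitly rather than passed through apply.

```python
import pandas as pd

# Posted data without the duplicated (+CW99, 2013-06-24) row.
df_no_dups = pd.DataFrame({
    "uuid": ["+CW99"] * 5 + ["+CW9W"] * 4,
    "site": [1124] * 5 + [956] * 4,
    "ts": pd.to_datetime([
        "2013-06-24", "2013-06-26", "2013-06-27", "2013-06-20", "2013-06-21",
        "2013-06-21", "2013-06-22", "2013-06-23", "2013-06-24",
    ]),
    "visit": [2, 1, 1, 1, 1, 4, 2, 3, 4],
})
all_ts = pd.date_range(df_no_dups["ts"].min(), df_no_dups["ts"].max())


def reindex_group(x):
    """Hypothetical helper: expand one group to every date in all_ts."""
    x = x.set_index("ts").reindex(all_ts).rename_axis("ts").reset_index()
    x["visit"] = x["visit"].fillna(0)  # dates with no record count as 0 visits
    return x.ffill()                   # carry uuid and site forward


filled = pd.concat(
    [reindex_group(g) for _, g in df_no_dups.groupby("uuid")],
    ignore_index=True,
)

# Rows before a group's first observation have no uuid/site to fill from; drop them.
dropped = filled.dropna(subset=["uuid", "site"])
print(dropped)
```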