最简单和最快的方法是转换为numpy array
byto_numpy
然后转换:
df['month'] = df['purchase_date'].to_numpy().astype('datetime64[M]')
print (df)
user_id purchase_date month
0 1 2015-01-23 14:05:21 2015-01-01
1 2 2015-02-05 05:07:30 2015-02-01
2 3 2015-02-18 17:08:51 2015-02-01
3 4 2015-03-21 17:07:30 2015-03-01
4 5 2015-03-11 18:32:56 2015-03-01
5 6 2015-03-03 11:02:30 2015-03-01
如果每月的第一天,使用floor
andpd.offsets.MonthBegin(1)
和添加正确输出的另一种解决方案:pd.offsets.MonthEnd(0)
df['month'] = (df['purchase_date'].dt.floor('d') +
pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(1))
print (df)
user_id purchase_date month
0 1 2015-01-23 14:05:21 2015-01-01
1 2 2015-02-05 05:07:30 2015-02-01
2 3 2015-02-18 17:08:51 2015-02-01
3 4 2015-03-21 17:07:30 2015-03-01
4 5 2015-03-11 18:32:56 2015-03-01
5 6 2015-03-03 11:02:30 2015-03-01
df['month'] = ((df['purchase_date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(1))
.dt.floor('d'))
print (df)
user_id purchase_date month
0 1 2015-01-23 14:05:21 2015-01-01
1 2 2015-02-05 05:07:30 2015-02-01
2 3 2015-02-18 17:08:51 2015-02-01
3 4 2015-03-21 17:07:30 2015-03-01
4 5 2015-03-11 18:32:56 2015-03-01
5 6 2015-03-03 11:02:30 2015-03-01
最后一个解决方案month period
由以下人员创建to_period
:
df['month'] = df['purchase_date'].dt.to_period('M')
print (df)
user_id purchase_date month
0 1 2015-01-23 14:05:21 2015-01
1 2 2015-02-05 05:07:30 2015-02
2 3 2015-02-18 17:08:51 2015-02
3 4 2015-03-21 17:07:30 2015-03
4 5 2015-03-11 18:32:56 2015-03
5 6 2015-03-03 11:02:30 2015-03
...然后到datetimes
by to_timestamp
,但它有点慢:
df['month'] = df['purchase_date'].dt.to_period('M').dt.to_timestamp()
print (df)
user_id purchase_date month
0 1 2015-01-23 14:05:21 2015-01-01
1 2 2015-02-05 05:07:30 2015-02-01
2 3 2015-02-18 17:08:51 2015-02-01
3 4 2015-03-21 17:07:30 2015-03-01
4 5 2015-03-11 18:32:56 2015-03-01
5 6 2015-03-03 11:02:30 2015-03-01
有很多解决方案,所以:
时间安排(在熊猫 1.2.3 中):
rng = pd.date_range('1980-04-01 15:41:12', periods=100000, freq='20H')
df = pd.DataFrame({'purchase_date': rng})
print (df.head())
In [70]: %timeit df['purchase_date'].to_numpy().astype('datetime64[M]')
8.6 ms ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [71]: %timeit df['purchase_date'].dt.floor('d') + pd.offsets.MonthEnd(n=0) - pd.offsets.MonthBegin(n=1)
23 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [72]: %timeit (df['purchase_date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(1)).dt.floor('d')
23.6 ms ± 97.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [73]: %timeit df['purchase_date'].dt.to_period('M')
9.25 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [74]: %timeit df['purchase_date'].dt.to_period('M').dt.to_timestamp()
17.6 ms ± 485 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [76]: %timeit df['purchase_date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(normalize=True)
23.1 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [77]: %timeit df['purchase_date'].dt.normalize().map(MonthBegin().rollback)
1.66 s ± 7.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)