
I have a DataFrame in pandas.

One of the columns is a timestamp. I use the following to remove all weekends from the data:

df = df[df['TIMESTAMP'].apply(pd.datetime.weekday)<5]

The code takes 9 seconds to run. Is there a faster way to do this?

Thanks in advance.


2 Answers


For completeness...

In [1]: df = DataFrame(randn(100000,2),columns=list('AB'))

In [6]: df['time'] = date_range('19700101',periods=100000)

In [7]: df.tail()
Out[7]: 
              A         B                time
99995  0.481596 -0.622861 2243-10-12 00:00:00
99996 -1.000646  0.415413 2243-10-13 00:00:00
99997  0.054219 -0.669477 2243-10-14 00:00:00
99998 -1.246848  0.690656 2243-10-15 00:00:00
99999 -2.186820 -0.597221 2243-10-16 00:00:00

In [8]: df.head()
Out[8]: 
          A         B                time
0 -0.011530 -0.609354 1970-01-01 00:00:00
1  0.652302 -0.229030 1970-01-02 00:00:00
2 -1.703967  0.880957 1970-01-03 00:00:00
3  2.000682 -1.250603 1970-01-04 00:00:00
4  0.483412  2.233786 1970-01-05 00:00:00

In [10]: pd.DatetimeIndex(df.time).weekday
Out[10]: array([3, 4, 5, ..., 5, 6, 0], dtype=int32)

In [11]: df[pd.DatetimeIndex(df.time).weekday<5]
Out[11]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 71428 entries, 0 to 99999
Data columns (total 3 columns):
A       71428  non-null values
B       71428  non-null values
time    71428  non-null values
dtypes: datetime64[ns](1), float64(2)

In [12]: df[pd.DatetimeIndex(df.time).weekday<5].head()
Out[12]: 
          A         B                time
0 -0.011530 -0.609354 1970-01-01 00:00:00
1  0.652302 -0.229030 1970-01-02 00:00:00
4  0.483412  2.233786 1970-01-05 00:00:00
5  0.264460 -0.135544 1970-01-06 00:00:00
6  0.037285  0.592312 1970-01-07 00:00:00

In [13]: %timeit  df[pd.DatetimeIndex(df.time).weekday<5]
10 loops, best of 3: 41.4 ms per loop
answered 2013-07-23T15:43:49.177

A faster alternative is to first convert the Series to a DatetimeIndex (which has a weekday attribute):

df[pd.DatetimeIndex(df['TIMESTAMP']).weekday < 5]
answered 2013-07-23T15:41:33.397
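
For readers on pandas releases newer than the ones used in this thread, the same vectorized filter can probably be written with the Series `.dt` accessor, which avoids building a DatetimeIndex by hand. The following is only a minimal sketch: it assumes the TIMESTAMP column already has a datetime64 dtype, and the sample frame (column A, 100,000 daily rows starting 1970-01-01) is made up for illustration, mirroring the session in the other answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 100,000 consecutive days starting 1970-01-01 (a Thursday).
df = pd.DataFrame({
    "A": np.random.randn(100_000),
    "TIMESTAMP": pd.date_range("1970-01-01", periods=100_000, freq="D"),
})

# Series.dt.weekday is computed for the whole column at once
# (Monday=0 ... Sunday=6), so no Python-level apply() is involved.
weekdays_only = df[df["TIMESTAMP"].dt.weekday < 5]

# The DatetimeIndex approach from the answers, for comparison.
via_index = df[pd.DatetimeIndex(df["TIMESTAMP"]).weekday < 5]

print(len(weekdays_only), weekdays_only.equals(via_index))
```

Both expressions build a boolean mask over the whole column in one pass, which is where the speedup over `apply` comes from. If the column holds strings rather than datetimes, it would first need converting, for example with `pd.to_datetime`.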