While profiling my code, I was quite surprised to find that pd.to_datetime is a large drag on performance (62 s out of 91 s in my use case). So I may not be using the function as I should.

As a simple example, I need to convert timestamp = 623289600000000000L to a date/timestamp format.

import datetime
import time
import pandas as pd
timestamp = 623289600000000000L

timeit pd.to_datetime(timestamp, unit = 'ns')
10000 loops, best of 3: 46.9 us per loop

timeit time.ctime(timestamp/10**9)
1000000 loops, best of 3: 904 ns per loop

timeit time.localtime(timestamp/10**9)
1000000 loops, best of 3: 1.13 us per loop

timeit datetime.datetime.fromtimestamp(timestamp/10**9)
1000000 loops, best of 3: 1.51 us per loop

timeit datetime.datetime.utcfromtimestamp(timestamp/10**9)
1000000 loops, best of 3: 1.29 us per loop

I am aware that each of these functions returns a different object; however, pd.to_datetime is by far the slowest. Is that expected?

I now use datetime.datetime.utcfromtimestamp in my code and it works fine. However, I would rather keep using pandas. Also, pandas handles pre-1970 dates fine (see below). Could you provide some guidance?

pd.to_datetime has one advantage: it supports negative input, i.e. dates before 1970-01-01. That is also quite important for my use case.

timestamp = -445645400000000000L
pd.to_datetime(timestamp, unit = 'ns')
Timestamp('1955-11-18 01:36:40', tz=None)

datetime.datetime.utcfromtimestamp(timestamp/10**9)
Traceback (most recent call last):

  File "<ipython-input-9-99b040d30a3e>", line 1, in <module>
    datetime.datetime.utcfromtimestamp(timestamp/10**9)

ValueError: timestamp out of range for platform localtime()/gmtime() function
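
For reference, the standard library can usually handle negative epoch values if the arithmetic is done with a timedelta instead of utcfromtimestamp. The following is only a minimal sketch, written for Python 3 (so without the L suffix); the helper name utc_from_ns is hypothetical, and the result is a naive datetime interpreted as UTC.

```python
import datetime

# Naive datetime representing the Unix epoch in UTC.
EPOCH = datetime.datetime(1970, 1, 1)

def utc_from_ns(timestamp_ns):
    """Convert integer nanoseconds since the epoch to a naive UTC datetime.

    timedelta has microsecond resolution, so digits below one
    microsecond are dropped by the floor division.
    """
    return EPOCH + datetime.timedelta(microseconds=timestamp_ns // 1000)

after_epoch = utc_from_ns(623289600000000000)
before_epoch = utc_from_ns(-445645400000000000)
```

Because this is plain date arithmetic, it does not go through the platform's gmtime() function, which is where the ValueError above comes from.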

I use Python 2.7.5 and Pandas 0.12.0 on Windows 7.

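3 Answers

to_datetime tries several ways of parsing its argument in order to work out what kind of timestamp it holds. It is useful for converting strings that represent datetimes into Timestamp objects.

If the data you are working with is already an integer timestamp, you can build the object by calling the Timestamp constructor directly: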

pd.Timestamp(timestamp)
Out[51]: Timestamp('1989-10-02 00:00:00', tz=None)

%timeit pd.Timestamp(timestamp)
100000 loops, best of 3: 1.96 µs per loop

It also works for negative values:

pd.Timestamp(-445645400000000000L)
Out[54]: Timestamp('1955-11-18 01:36:40', tz=None)
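
As a small illustration of the point above, a bare integer passed to pd.Timestamp is read as nanoseconds since the epoch, and the unit argument covers other resolutions. This is a sketch assuming Python 3 and a reasonably recent pandas; the variable names are only for illustration.

```python
import pandas as pd

ts_ns = 623289600000000000  # nanoseconds since 1970-01-01

# An integer with no unit is interpreted as nanoseconds.
from_ns = pd.Timestamp(ts_ns)

# The same instant expressed in whole seconds, with an explicit unit.
from_s = pd.Timestamp(ts_ns // 10**9, unit="s")

# Negative values denote instants before the epoch.
before_epoch = pd.Timestamp(-445645400000000000)
```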
Answered 2013-10-31T11:20:34.967

If you have many repeated datetime values to convert, using the following function for date parsing in pandas makes things very fast.

Benchmark:

$ python date-parse.py
to_datetime: 5799 ms
dateutil:    5162 ms
strptime:    1651 ms
manual:       242 ms
lookup:        32 ms

def lookup(s):
    """
    This is an extremely fast approach to datetime parsing.
    For large data, the same dates are often repeated. Rather than
    re-parse these, we store all unique dates, parse them, and
    use a lookup to convert all dates.
    """
    dates = {date:pd.to_datetime(date) for date in s.unique()}
    return s.apply(lambda v: dates[v])
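
A usage sketch for the lookup idea, assuming Python 3 and a Series of date strings with few distinct values. The sample data is made up, and Series.map with a dict is swapped in for the apply/lambda call, which should give the same result.

```python
import pandas as pd

def lookup(s):
    # Parse each distinct string only once...
    parsed = {value: pd.to_datetime(value) for value in s.unique()}
    # ...then map every row to its already-parsed Timestamp.
    return s.map(parsed)

# Hypothetical sample data: 3000 rows but only two distinct dates.
raw = pd.Series(["2013-10-31", "2015-04-25", "2013-10-31"] * 1000)
converted = lookup(raw)
```

The gain depends entirely on how often values repeat; with mostly unique values the dictionary adds overhead and saves nothing.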

Also, the source.

Answered 2015-04-25T21:20:44.183

Converting a single timestamp is not a valid comparison; it is just a measure of function call overhead.

In [9]: arr = [timestamp] * 1000000

In [10]: %timeit pd.to_datetime(arr,unit='ns')
1 loops, best of 3: 234 ms per loop

In [12]: arr = (np.array(arr)/10**9).tolist()

In [13]: %timeit [ time.ctime(x) for x in arr ]
1 loops, best of 3: 1.6 s per loop

In [31]: f = datetime.datetime.utcfromtimestamp

In [32]: %timeit [ f(x) for x in arr ]
1 loops, best of 3: 643 ms per loop

Clearly, when applied to a non-trivial dataset, the vectorized approach is much faster.

Answered 2013-10-31T11:24:49.953
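
A minimal sketch of the vectorized call on a whole sequence of integer nanosecond timestamps, assuming Python 3 and a recent pandas. The two sample values are the ones from the question; a single call converts all of them, including the pre-1970 one.

```python
import pandas as pd

# Integer nanoseconds since the epoch; the second value is before 1970.
timestamps_ns = [623289600000000000, -445645400000000000]

# One call converts the whole sequence and returns a DatetimeIndex.
index = pd.to_datetime(timestamps_ns, unit="ns")

first = index[0]
second = index[1]
```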