
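I have a large, chronologically sorted array of datetime.date objects. Many of the dates in the array are identical, but some dates are missing (this is a time series of "real data", so it is messy).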

I want to count how many data points there are for each date. Currently I do it like this:

import datetime as dt
import numpy as np

t = np.array([dt.date(2012,12,1) + dt.timedelta(n) for n in np.arange(0,31,0.25)])

Ndays = (t[-1] - t[0]).days

data_per_day = np.array([sum(t == t[0] + dt.timedelta(d)) for d in range(Ndays)])

But I find this is very slow! (More than 10 minutes for about 400,000 data points.) Is there a faster way?


3 Answers

2

Use np.datetime64. On the same data as @Hans, I get 241 ms.

In [1]: import numpy as np

In [2]: import datetime as dt

In [3]: t = np.array([dt.date(2012,12,1) + dt.timedelta(n)
                        for n in np.arange(0,31,0.00001)])

In [4]: t = t.astype(np.datetime64)

In [5]: daterange = np.arange(t[0], t[-1], dtype='datetime64[D]')

In [6]: np.bincount(daterange.searchsorted(t))
Out[6]: 
array([100000, 100000, 100000, 100000, 100000, 100000, 100000, 100000,
       100000, 100000, 100000, 100000, 100000, 100000, 100000, 100000,
       100000, 100000, 100000, 100000, 100000, 100000, 100000, 100000,
       100000, 100000, 100000, 100000, 100000, 100000, 100000])

In [7]: %timeit np.bincount(daterange.searchsorted(t))
1 loops, best of 3: 241 ms per loop
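
The same searchsorted/bincount idea can be checked on a small array where a day is actually missing. This is only a minimal sketch: the sample dates are made up, and the inclusive end point and the minlength argument are my own additions so that the last day and any empty days show up explicitly.

```python
import datetime as dt
import numpy as np

# Hypothetical sample: sorted dates with repeats and one missing day (Dec 3).
dates = ([dt.date(2012, 12, 1)] * 3
         + [dt.date(2012, 12, 2)]
         + [dt.date(2012, 12, 4)] * 2)
t = np.array(dates, dtype='datetime64[D]')

# Every calendar day from the first to the last date, end point included.
days = np.arange(t[0], t[-1] + np.timedelta64(1, 'D'), dtype='datetime64[D]')

# searchsorted maps each data point to the index of its day,
# bincount tallies how often each index occurs.
counts = np.bincount(days.searchsorted(t), minlength=len(days))
print(counts)  # [3 1 0 2]
```

The missing day comes out as a 0 in the result, so the output lines up one-to-one with the entries of days.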
answered 2013-05-30T12:44:11.820
0

This runs in a few seconds for 3,100,000 entries.

import datetime as dt
import numpy as np
from collections import Counter

t = np.array([dt.date(2012,12,1) + dt.timedelta(n) for n in np.arange(0,31,0.00001)])

c = Counter(t)
print(c)
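
A Counter only holds the dates that actually occur, so if a dense per-day array is needed (with zeros for the missing dates), something like the following might work. It is a rough sketch on made-up sample data; the names first, last and n_days are just for illustration.

```python
import datetime as dt
from collections import Counter

# Hypothetical sample with one missing day (Dec 3).
t = ([dt.date(2012, 12, 1)] * 3
     + [dt.date(2012, 12, 2)]
     + [dt.date(2012, 12, 4)] * 2)

c = Counter(t)
first, last = min(c), max(c)
n_days = (last - first).days + 1  # include the last day

# A Counter returns 0 for keys it has never seen,
# so missing days need no special handling.
data_per_day = [c[first + dt.timedelta(days=d)] for d in range(n_days)]
print(data_per_day)  # [3, 1, 0, 2]
```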
answered 2013-05-30T12:23:51.750
0

Here is a solution based on detecting the distances between unique dates:

# Get the unique day indexes of t
indexes = np.hstack(([-1], np.nonzero(np.diff(t))[0], [len(t)-1]))
# Determine how many data points are for that day
lengths = np.diff(indexes)

# Pull out the actual dates for the new days
dates = t[indexes[:-1] +1]
# Convert them to indexes (or day offsets)
as_int = np.vectorize(lambda d : d.day)(dates) -1

# Make a np array of these lengths
data_per_day = np.zeros((Ndays + 1,), int)
data_per_day[as_int] = lengths
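
Note that d.day is the day of the month, so the snippet above relies on all data falling within a single month that starts on the 1st. For data that crosses a month boundary, a variant along the following lines might be an option. It swaps the np.diff bookkeeping for np.unique with return_counts on a datetime64 array, which is a different technique from the one above; the sample dates are made up.

```python
import datetime as dt
import numpy as np

# Hypothetical sample crossing a month boundary, with Dec 1 missing.
dates = ([dt.date(2012, 11, 29)] * 2
         + [dt.date(2012, 11, 30)]
         + [dt.date(2012, 12, 2)] * 3)
t = np.array(dates, dtype='datetime64[D]')

# Distinct days and the number of data points on each of them.
unique_days, lengths = np.unique(t, return_counts=True)

# Offset of each distinct day from the first day, as plain integers.
offsets = (unique_days - unique_days[0]).astype(int)

# Dense array: one slot per calendar day, zero where no data exists.
data_per_day = np.zeros(offsets[-1] + 1, dtype=int)
data_per_day[offsets] = lengths
print(data_per_day)  # [2 1 0 3]
```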
answered 2013-05-30T12:43:57.497