This is a contrived example to keep the data generation easy, but the underlying problem should be relevant to a wide audience.

I have a time series of measurements like so:

In [1]: import pandas as pd
   ...: from numpy.random import randn

In [2]: index = pd.date_range(start="18:10",periods=20,freq='min')

In [3]: df = pd.DataFrame(randn(20,3),columns=list('abc'),index=index)

In [4]: df.head()
Out[4]: 
                            a         b         c
2013-02-27 18:10:00 -1.344753  0.438351  1.561849
2013-02-27 18:11:00  1.715643  1.601984 -0.027408
2013-02-27 18:12:00 -0.142264 -0.049462  0.482493
2013-02-27 18:13:00  0.132617  0.737902 -0.347620
2013-02-27 18:14:00  1.277257  0.083401  0.649422

In between the 'real' measurements, calibration measurements are taken, but much less frequently, e.g. something like this:

In [5]: calindex = pd.date_range("18:12:30",periods=4,freq='5min')

In [6]: caldata = pd.Series([10,20,30,40],index = calindex)

In [7]: caldata
Out[7]: 
2013-02-27 18:12:30    10
2013-02-27 18:17:30    20
2013-02-27 18:22:30    30
2013-02-27 18:27:30    40
Freq: 5T

The general idea now is to apply these calibration data to the measurements. To do this, I would like to distribute / broadcast the calibration data using a 'closest-time' approach: I want to generate another column, called 'offsets' for example, that holds for each measurement row the calibration value that was taken closest in time to that measurement.

Therefore I am after an end result like this:

In [14]: df
Out[14]: 
                            a         b         c  offsets
2013-02-27 18:10:00 -1.344753  0.438351  1.561849       10
2013-02-27 18:11:00  1.715643  1.601984 -0.027408       10
2013-02-27 18:12:00 -0.142264 -0.049462  0.482493       10
2013-02-27 18:13:00  0.132617  0.737902 -0.347620       10
2013-02-27 18:14:00  1.277257  0.083401  0.649422       10
2013-02-27 18:15:00  0.048120  0.421220  0.149372       20
2013-02-27 18:16:00  0.812317 -1.517389  2.035487       20
2013-02-27 18:17:00 -0.058959 -0.034876 -1.535118       20
2013-02-27 18:18:00 -0.666227  0.040208 -1.042464       20
2013-02-27 18:19:00 -0.077031 -0.158351 -0.441992       20
2013-02-27 18:20:00  0.103083 -0.129341  0.294073       30
2013-02-27 18:21:00  0.900802  0.443271 -0.946229       30
2013-02-27 18:22:00  0.744631 -0.058666 -0.386226       30
2013-02-27 18:23:00 -0.064313  0.500321 -0.536237       30
2013-02-27 18:24:00 -0.392653  0.789827  0.000109       30
2013-02-27 18:25:00  1.926765  0.252259 -0.051475       40
2013-02-27 18:26:00 -0.035577  0.559222 -0.290751       40
2013-02-27 18:27:00  1.726165  0.626515 -0.868177       40
2013-02-27 18:28:00  1.269409  1.520980 -0.181637       40
2013-02-27 18:29:00 -1.151166 -0.300196  0.420747       40

I believe I understand well how to put values into other columns via .map, .apply, etc. What I don't know how to start on is the time or offset trickery that seems to be needed to distribute the values.

Should it maybe be attacked with pandas.DateOffsets? Is there machinery to minimize time-deltas inside pandas somewhere?

I would appreciate a nudge in the right direction. It doesn't have to be a complete solution, just the direction I need to be going in.

1 Answer

I use numpy functions to find the position of the nearest calibration time:

import numpy as np
import pandas as pd
from numpy.random import randn

index = pd.date_range(start="18:10", periods=20, freq='min')
df = pd.DataFrame(randn(20, 3), columns=list('abc'), index=index)
calindex = pd.date_range("18:12:30", periods=4, freq='5min')
caldata = pd.Series([10, 20, 30, 40], index=calindex)

# if you use numpy 1.7 or later, the datetime64 values can be used directly
real_time = df.index.values
cali_time = caldata.index.values

# if you use numpy 1.6, use these two lines instead of the two above
# real_time = np.array(df.index.values.view("i8") / 1000, dtype="datetime64[us]")
# cali_time = np.array(caldata.index.values.view("i8") / 1000, dtype="datetime64[us]")

# position of the first calibration time at or after each measurement time
right_index = cali_time.searchsorted(real_time, side="left")
# neighbouring calibration positions on either side, clipped to the valid range
left_index = np.clip(right_index - 1, 0, len(caldata) - 1)
right_index = np.clip(right_index, 0, len(caldata) - 1)
# distance in time to each neighbour
left_diff = np.abs(cali_time[left_index] - real_time)
right_diff = np.abs(cali_time[right_index] - real_time)
# pick the closer neighbour; a tie goes to the later calibration
nearest = np.where(left_diff < right_diff, left_index, right_index)
df["offsets"] = caldata.values[nearest]
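
As a possible shortcut, more recent pandas versions can do the nearest-time lookup directly through reindex with method="nearest". The following is only a sketch under a few assumptions: a pandas version that supports method="nearest", a sorted calibration index, and an explicit date and a seeded random generator (both added here for reproducibility, not part of the original example).

```python
import numpy as np
import pandas as pd

# Same shape of data as in the question; the explicit date and the seed
# are assumptions added so that the example is reproducible.
index = pd.date_range(start="2013-02-27 18:10", periods=20, freq="min")
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((20, 3)), columns=list("abc"), index=index)

calindex = pd.date_range("2013-02-27 18:12:30", periods=4, freq="5min")
caldata = pd.Series([10, 20, 30, 40], index=calindex)

# For each measurement timestamp, take the calibration value whose
# timestamp is nearest. The calibration index must be sorted.
df["offsets"] = caldata.reindex(df.index, method="nearest")
```

Measurements that lie exactly halfway between two calibrations (18:15, 18:20 and 18:25 here) get the later calibration value, which matches the result asked for in the question. reindex also accepts a tolerance argument, in case measurements that are too far from any calibration should be left as NaN instead.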
answered 2013-02-28T04:54:41.187