I have two data sets that I am trying to compare. One is measured meteorological values, recorded approximately every 15 minutes but not at a consistent time each hour (e.g. 12:03, 1:05, 2:01). The other is modelled data for the location, exactly on the hour. I would like to extract the value from the measured data that occurs closest to each hour mark so I can join it with the modelled data.

I currently have both sets as DataFrames and have created an hourly time series to use as an index. Does anyone know of an easy way to align these without looping through all the data?

Thanks.

Using the df.resample('H', how='ohlc') method, I get the following error:

Traceback (most recent call last):
  File "<pyshell#81>", line 1, in <module>
    df.resample('H', how='ohlc')
  File "C:\Python33\lib\site-packages\pandas\core\generic.py", line 290, in resample
    return sampler.resample(self)
  File "C:\Python33\lib\site-packages\pandas\tseries\resample.py", line 83, in resample
    rs = self._resample_timestamps(obj)
  File "C:\Python33\lib\site-packages\pandas\tseries\resample.py", line 226, in _resample_timestamps
    result = grouped.aggregate(self._agg_method)
  File "C:\Python33\lib\site-packages\pandas\core\groupby.py", line 1695, in aggregate
    return getattr(self, arg)(*args, **kwargs)
  File "C:\Python33\lib\site-packages\pandas\core\groupby.py", line 427, in ohlc
    return self._cython_agg_general('ohlc')
  File "C:\Python33\lib\site-packages\pandas\core\groupby.py", line 1618, in _cython_agg_general
    new_blocks = self._cython_agg_blocks(how, numeric_only=numeric_only)
  File "C:\Python33\lib\site-packages\pandas\core\groupby.py", line 1656, in _cython_agg_blocks
    result, _ = self.grouper.aggregate(values, how, axis=agg_axis)
  File "C:\Python33\lib\site-packages\pandas\core\groupby.py", line 818, in aggregate
    raise NotImplementedError
NotImplementedError

A sample of my dataframe is shown below:

                              D
2008-01-01 00:01:00  274.261108
2008-01-01 00:11:00  273.705566
2008-01-01 00:31:00  273.705566
2008-01-01 00:41:00  273.705566
2008-01-01 01:01:00  273.705566
2008-01-01 01:11:00  273.705566
2008-01-01 01:31:00  273.705566
2008-01-01 01:41:00  273.705566
2008-01-01 02:01:00  273.705566
2008-01-01 02:11:00  273.149994

EDIT: It appears this may be an error specific to Python 3.3. Can anyone confirm this?

1 Answer

I think pandas.DataFrame.resample() is what you need. You can pick whichever resampling method you want; for example, look at 'ohlc':

>>> df = pd.DataFrame({'data':[1,4,3,2,7,3]}, index=pd.DatetimeIndex(['2013-11-05 12:03', '2013-11-05 12:14','2013-11-05 12:29','2013-11-05 12:46','2013-11-05 13:01','2013-11-05 13:16']))
>>> df.resample('H', how='ohlc')
                     data                  
                     open  high  low  close
2013-11-05 12:00:00     1     4    1      2
2013-11-05 13:00:00     7     7    3      3

After that, all you need to do is use pandas.DataFrame.join().
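
Note that 'ohlc' gives the first, highest, lowest and last value in each hour, not the one closest to the hour mark. If the nearest measurement is what you are after, one possible approach is to reindex the measured data onto the hourly index with method='nearest' and then join. This is only a sketch: the `modelled` frame, its `model` column and values, and the 10-minute tolerance are made up for illustration, and the measured index is assumed to be sorted.

```python
import pandas as pd

# Measured data at irregular times (a subset of the question's sample).
measured = pd.DataFrame(
    {"D": [274.261108, 273.705566, 273.705566, 273.705566, 273.705566, 273.149994]},
    index=pd.DatetimeIndex([
        "2008-01-01 00:01", "2008-01-01 00:41",
        "2008-01-01 01:01", "2008-01-01 01:41",
        "2008-01-01 02:01", "2008-01-01 02:11",
    ]),
)

# Hypothetical modelled data, exactly on the hour (values are invented).
hours = pd.date_range("2008-01-01 00:00", periods=3, freq=pd.Timedelta(hours=1))
modelled = pd.DataFrame({"model": [274.0, 273.8, 273.5]}, index=hours)

# For each hour mark, take the measured row with the nearest timestamp.
# Hours with no measurement within 10 minutes (an assumed limit) become NaN.
nearest = measured.reindex(
    hours, method="nearest", tolerance=pd.Timedelta(minutes=10)
)

# Both frames now share the hourly index, so a plain join lines them up.
combined = modelled.join(nearest)
print(combined)
```

The tolerance is optional, but without it an hour with a long gap in the measurements would silently pick up a value from far away.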

Update: strange, I tried it on your DataFrame:

>>> df = pd.DataFrame({'D':[274.261108,273.705566,273.705566,273.705566,273.705566,273.705566,273.705566,273.705566,273.705566,273.149994]})
>>> df.index = pd.DatetimeIndex(['2008.01.01 00:01:00','2008.01.01 00:11:00','2008.01.01 00:31:00','2008.01.01 00:41:00','2008.01.01 01:01:00','2008.01.01 01:11:00','2008.01.01 01:31:00','2008.01.01 01:41:00','2008.01.01 02:01:00','2008.01.01 02:11:00'])
>>> df.resample('H', how='ohlc')
                              D                                    
                           open        high         low       close
2008-01-01 00:00:00  274.261108  274.261108  273.705566  273.705566
2008-01-01 01:00:00  273.705566  273.705566  273.705566  273.705566
2008-01-01 02:00:00  273.705566  273.705566  273.149994  273.149994

Works fine.
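
The calls above use the how= keyword, which pandas accepted at the time of writing. Later pandas releases dropped that keyword in favor of calling the aggregation on the resampler object. A minimal sketch of the same hourly OHLC in that style, assuming a reasonably recent pandas (the hour is passed as a Timedelta here only to sidestep differences in frequency alias spelling between versions):

```python
import pandas as pd

# Same toy data as in the first example of this answer.
df = pd.DataFrame(
    {"data": [1, 4, 3, 2, 7, 3]},
    index=pd.DatetimeIndex([
        "2013-11-05 12:03", "2013-11-05 12:14", "2013-11-05 12:29",
        "2013-11-05 12:46", "2013-11-05 13:01", "2013-11-05 13:16",
    ]),
)

# Group into hourly bins, then compute open/high/low/close per bin.
ohlc = df.resample(pd.Timedelta(hours=1)).ohlc()
print(ohlc)
```

The result has the same two-level columns (data, then open/high/low/close) as the output shown earlier.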

answered 2013-11-05T18:43:00.140