python - Regularise irregular time series with linear interpolation in Python

Question

I have a time series in pandas that looks like this:

                     Values
1992-08-27 07:46:48    28.0  
1992-08-27 08:00:48    28.2  
1992-08-27 08:33:48    28.4  
1992-08-27 08:43:48    28.8  
1992-08-27 08:48:48    29.0  
1992-08-27 08:51:48    29.2  
1992-08-27 08:53:48    29.6  
1992-08-27 08:56:48    29.8  
1992-08-27 09:03:48    30.0

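I want to resample it to a regular time series with a 15-minute time step, where the values are linearly interpolated. Basically I would like to get: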

                     Values
1992-08-27 08:00:00    28.2  
1992-08-27 08:15:00    28.3  
1992-08-27 08:30:00    28.4  
1992-08-27 08:45:00    28.8  
1992-08-27 09:00:00    29.9

But using the resample method from pandas (df.resample('15Min')) I get:

                     Values
1992-08-27 08:00:00   28.20  
1992-08-27 08:15:00     NaN  
1992-08-27 08:30:00   28.60  
1992-08-27 08:45:00   29.40  
1992-08-27 09:00:00   30.00

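I have tried the resample method with different 'how' and 'fill_method' parameters, but never got exactly the results I wanted. Am I using the wrong method?

I figure this is a fairly simple query, but I have searched the web for a while and could not find an answer.

Thanks in advance for any help I can get.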
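
A note on where the resample numbers in the question come from: they are consistent with each 15-minute bin being replaced by the mean of the readings that fall inside it, rather than with any interpolation. The sketch below assumes that behaviour and rebuilds the question's series by hand (the names `s` and `binned` are illustrative). The `.mean()` call is spelled out because current pandas expects an explicit aggregation after `resample`.

```python
import pandas as pd

# The irregular readings from the question.
s = pd.Series(
    [28.0, 28.2, 28.4, 28.8, 29.0, 29.2, 29.6, 29.8, 30.0],
    index=pd.to_datetime([
        "1992-08-27 07:46:48", "1992-08-27 08:00:48", "1992-08-27 08:33:48",
        "1992-08-27 08:43:48", "1992-08-27 08:48:48", "1992-08-27 08:51:48",
        "1992-08-27 08:53:48", "1992-08-27 08:56:48", "1992-08-27 09:03:48",
    ]),
    name="Values",
)

# Aggregate each 15-minute bin by its mean; bins without readings become NaN.
binned = s.resample("15min").mean()
print(binned)
```

The 08:30 bin holds 28.4 and 28.8, whose mean is the 28.6 shown in the question, and the 08:15 bin is empty, hence the NaN.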

score 19 · Accepted Answer

You can do this with traces. First, create a TimeSeries with your irregular measurements, as you would a dictionary:

ts = traces.TimeSeries([
    (datetime(1992, 8, 27, 7, 46, 48), 28.0),
    (datetime(1992, 8, 27, 8, 0, 48), 28.2),
    ...
    (datetime(1992, 8, 27, 9, 3, 48), 30.0),
])

Then regularise it using the sample method:

ts.sample(
    sampling_period=timedelta(minutes=15),
    start=datetime(1992, 8, 27, 8),
    end=datetime(1992, 8, 27, 9),
    interpolate='linear',
)

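This results in the following regularised version, where the grey dots are the original data and the orange line is the regularised version with linear interpolation.

The interpolated values are: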

1992-08-27 08:00:00    28.189 
1992-08-27 08:15:00    28.286  
1992-08-27 08:30:00    28.377
1992-08-27 08:45:00    28.848
1992-08-27 09:00:00    29.891
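
To make the arithmetic behind these values explicit, here is a minimal standard-library sketch of time-weighted linear interpolation. It is not how traces is implemented. The helper names `interpolate_at` and `regularise` are made up for this example, and including the end point is an assumption chosen to match the five rows above.

```python
from bisect import bisect_left
from datetime import datetime, timedelta

# Measurements from the question, sorted by time.
points = [
    (datetime(1992, 8, 27, 7, 46, 48), 28.0),
    (datetime(1992, 8, 27, 8, 0, 48), 28.2),
    (datetime(1992, 8, 27, 8, 33, 48), 28.4),
    (datetime(1992, 8, 27, 8, 43, 48), 28.8),
    (datetime(1992, 8, 27, 8, 48, 48), 29.0),
    (datetime(1992, 8, 27, 8, 51, 48), 29.2),
    (datetime(1992, 8, 27, 8, 53, 48), 29.6),
    (datetime(1992, 8, 27, 8, 56, 48), 29.8),
    (datetime(1992, 8, 27, 9, 3, 48), 30.0),
]


def interpolate_at(points, t):
    """Linearly interpolate the value at time t between its two neighbours."""
    times = [p[0] for p in points]
    i = bisect_left(times, t)
    if i < len(times) and times[i] == t:
        return points[i][1]  # t coincides with a measurement
    if i == 0 or i == len(times):
        raise ValueError("t lies outside the measured range")
    (t0, v0), (t1, v1) = points[i - 1], points[i]
    frac = (t - t0) / (t1 - t0)  # timedelta / timedelta gives a float
    return v0 + frac * (v1 - v0)


def regularise(points, start, end, step):
    """Sample the series every `step` from start to end (both inclusive)."""
    out = []
    t = start
    while t <= end:
        out.append((t, interpolate_at(points, t)))
        t += step
    return out


regular = regularise(
    points,
    start=datetime(1992, 8, 27, 8),
    end=datetime(1992, 8, 27, 9),
    step=timedelta(minutes=15),
)
for t, v in regular:
    print(t, round(v, 3))
```

For instance, 08:00:00 lies 792 of the 840 seconds between the first two readings, so the value is 28.0 + 0.2 * 792 / 840, roughly 28.189.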

score 14 · Accepted Answer

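The same result that @mstringer obtained can be achieved purely in pandas. The trick is to first resample to one-second resolution, using interpolation to fill in the intermediate values (.resample('s').interpolate()), and then downsample to 15-minute periods (.resample('15T').asfreq(); '15min' is the equivalent alias in newer pandas versions).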

import io
import pandas as pd

data = io.StringIO('''\
Values
1992-08-27 07:46:48,28.0  
1992-08-27 08:00:48,28.2  
1992-08-27 08:33:48,28.4  
1992-08-27 08:43:48,28.8  
1992-08-27 08:48:48,29.0  
1992-08-27 08:51:48,29.2  
1992-08-27 08:53:48,29.6  
1992-08-27 08:56:48,29.8  
1992-08-27 09:03:48,30.0
''')
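# read_csv(..., squeeze=True) was removed in pandas 2.0; squeeze the single column instead
s = pd.read_csv(data).squeeze('columns')
s.index = pd.to_datetime(s.index)

# '15min' is the current spelling of the older '15T' alias
res = s.resample('s').interpolate().resample('15min').asfreq().dropna()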
print(res)

Output:

1992-08-27 08:00:00    28.188571
1992-08-27 08:15:00    28.286061
1992-08-27 08:30:00    28.376970
1992-08-27 08:45:00    28.848000
1992-08-27 09:00:00    29.891429
Freq: 15T, Name: Values, dtype: float64
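
One thing worth knowing about this approach, shown as a rough sketch: the intermediate per-second series has one row for every second between the first and last reading, and the 15-minute marks are picked up because they land exactly on that one-second grid. The series is rebuilt by hand here, and the names `fine` and `res` are just illustrative.

```python
import pandas as pd

s = pd.Series(
    [28.0, 28.2, 28.4, 28.8, 29.0, 29.2, 29.6, 29.8, 30.0],
    index=pd.to_datetime([
        "1992-08-27 07:46:48", "1992-08-27 08:00:48", "1992-08-27 08:33:48",
        "1992-08-27 08:43:48", "1992-08-27 08:48:48", "1992-08-27 08:51:48",
        "1992-08-27 08:53:48", "1992-08-27 08:56:48", "1992-08-27 09:03:48",
    ]),
    name="Values",
)

# Step 1: one row per second, with the gaps filled by linear interpolation.
fine = s.resample("s").interpolate()
print(len(s), "readings ->", len(fine), "per-second rows")

# Step 2: keep only the rows that sit on 15-minute marks.
res = fine.resample("15min").asfreq().dropna()
print(res)
```

For a recording that spans days rather than an hour, the intermediate series grows accordingly, and readings that do not fall on whole seconds may need a finer intermediate grid.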

score 9 · Accepted Answer

It takes a bit of work, but try this out. The basic idea is to find the two timestamps closest to each resample point and interpolate between them. np.searchsorted is used to find the dates closest to each resample point.

# assumes df is the question's DataFrame: a DatetimeIndex and a 'Values' column
import numpy as np
import pandas as pd

# empty frame with desired index
rs = pd.DataFrame(index=df.resample('15min').asfreq().iloc[1:].index)

# array of indexes corresponding with closest timestamp after resample
idx_after = np.searchsorted(df.index.values, rs.index.values)

# values and timestamp before/after resample
rs['after'] = df.loc[df.index[idx_after], 'Values'].values
rs['before'] = df.loc[df.index[idx_after - 1], 'Values'].values
rs['after_time'] = df.index[idx_after]
rs['before_time'] = df.index[idx_after - 1]

# calculate new weighted value: the closer the resample point is to a
# neighbour, the larger that neighbour's weight
rs['span'] = (rs['after_time'] - rs['before_time'])
rs['before_weight'] = (rs['after_time'] - rs.index) / rs['span']
# I got errors here unless I turn the index to a series
rs['after_weight'] = (pd.Series(data=rs.index, index=rs.index) - rs['before_time']) / rs['span']

rs['Values'] = rs.eval('before * before_weight + after * after_weight')

After all that, hopefully the right answer:

In [161]: rs['Values']
Out[161]: 
1992-08-27 08:00:00    28.188571
1992-08-27 08:15:00    28.286061
1992-08-27 08:30:00    28.376970
1992-08-27 08:45:00    28.848000
1992-08-27 09:00:00    29.891429
Freq: 15T, Name: Values, dtype: float64
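
Weighting the two neighbouring readings by their distance in time is ordinary linear interpolation, so as a cross-check the same calculation can be handed to np.interp once the timestamps are expressed as plain numbers. This is a sketch rather than the answer's method: the data is retyped from the question, and `times`, `targets`, `x_known` and `x_new` are names made up for the example.

```python
import numpy as np
import pandas as pd

times = pd.to_datetime([
    "1992-08-27 07:46:48", "1992-08-27 08:00:48", "1992-08-27 08:33:48",
    "1992-08-27 08:43:48", "1992-08-27 08:48:48", "1992-08-27 08:51:48",
    "1992-08-27 08:53:48", "1992-08-27 08:56:48", "1992-08-27 09:03:48",
])
values = np.array([28.0, 28.2, 28.4, 28.8, 29.0, 29.2, 29.6, 29.8, 30.0])
targets = pd.date_range("1992-08-27 08:00:00", periods=5, freq="15min")

# np.interp works on numbers, so measure every timestamp in seconds
# elapsed since the first reading.
x_known = (times - times[0]).total_seconds()
x_new = (targets - times[0]).total_seconds()

result = pd.Series(np.interp(x_new, x_known, values), index=targets, name="Values")
print(result)
```

np.interp expects the known x values to be increasing, which holds here because the readings are already in time order.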

score 3 · Accepted Answer

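I recently had to resample acceleration data that was non-uniformly sampled. It was generally sampled at the correct frequency, but intermittent delays accumulated over time.

I found this question and combined mstringer's and Alberto Garcia-Rabosco's answers using pure pandas and numpy. This method creates a new index at the desired frequency and then interpolates directly, without the intermediate step of interpolating at a higher frequency.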

# from Alberto Garcia-Rabosco above
import io
import pandas as pd

data = io.StringIO('''\
Values
1992-08-27 07:46:48,28.0  
1992-08-27 08:00:48,28.2  
1992-08-27 08:33:48,28.4  
1992-08-27 08:43:48,28.8  
1992-08-27 08:48:48,29.0  
1992-08-27 08:51:48,29.2  
1992-08-27 08:53:48,29.6  
1992-08-27 08:56:48,29.8  
1992-08-27 09:03:48,30.0
''')
# read_csv(..., squeeze=True) was removed in pandas 2.0; squeeze the single column instead
s = pd.read_csv(data).squeeze('columns')
s.index = pd.to_datetime(s.index)

The code that does the interpolation:

import numpy as np
# create the new index and a new series full of NaNs
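# pd.DatetimeIndex(start=..., freq=..., periods=...) was removed in pandas 1.0;
# pd.date_range builds the same regular index
new_index = pd.date_range(start='1992-08-27 08:00:00',
    freq='15min', periods=5)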
new_series = pd.Series(np.nan, index=new_index)

# concat the old and new series and remove duplicates (if any) 
comb_series = pd.concat([s, new_series])
comb_series = comb_series[~comb_series.index.duplicated(keep='first')]

# interpolate to fill the NaNs
comb_series.interpolate(method='time', inplace=True)

Output:

>>> print(comb_series[new_index])
1992-08-27 08:00:00    28.188571
1992-08-27 08:15:00    28.286061
1992-08-27 08:30:00    28.376970
1992-08-27 08:45:00    28.848000
1992-08-27 09:00:00    29.891429
Freq: 15T, dtype: float64

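As before, you can use any interpolation method that scipy supports, and this technique works with DataFrames as well (that is what I originally used it for). Finally, note that interpolate defaults to the 'linear' method, which ignores the time information in the index and will not work with non-uniformly spaced data.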
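
The difference between the two methods is easy to see on the question's data. The following is a small sketch, with the series retyped by hand and `combined`, `by_position` and `by_time` as illustrative names. It uses `reindex` on the union of both indexes, a swapped-in alternative to the concat step above that also keeps the rows in time order.

```python
import pandas as pd

s = pd.Series(
    [28.0, 28.2, 28.4, 28.8, 29.0, 29.2, 29.6, 29.8, 30.0],
    index=pd.to_datetime([
        "1992-08-27 07:46:48", "1992-08-27 08:00:48", "1992-08-27 08:33:48",
        "1992-08-27 08:43:48", "1992-08-27 08:48:48", "1992-08-27 08:51:48",
        "1992-08-27 08:53:48", "1992-08-27 08:56:48", "1992-08-27 09:03:48",
    ]),
    name="Values",
)
new_index = pd.date_range("1992-08-27 08:00:00", periods=5, freq="15min")

# Add the target timestamps as NaN placeholders, sorted in with the readings.
combined = s.reindex(s.index.union(new_index))

# 'linear' treats rows as equally spaced; 'time' weights by the actual gaps.
by_position = combined.interpolate(method="linear")[new_index]
by_time = combined.interpolate(method="time")[new_index]
print(pd.DataFrame({"linear": by_position, "time": by_time}))
```

At 08:00:00 the 'linear' column gives 28.1, the midpoint of its two neighbours, while 'time' gives about 28.189 because the reading at 08:00:48 is much closer than the one at 07:46:48.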
