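I have a DataFrame with a DatetimeIndex, and I want to find the maximum element of each rolling window. I also need to know the index of that element. Example data: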

import pandas as pd

data = pd.DataFrame(
    index=pd.date_range(start=pd.to_datetime('2010-10-10 12:00:00'),
                        periods=10, freq='H'),
    data={'value': [3, 2, 1, 0, 5, 1, 1, 1, 1, 1]}
)

If I use a rolling max, I lose the index:

data.rolling(3).max()

Out:

                     value
2010-10-10 12:00:00    NaN
2010-10-10 13:00:00    NaN
2010-10-10 14:00:00    3.0
2010-10-10 15:00:00    2.0
2010-10-10 16:00:00    5.0
2010-10-10 17:00:00    5.0
2010-10-10 18:00:00    5.0
2010-10-10 19:00:00    1.0
2010-10-10 20:00:00    1.0
2010-10-10 21:00:00    1.0

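If I try argmax, I get the position of the maximum as an integer within each window (but I need the source datetime index, or the integer position in the source DataFrame so that I can look the rows up with iloc):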

data.rolling(3).apply(lambda x: x.argmax())

Out:

                     value
2010-10-10 12:00:00    NaN
2010-10-10 13:00:00    NaN
2010-10-10 14:00:00    0.0
2010-10-10 15:00:00    0.0
2010-10-10 16:00:00    2.0
2010-10-10 17:00:00    1.0
2010-10-10 18:00:00    0.0
2010-10-10 19:00:00    0.0
2010-10-10 20:00:00    0.0
2010-10-10 21:00:00    0.0

Can anyone help me find a suitable function or parameter for this in pandas?

Of course, I can use a for loop like this:

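window = 3            # window size, as in rolling(3) above
target_var = 'value'  # column to search for the maximum

pd.DataFrame([{'value_max': data[ind: ind + window][target_var].max(),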
               'source_index': data[ind: ind + window].index[data[ind: ind + window][target_var].values.argmax()]
              } for ind in range(1, len(data) + 1 - window)],
             index=data.index[1:-window+1])

It works, but I would like to find a more elegant solution with pandas.

Desired output:

                           source_index  value_max
2010-10-10 13:00:00 2010-10-10 13:00:00          2
2010-10-10 14:00:00 2010-10-10 16:00:00          5
2010-10-10 15:00:00 2010-10-10 16:00:00          5
2010-10-10 16:00:00 2010-10-10 16:00:00          5
2010-10-10 17:00:00 2010-10-10 17:00:00          1
2010-10-10 18:00:00 2010-10-10 18:00:00          1
2010-10-10 19:00:00 2010-10-10 19:00:00          1
1 Answer

Use Rolling.agg with a custom function, because idxmax is not implemented for rolling objects yet:

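import numpy as np
import pandas as pd

def idx(x):
    # Index label of the largest value in the window
    return x.index.values[np.argmax(x.values)]

df = data['value'].rolling(3).agg(['max', idx])
# Rolling results are float64, so convert the idx column back to timestamps
df['idx'] = pd.to_datetime(df['idx'])
print(df)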
                     max                 idx
2010-10-10 12:00:00  NaN                 NaT
2010-10-10 13:00:00  NaN                 NaT
2010-10-10 14:00:00  3.0 2010-10-10 12:00:00
2010-10-10 15:00:00  2.0 2010-10-10 13:00:00
2010-10-10 16:00:00  5.0 2010-10-10 16:00:00
2010-10-10 17:00:00  5.0 2010-10-10 16:00:00
2010-10-10 18:00:00  5.0 2010-10-10 16:00:00
2010-10-10 19:00:00  1.0 2010-10-10 17:00:00
2010-10-10 20:00:00  1.0 2010-10-10 18:00:00
2010-10-10 21:00:00  1.0 2010-10-10 19:00:00
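
The same table can also be built without passing timestamps through the float-valued rolling machinery. The following is only a sketch of one possible alternative, not part of the original answer: it takes the position of the maximum inside each window (the integers the question already obtained with argmax) and adds the position where that window starts. The names `window`, `local`, `end_pos`, `start_pos`, `global_pos` and `result` are illustrative, and `freq=pd.Timedelta(hours=1)` is used here in place of the `'H'` alias.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    index=pd.date_range(start="2010-10-10 12:00:00", periods=10,
                        freq=pd.Timedelta(hours=1)),
    data={"value": [3, 2, 1, 0, 5, 1, 1, 1, 1, 1]},
)

window = 3  # same window size as rolling(3) above
rolled = data["value"].rolling(window)

# Position of the maximum inside each window (0 .. window-1);
# NaN for the leading rows where the window is still incomplete.
local = rolled.apply(np.argmax, raw=True)

valid = local.notna().to_numpy()
end_pos = np.flatnonzero(valid)       # row position where each full window ends
start_pos = end_pos - (window - 1)    # row position where that window starts
global_pos = start_pos + local.to_numpy()[valid].astype(int)

result = pd.DataFrame(
    {"max": rolled.max().to_numpy()[valid], "idx": data.index[global_pos]},
    index=data.index[valid],
)
print(result)
```

Because `np.argmax` returns the first maximum, ties resolve to the earliest timestamp in the window, which agrees with the output shown above.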

Thanks to @Sandeep Kadapa for improving the solution:

def idx(x):
    return x.idxmax().to_datetime64()
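
Note that rolling(3) labels each window by its last row, while the desired output in the question labels each window by its first row. As a hedged sketch of how forward-looking windows could be handled (an assumption on my part, not something the original answer covers), NumPy's `sliding_window_view`, available from NumPy 1.20, gives every full window as a row of a 2-D view. The variable names below are illustrative.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

data = pd.DataFrame(
    index=pd.date_range(start="2010-10-10 12:00:00", periods=10,
                        freq=pd.Timedelta(hours=1)),
    data={"value": [3, 2, 1, 0, 5, 1, 1, 1, 1, 1]},
)

window = 3
values = data["value"].to_numpy()

# One row per full window; row i covers positions i .. i + window - 1.
windows = sliding_window_view(values, window)
starts = np.arange(len(windows))
local = windows.argmax(axis=1)        # position of the max inside each window

result = pd.DataFrame(
    {"source_index": data.index[starts + local],
     "value_max": windows.max(axis=1)},
    index=data.index[starts],         # label each window by its first row
)
print(result)
```

The loop in the question starts at position 1, so its output corresponds to `result.iloc[1:]` here; this sketch also keeps the window that starts at the first row.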
answered 2018-12-28T09:48:42.233