pandas - How can I wrap the pandas resample method?

Question

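I have a recurring pandas problem that I would like to solve by wrapping the .resample method. I just can't figure out how.

Background (optional)

I have timezone-aware time series, for example: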

s = pd.Series([5,19,-4], pd.date_range('2020-10-01', freq='D', periods=3, tz='Europe/Berlin', name='ts_left'))

s

ts_left
2020-10-01 00:00:00+02:00    5
2020-10-02 00:00:00+02:00   19
2020-10-03 00:00:00+02:00   -4
Freq: D, dtype: int64

I want to resample to hours. If I just use s.resample('H').sum(), the final 23 hours are dropped (this is also addressed in this question):

s.resample('H').sum()

ts_left
2020-10-01 00:00:00+02:00    5
2020-10-01 01:00:00+02:00    0
...
2020-10-01 23:00:00+02:00    0
2020-10-02 00:00:00+02:00   19
2020-10-02 01:00:00+02:00    0
...
2020-10-02 23:00:00+02:00    0
2020-10-03 00:00:00+02:00   -4
Freq: H, Length: 49, dtype: int64

Current "solution"

I wrote a custom resample2 function to correct this:

def resample2(df, freq, func):
    if type(df.index) != pd.DatetimeIndex:
        return df.resample(freq).apply(func)
    else: 
        #add one row
        idx = [df.index[-1] + df.index.freq]
        if type(df) == pd.DataFrame:
            df = df.append(pd.DataFrame([[None] * len(df.columns)], idx))
        elif type(df) == pd.Series:
            df = df.append(pd.Series([None], idx))
        df = df.resample(freq).apply(func)
        return df.iloc[:-1] #remove one row

This works:

resample2(s, 'H', np.sum)

2020-10-01 00:00:00+02:00    5
2020-10-01 01:00:00+02:00    0
...
2020-10-01 23:00:00+02:00    0
2020-10-02 00:00:00+02:00   19
2020-10-02 01:00:00+02:00    0
...
2020-10-02 23:00:00+02:00    0
2020-10-03 00:00:00+02:00   -4
2020-10-03 01:00:00+02:00    0
...
2020-10-03 23:00:00+02:00    0
Freq: H, Length: 72, dtype: int64

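But there are two problems:

The usage differs considerably from the standard usage (resample2(s, 'H', np.sum) vs s.resample('H').sum()), and
I can no longer use all the functions I could use before. For example, resample2(s, 'H', s.resample.ffill) raises an error.

Question

Is there a way to wrap the DataFrame.resample and Series.resample methods so that they keep working as usual, but with the "append one row before resampling, drop the last row after resampling" behaviour shown in my resample2 function?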

score 0 · Accepted Answer

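Problem 1 (the usage differs considerably from the standard usage):

Short of customizing your pandas package locally, I think what you are doing is close to the best you can do. I don't know of any argument to resample that allows this, and I'm not sure how to customize existing methods of DataFrame / Series.

But there may be a way to turn your function into more of a helper that pre-processes or post-processes the data around the resampling step. Here is another implementation of your function: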

def allday_resample(df, freq, func):
    df = df.copy()
    begin = df.index.min().floor('D')
    end = df.index.max().ceil('D')
    if end == df.index.max():
        end += pd.offsets.Day(1)

    if begin not in df.index:
        df.loc[begin] = np.nan
    if end not in df.index:
        df.loc[end] = np.nan

    r = df.resample(freq).apply(func)
    return r[(r.index >= begin) &
             (r.index < end)]

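This is very similar to your resample2, but with a few changes (improvements?):

With df = df.copy(), it is clear that we return a new object rather than modifying the original data passed in (this can be changed)
It handles Series and DataFrame the same way (so no if-else is needed)
It gives full values for the start day and the end day. I saw that resample2 does not do this if your start/end timestamps are not at midnight (which may be moot if your data is always at midnight). See this example: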

# now starting at 10:00
>>> s = pd.Series([5,19,-4], pd.date_range('2020-10-01 10:00', freq='D', periods=3, tz='Europe/Berlin', name='ts_left'))
>>> resample2(s, 'H', np.sum)

2020-10-01 10:00:00+02:00     5
2020-10-01 11:00:00+02:00     5
2020-10-01 12:00:00+02:00     5
2020-10-01 13:00:00+02:00     5
2020-10-01 14:00:00+02:00     5
                             ..
2020-10-04 05:00:00+02:00    -4
2020-10-04 06:00:00+02:00    -4
2020-10-04 07:00:00+02:00    -4
2020-10-04 08:00:00+02:00    -4
2020-10-04 09:00:00+02:00    -4
Freq: H, Length: 72, dtype: object

# missing timestamps for Oct 1st, and timestamps carried over into Oct 4th despite no original data on that day

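I called it allday_resample because it ensures that the start day, the end day, and all days in between are filled out at the input freq. This may get more complicated if you want to resample to minutes but only pad the data out to the hour (you would need to choose a hierarchy of time-frequency offsets). But for now I am assuming you only care about taking daily data and resampling it hourly.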

>>> s = pd.Series([5,19,-4], pd.date_range('2020-10-01', freq='D', periods=3, tz='Europe/Berlin', name='ts_left'))
>>> allday_resample(s, 'H', np.sum)
ts_left
2020-10-01 00:00:00+02:00    5.0
2020-10-01 01:00:00+02:00    0.0
2020-10-01 02:00:00+02:00    0.0
2020-10-01 03:00:00+02:00    0.0
2020-10-01 04:00:00+02:00    0.0

2020-10-03 19:00:00+02:00    0.0
2020-10-03 20:00:00+02:00    0.0
2020-10-03 21:00:00+02:00    0.0
2020-10-03 22:00:00+02:00    0.0
2020-10-03 23:00:00+02:00    0.0
Freq: H, Length: 72, dtype: float64
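For the more general case mentioned above (resampling to minutes while padding only to the hour), here is a minimal sketch that makes the padding unit a parameter. The names padded_resample and pad_to are hypothetical and not part of pandas, and the code assumes the padding unit is a fixed-length frequency such as 'D' or '60min' and that no DST transition falls at the end of the data.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset


def padded_resample(obj, freq, func, pad_to="D"):
    """Resample obj to freq so that whole pad_to periods are covered at both ends.

    Hypothetical helper: pad_to is the coarser unit to pad to, freq the finer
    unit to resample to.
    """
    begin = obj.index.min().floor(pad_to)
    # First pad_to boundary strictly after the last timestamp.
    end = obj.index.max().floor(pad_to) + to_offset(pad_to)
    # Add NaN rows at both boundaries without touching the caller's object.
    extra = pd.DatetimeIndex([begin, end], name=obj.index.name)
    padded = obj.reindex(obj.index.union(extra))
    out = padded.resample(freq).apply(func)
    # Drop the helper row at the end boundary.
    return out[(out.index >= begin) & (out.index < end)]


# Hourly data resampled to 15 minutes, padded only to the full hour.
h = pd.Series([4, 8], pd.date_range("2020-10-01 10:00", freq="60min", periods=2, tz="Europe/Berlin"))
quarter_hourly = padded_resample(h, "15min", "sum", pad_to="60min")

# Daily data resampled to hours, padded to full days ('60min' is used instead
# of 'H' because newer pandas versions deprecate the 'H' alias).
s = pd.Series([5, 19, -4], pd.date_range("2020-10-01", freq="D", periods=3, tz="Europe/Berlin", name="ts_left"))
hourly = padded_resample(s, "60min", "sum")
```

Because the padded rows are NaN, integer input comes back as float, just as with allday_resample.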

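But we can move the padding steps of allday_resample into a function that edits the data before resampling, so that an ordinary resample call then gives the same output: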

def preprocess(df):
    begin = df.index.min().floor('D')
    end = df.index.max().ceil('D')
    if end == df.index.max():
        end += pd.offsets.Day(1) - pd.Timedelta('1s')
    if begin not in df.index:
        df.loc[begin] = np.nan
    if end not in df.index:
        df.loc[end] = np.nan

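Here, the data passed in is modified in place (and the function returns nothing). There is also a small extra step that subtracts 1 second (an arbitrary small increment) from the ceiling of the end date, so that we don't include any data for the following day when we resample.

With this function, you can do: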

>>> preprocess(s)
>>> s.resample('H').sum()

ts_left
2020-10-01 00:00:00+02:00    5.0
2020-10-01 01:00:00+02:00    0.0
2020-10-01 02:00:00+02:00    0.0
2020-10-01 03:00:00+02:00    0.0
2020-10-01 04:00:00+02:00    0.0

2020-10-03 19:00:00+02:00    0.0
2020-10-03 20:00:00+02:00    0.0
2020-10-03 21:00:00+02:00    0.0
2020-10-03 22:00:00+02:00    0.0
2020-10-03 23:00:00+02:00    0.0
Freq: H, Length: 72, dtype: float64
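The padding and the resample call can also be folded into one helper that returns the Resampler object itself, so that .sum(), .ffill() and the other Resampler methods stay available with close to the standard usage. This is only a sketch: resample_full_days is a hypothetical name, the helper pads a copy rather than modifying its input, and it assumes a sorted, unique DatetimeIndex and padding to whole days.

```python
import pandas as pd


def resample_full_days(obj, freq):
    """Return a Resampler over a padded copy of obj (hypothetical helper)."""
    begin = obj.index.min().floor("D")
    # One second before the midnight that follows the last timestamp, so the
    # resampled result never reaches into the next day.
    end = obj.index.max().floor("D") + pd.DateOffset(days=1) - pd.Timedelta(seconds=1)
    extra = pd.DatetimeIndex([begin, end], name=obj.index.name)
    # reindex returns a new object; the added rows are NaN.
    padded = obj.reindex(obj.index.union(extra))
    return padded.resample(freq)


s = pd.Series([5, 19, -4], pd.date_range("2020-10-01", freq="D", periods=3, tz="Europe/Berlin", name="ts_left"))

# '60min' is used instead of 'H' because newer pandas versions deprecate 'H'.
hourly_sum = resample_full_days(s, "60min").sum()
hourly_ffill = resample_full_days(s, "60min").ffill()
```

The call site then reads resample_full_days(s, '60min').sum() instead of s.resample('60min').sum(), which is not a true wrapper of the method but keeps the rest of the chain unchanged.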

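Problem 2 (I can't use all the functions I could use before):

This one is less tricky - you can still access them by passing their string names instead of some other function (such as np.sum in your example). So for a forward fill, you can do the following (with resample2 as is):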

>>> resample2(s, 'H', 'ffill')
2020-10-01 00:00:00+02:00     5
2020-10-01 01:00:00+02:00     5
2020-10-01 02:00:00+02:00     5
2020-10-01 03:00:00+02:00     5
2020-10-01 04:00:00+02:00     5
                             ..
2020-10-03 19:00:00+02:00    -4
2020-10-03 20:00:00+02:00    -4
2020-10-03 21:00:00+02:00    -4
2020-10-03 22:00:00+02:00    -4
2020-10-03 23:00:00+02:00    -4
Freq: H, Length: 72, dtype: object

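By eye and from brief testing, x.resample().sum() and x.resample().apply('sum') are equivalent. See my question here and other people's answers to it. Also see Resampler.apply(). Above, wherever I used np.sum, I could have used 'sum'.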
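The equivalence claimed above can be checked directly. The following is a small sketch on the example series from the question, not an exhaustive test; '60min' stands in for 'H', which newer pandas versions deprecate.

```python
import pandas as pd

s = pd.Series([5, 19, -4], pd.date_range("2020-10-01", freq="D", periods=3, tz="Europe/Berlin", name="ts_left"))

# Same aggregation, once as a Resampler method and once by string name.
by_method = s.resample("60min").sum()
by_name = s.resample("60min").apply("sum")

# True when values, index and dtype all agree.
same = by_method.equals(by_name)
```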
