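Suppose I have two columns of time-series data, "a" and "b", in a Pandas DataFrame. I want to create a third column indicating whether the difference between column "a" at the current period and column "b" at any of the next 5 periods has increased by 8 or more and then decreased by 2 or more. Ideally I would use some form of df.rolling(5).apply() with no loops, but I keep running into difficulties.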

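For demonstration I have written the logic out with a loop, but I would appreciate any guidance on how to do this more efficiently or more elegantly. In practice, both the DataFrame and the window will be much larger.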

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'b': [1, 0, 9, 0, 15, 0, 20, 15, 23, 6]})
df['c'] = 0

window = 5
positive_thresh = 8
negative_thresh = -2
num_rows = df.shape[0]

for a_idx in range(num_rows):
    a_start = df.iloc[a_idx, 0]
    # next `window` rows of "b", clipped at the end of the frame
    b_roll = df.iloc[(a_idx + 1):min(a_idx + 1 + window, num_rows), 1]
    deltas = b_roll - a_start
    positives = deltas[deltas >= positive_thresh]
    negatives = deltas[deltas <= negative_thresh]
    # index label of the first hit, or num_rows when there is none
    first_pos_idx = positives.index[0] if len(positives) > 0 else num_rows
    first_neg_idx = negatives.index[0] if len(negatives) > 0 else num_rows

    # flag the row when the first rise comes before the first fall
    if first_pos_idx < first_neg_idx:
        df.iloc[a_idx, 2] = 1

print(df)

    a   b  c
0   1   1  1
1   2   0  0
2   3   9  0
3   4   0  1
4   5  15  0
5   6   0  1
6   7  20  1
7   8  15  1
8   9  23  0
9  10   6  0

1 Answer

This is hard to handle with boolean masks alone, but here is one approach:

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

window = 5
n_rows = df.shape[0]

# Pad with `window` NaN rows so that every row has a full look-ahead window
dfa = df.reindex(np.arange(n_rows + window))
b_roll = sliding_window_view(dfa.b.to_numpy(), window)[1:]
diff = (b_roll.T - df.a.values).T  # next `window` "b" values minus the current "a"

pos = (diff >= 8)
pos_idx = pos.argmax(1)  # column of the first occurrence in each row
pos_idx[pos.sum(1) == 0] = n_rows  # tell "first hit at column 0" apart from "no occurrence found"

neg = (diff <= -2)
neg_idx = window - neg[:, ::-1].argmax(1) - 1  # column of the last occurrence in each row
neg_idx[neg.sum(1) == 0] = 0  # tell a real hit apart from "no occurrence found"

df["c"] = (pos_idx < neg_idx).astype(int)
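To make the windowing step easier to follow, here is a minimal sketch on a toy array. The names `a`, `b`, `padded`, `lookahead` and `mask`, and the window of 2, are chosen only for illustration. It pads with NaN directly instead of reindexing the DataFrame, and uses broadcasting in place of the double transpose, which should give the same kind of matrix.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([1.0, 0.0, 9.0, 0.0, 15.0])
window = 2  # illustrative value, smaller than the 5 used above

# Pad the tail with NaN so that the last rows still get a full-width window.
padded = np.concatenate([b, np.full(window, np.nan)])

# After dropping the first window, row i holds b[i+1 : i+1+window].
lookahead = sliding_window_view(padded, window)[1:]

# Broadcasting a as a column subtracts a[i] from every entry of row i.
diff = lookahead - a[:, None]

# Comparisons against NaN are False, so the padding never counts as a hit.
mask = diff >= 8
print(lookahead)
print(mask)
```

Because the padded cells compare as False, no extra bookkeeping is needed for the rows near the end of the data.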

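As you may notice, the output I suggest does not exactly match yours. I believe your snippet does not fully represent your description, but I may simply have misunderstood something in the logic.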

Output:

    a   b  c
0   1   1  0
1   2   0  1
2   3   9  1
3   4   0  1
4   5  15  0
5   6   0  0
6   7  20  0
7   8  15  1
8   9  23  0
9  10   6  0
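Reading the two snippets side by side, the difference in output seems to come from two choices. The answer's code compares the first positive hit with the last negative hit and yields 0 when no negative hit exists. The question's loop compares the first positive hit with the first negative hit and treats a missing negative hit as passing. If the loop's behaviour is the intended one, a vectorized sketch along the following lines should reproduce it. The helper name `first_hit_flags` is hypothetical, and the code assumes numeric columns named "a" and "b".

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def first_hit_flags(df, window=5, positive_thresh=8, negative_thresh=-2):
    """Hypothetical helper: 1 where the first delta >= positive_thresh occurs
    before the first delta <= negative_thresh within the next `window` rows."""
    a = df["a"].to_numpy(dtype=float)
    b = df["b"].to_numpy(dtype=float)

    # Row i of `lookahead` holds b[i+1 : i+1+window], NaN-padded at the tail.
    padded = np.concatenate([b, np.full(window, np.nan)])
    lookahead = sliding_window_view(padded, window)[1:]
    diff = lookahead - a[:, None]

    pos = diff >= positive_thresh
    neg = diff <= negative_thresh

    # argmax gives the first True per row; rows without any True get the
    # sentinel `window`, which is larger than every valid column position.
    first_pos = np.where(pos.any(axis=1), pos.argmax(axis=1), window)
    first_neg = np.where(neg.any(axis=1), neg.argmax(axis=1), window)
    return (first_pos < first_neg).astype(int)


df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   "b": [1, 0, 9, 0, 15, 0, 20, 15, 23, 6]})
c = first_hit_flags(df)
df["c"] = c
print(df)
```

On the sample data this gives the same "c" column as the loop in the question. Memory use grows with the number of rows times the window, so a very large window may call for processing the frame in chunks.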
answered 2022-01-28T16:01:00.927