
I have seen solutions in different languages (e.g. SQL, Fortran, or C++), mostly loop-based.

I am hoping someone can help me solve this task with pandas.


Say I have a dataframe that looks like this:

      date  pcp  sum_count  sumcum
 7/13/2013  0.1        3.0    48.7
 7/14/2013 48.5
 7/15/2013  0.1
 7/16/2013
  8/1/2013  1.5        1.0     1.5
  8/2/2013
  8/3/2013
  8/4/2013  0.1        2.0     3.6
  8/5/2013  3.5
 9/22/2013  0.3        3.0    26.3
 9/23/2013 14.0
 9/24/2013 12.0
 9/25/2013
 9/26/2013
 10/1/2014  0.1       11.0   
 10/2/2014 96.0              135.5
 10/3/2014  2.5
 10/4/2014 37.0
 10/5/2014  9.5
 10/6/2014 26.5
 10/7/2014  0.5
 10/8/2014 25.5
 10/9/2014  2.0
10/10/2014  5.5
10/11/2014  5.5

Here is what I would like to do:

Step 1: Create the sum_count column by counting the total number of consecutive non-zero values in the "pcp" column.

Step 2: Create the sumcum column by computing the sum of each consecutive run of "pcp" values.

Step 3: Create a pivot table that looks like this:

year   max_sum_count
2013   48.7
2014   135.5

But!! max_sum_count is based on the condition that sum_count = 3.


I would appreciate any help! Thank you!


Updated question:

I stressed earlier that sum_count should only return the maximum of 3 consecutive pcps. However, I mistakenly posted the wrong dataframe and had to edit it. Sorry.

The sum of 135.5 comes from 96.0 + 2.5 + 37.0. It is the maximum over 3 consecutive pcps within the run whose sum_count is 11.
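The arithmetic above can be checked with a fixed-window rolling sum (a minimal sketch; the eleven values are copied from the 2014 run in the dataframe):

```python
import pandas as pd

# The 11 consecutive pcp values from 10/1/2014 through 10/11/2014
pcp = pd.Series([0.1, 96.0, 2.5, 37.0, 9.5, 26.5, 0.5, 25.5, 2.0, 5.5, 5.5])

# Maximum sum over any 3 consecutive values within the run
max3 = pcp.rolling(3).sum().max()
print(max3)  # 135.5  (96.0 + 2.5 + 37.0)
```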

Thanks


2 Answers


Use:

import pandas as pd

# rolling window length in days
N = 3

df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')

# mask of missing pcp values
m = df['pcp'].isna()

# label each run of consecutive non-NaN values
df['g'] = m.cumsum()[~m]

# extract years
df['year'] = df.index.year

# drop the NaN rows
df = df[~m].copy()

# keep only runs with at least N values
df['sum_count1'] = df.groupby(['g','year'])['g'].transform('size')
df = df[df['sum_count1'].ge(N)].copy()


# rolling N-day sum within each run
df['sumcum1'] = (df.groupby(['g','year'])
                   .rolling(f'{N}D')['pcp']
                   .sum()
                   .reset_index(level=[0, 1], drop=True))

# yearly maximum of the rolling sums; reindex adds any missing years
r = range(df['year'].min(), df['year'].max() + 1)
df1 = df.groupby('year')['sumcum1'].max().reindex(r).reset_index(name='max_sum_count')

print (df1)
   year  max_sum_count
0  2013           48.7
1  2014          135.5
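The consecutive-run labelling used above (`m.cumsum()[~m]`) is easiest to see on a tiny series (a standalone sketch with made-up values):

```python
import pandas as pd
import numpy as np

s = pd.Series([0.1, 48.5, np.nan, 1.5, np.nan, 0.3, 14.0, 12.0])
m = s.isna()

# Each NaN bumps the counter, so every run of consecutive
# non-NaN values shares one group id; NaN rows are excluded.
g = m.cumsum()[~m]
print(g.tolist())  # [0, 0, 1, 2, 2, 2]
```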
answered 2021-09-09T07:52:35.860

First, convert `date` to a true `datetime` dtype and create a boolean mask that keeps the rows where `pcp` is not null. Then you can build the groups and compute the variables:

Input data:

>>> df
          date   pcp
0    7/13/2013   0.1
1    7/14/2013  48.5
2    7/15/2013   0.1
3    7/16/2013   NaN
4     8/1/2013   1.5
5     8/2/2013   NaN
6     8/3/2013   NaN
7     8/4/2013   0.1
8     8/5/2013   3.5
9    9/22/2013   0.3
10   9/23/2013  14.0
11   9/24/2013  12.0
12   9/25/2013   NaN
13   9/26/2013   NaN
14   10/1/2014   0.1
15   10/2/2014  96.0
16   10/3/2014   2.5
17   10/4/2014  37.0
18   10/5/2014   9.5
19   10/6/2014  26.5
20   10/7/2014   0.5
21   10/8/2014  25.5
22   10/9/2014   2.0
23  10/10/2014   5.5
24  10/11/2014   5.5

Code:

import pandas as pd

df['date'] = pd.to_datetime(df['date'])
mask = df['pcp'].notna()

# start a new group whenever a date is not exactly one day after the previous one
grp = df.loc[mask, 'date'] \
        .ne(df.loc[mask, 'date'].shift().add(pd.Timedelta(days=1))) \
        .cumsum()

# aggregate each run and write the results back onto its first row
df = df.join(df.reset_index()
               .groupby(grp)
               .agg(index=('index', 'first'),
                    sum_count=('pcp', 'size'),
                    sumcum=('pcp', 'sum'))
               .set_index('index'))

# yearly maximum of the run sums
pivot = df.groupby(df['date'].dt.year)['sumcum'].max() \
          .rename('max_sum_count').reset_index()

Output:

>>> df
         date   pcp  sum_count  sumcum
0  2013-07-13   0.1        3.0    48.7
1  2013-07-14  48.5        NaN     NaN
2  2013-07-15   0.1        NaN     NaN
3  2013-07-16   NaN        NaN     NaN
4  2013-08-01   1.5        1.0     1.5
5  2013-08-02   NaN        NaN     NaN
6  2013-08-03   NaN        NaN     NaN
7  2013-08-04   0.1        2.0     3.6
8  2013-08-05   3.5        NaN     NaN
9  2013-09-22   0.3        3.0    26.3
10 2013-09-23  14.0        NaN     NaN
11 2013-09-24  12.0        NaN     NaN
12 2013-09-25   NaN        NaN     NaN
13 2013-09-26   NaN        NaN     NaN
14 2014-10-01   0.1       11.0   210.6
15 2014-10-02  96.0        NaN     NaN
16 2014-10-03   2.5        NaN     NaN
17 2014-10-04  37.0        NaN     NaN
18 2014-10-05   9.5        NaN     NaN
19 2014-10-06  26.5        NaN     NaN
20 2014-10-07   0.5        NaN     NaN
21 2014-10-08  25.5        NaN     NaN
22 2014-10-09   2.0        NaN     NaN
23 2014-10-10   5.5        NaN     NaN
24 2014-10-11   5.5        NaN     NaN

>>> pivot
   date  max_sum_count
0  2013           48.7
1  2014          210.6
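The `grp` trick in this answer starts a new group whenever the date sequence breaks; a standalone sketch with a few of the question's dates:

```python
import pandas as pd

dates = pd.to_datetime(['2013-07-13', '2013-07-14', '2013-07-15',
                        '2013-08-04', '2013-08-05', '2013-09-22'])
s = pd.Series(dates)

# True (and a new group id) whenever a date is not exactly
# one day after its predecessor; cumsum turns that into labels.
grp = s.ne(s.shift().add(pd.Timedelta(days=1))).cumsum()
print(grp.tolist())  # [1, 1, 1, 2, 2, 3]
```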
answered 2021-09-09T07:57:05.720