python - How to group a DataFrame by every two months

Question

I have a DataFrame with year, month and score columns. For example:

import pandas as pd

df = pd.DataFrame({'year': [2020, 2020, 2021, 2021],
                   'month': [1, 2, 3, 4],
                   'score': [10, 20, 30, 40]})

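I want to group by year and by every two months. The grouped DataFrame should contain the year, the two-month period (e.g. 1-2, 3-4, etc.) and the mean score.

I found in other answers that I can map the months: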

months = { '1' : 'B1',
  '2' : 'B1',
  '3' : 'B2',
  '4' : 'B2',
  '5' : 'B3',
  '6' : 'B3',
  '7' : 'B4',
  '8' : 'B4',
  '9' : 'B5',
  '10' : 'B5',
  '11' : 'B6',
  '12' : 'B6' }
   
df['two_months'] = df['month'].astype(str).map(months)

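Then I can group:

df.groupby(['year', 'two_months'])[['score']].mean()

The problem is that two_months is then a string, so I lose the ability to sort it the way I could with a datetime object. My question: is there another way to do this?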

score 2 · Accepted Answer

The first idea is to use some math: subtract 1, then integer-divide by 2:

s = (df['month'] - 1) // 2 + 1
df0 = df.groupby(['year', s.rename('two_months')])['score'].mean()
print(df0)
year  two_months
2020  1             15.0
2021  2             35.0
Name: score, dtype: float64
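To see why the arithmetic above works, here is a small check that applies the same expression to all twelve months. The variable names are only for illustration; the point is that consecutive month pairs land in the same integer bin, which stays numerically sortable.

```python
import pandas as pd

# All calendar months, 1 through 12.
months = pd.Series(range(1, 13), name='month')

# Same expression as in the answer: shift to 0-based, halve, shift back to 1-based.
two_months = (months - 1) // 2 + 1

print(two_months.tolist())
# [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6]
```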

Or create datetimes and use Grouper:

df['date'] = pd.to_datetime(df[['month', 'year']].assign(day=1))

df1 = df.groupby(['year', pd.Grouper(freq='2MS', key='date')])['score'].mean()
print(df1)
year  date      
2020  2020-01-01    15.0
2021  2021-03-01    35.0
Name: score, dtype: float64
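As a possible variation (not part of the original answer), the datetime key can also be built arithmetically, so that every bin is labelled by the first day of its odd starting month (January, March, May and so on). This is only a sketch: `start_month` and `bin_start` are names made up for the example, and it assumes the `month` column holds integers from 1 to 12.

```python
import pandas as pd

df = pd.DataFrame({'year': [2020, 2020, 2021, 2021],
                   'month': [1, 2, 3, 4],
                   'score': [10, 20, 30, 40]})

# First month of each two-month bin: 1, 3, 5, 7, 9 or 11.
start_month = (df['month'] - 1) // 2 * 2 + 1

# pd.to_datetime can assemble datetimes from year/month/day columns.
bin_start = pd.to_datetime(
    pd.DataFrame({'year': df['year'], 'month': start_month, 'day': 1})
)

# Group on the datetime label; only bins that contain data appear.
out = df.groupby(bin_start.rename('bin_start'))['score'].mean()
print(out)
```

The result is indexed by real timestamps, so it sorts chronologically without a separate year key.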

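If you group only by the datetimes or month periods (without the year), there is a problem with bins that contain no data: they show up as NaN rows, for example: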

df['date'] = pd.to_datetime(df[['month', 'year']].assign(day=1))

df2 = df.groupby( pd.Grouper(freq='2MS', key='date'))['score'].mean()
print (df2)
date
2020-01-01    15.0
2020-03-01     NaN
2020-05-01     NaN
2020-07-01     NaN
2020-09-01     NaN
2020-11-01     NaN
2021-01-01     NaN
2021-03-01    35.0
Freq: 2MS, Name: score, dtype: float64

df['per'] = pd.to_datetime(df[['month', 'year']].assign(day=1)).dt.to_period('m')
df3 = df.set_index('per').groupby( pd.Grouper(freq='2M'))['score'].mean()
print (df3)
per
2020-01    15.0
2020-03     NaN
2020-05     NaN
2020-07     NaN
2020-09     NaN
2020-11     NaN
2021-01     NaN
2021-03    35.0
Freq: 2M, Name: score, dtype: float64

To remove the NaNs you can use:

df2 = df2.dropna()
df3 = df3.dropna()
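If the readable labels from the question (1-2, 3-4, ...) are wanted while keeping a meaningful sort order, one option is an ordered categorical. This is a sketch under assumptions rather than part of the answer above: the `labels` list and `bin_no` name are invented for the example, and `observed=True` is used so that unused label combinations are not added to the result.

```python
import pandas as pd

df = pd.DataFrame({'year': [2020, 2020, 2021, 2021],
                   'month': [1, 2, 3, 4],
                   'score': [10, 20, 30, 40]})

# Integer bin number 1..6, as in the first approach.
bin_no = (df['month'] - 1) // 2 + 1

# Labels '1-2', '3-4', ..., '11-12' in calendar order.
labels = [f'{2 * b - 1}-{2 * b}' for b in range(1, 7)]

# Ordered categorical: sorts by calendar order, not alphabetically
# (a plain string '11-12' would otherwise sort before '3-4').
df['two_months'] = pd.Categorical.from_codes(
    (bin_no - 1).to_numpy(), categories=labels, ordered=True
)

result = (df.groupby(['year', 'two_months'], observed=True)['score']
            .mean()
            .reset_index())
print(result)
```

The resulting frame has the year, two_months and score columns, and `result.sort_values(['year', 'two_months'])` follows the calendar order of the labels.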
