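I would like to compare the output of multiple model runs, calculating these values:
- Difference between current-period revenue and previous-period revenue
- Difference between actual current-period revenue and forecasted current-period revenue
I have experimented with MultiIndex, and I suspect the answer lies in that direction with some creative use of shift(). However, I'm afraid I've mangled the problem by haphazardly applying various pivot/melt/groupby experiments. Perhaps you can help me figure out how to turn this: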
import pandas as pd
ids = [1,2,3] * 5
year = ['2013', '2013', '2013', '2014', '2014', '2014', '2014', '2014', '2014', '2015', '2015', '2015', '2015', '2015', '2015']
run = ['actual','actual','actual','forecast','forecast','forecast','actual','actual','actual','forecast','forecast','forecast','actual','actual','actual']
revenue = [10,20,20,30,50,90,10,40,50,120,210,150,130,100,190]
change_from_previous_year = ['NA','NA','NA',20,30,70,0,20,30,90,160,60,120,60,140]
change_from_forecast = ['NA','NA','NA','NA','NA','NA',-20,-10,-40,'NA','NA','NA',10,-110,40]
d = {'ids':ids, 'year':year, 'run':run, 'revenue':revenue}
df = pd.DataFrame(data=d, columns=['ids','year','run','revenue'])
print(df)
    ids  year       run  revenue
0     1  2013    actual       10
1     2  2013    actual       20
2     3  2013    actual       20
3     1  2014  forecast       30
4     2  2014  forecast       50
5     3  2014  forecast       90
6     1  2014    actual       10
7     2  2014    actual       40
8     3  2014    actual       50
9     1  2015  forecast      120
10    2  2015  forecast      210
11    3  2015  forecast      150
12    1  2015    actual      130
13    2  2015    actual      100
14    3  2015    actual      190
...into this:
    ids  year       run  revenue  chg_from_prev_year  chg_from_forecast
0     1  2013    actual       10                  NA                 NA
1     2  2013    actual       20                  NA                 NA
2     3  2013    actual       20                  NA                 NA
3     1  2014  forecast       30                  20                 NA
4     2  2014  forecast       50                  30                 NA
5     3  2014  forecast       90                  70                 NA
6     1  2014    actual       10                   0                -20
7     2  2014    actual       40                  20                -10
8     3  2014    actual       50                  30                -40
9     1  2015  forecast      120                  90                 NA
10    2  2015  forecast      210                 160                 NA
11    3  2015  forecast      150                  60                 NA
12    1  2015    actual      130                 120                 10
13    2  2015    actual      100                  60               -110
14    3  2015    actual      190                 140                 40
Edit: I'm very close with this:
df['prev_year'] = df.groupby(['ids','run']).shift(1)['revenue']
df['chg_from_prev_year'] = df['revenue'] - df['prev_year']
df['curr_forecast'] = df.groupby(['ids','year']).shift(1)['revenue']
df['chg_from_forecast'] = df['revenue'] - df['curr_forecast']
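The only thing this misses (as expected) is the comparison between the 2014 forecast and the 2013 actual. I could duplicate the 2013 run in the dataset, calculate chg_from_prev_year for the 2014 forecast, and then hide/delete the unneeded data from the final dataframe.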
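One possible way to avoid duplicating the 2013 run is sketched below. It assumes that a row with no earlier run of the same type (such as the 2014 forecast) should fall back to the previous year's actual for the same id, and that each (ids, year) pair has at most one actual row. The helper names prev_same_run, prev_actual and merged are only illustrative, and this is a rough sketch rather than a definitive solution:

```python
import pandas as pd

ids = [1, 2, 3] * 5
year = ['2013'] * 3 + ['2014'] * 6 + ['2015'] * 6
run = ['actual'] * 3 + (['forecast'] * 3 + ['actual'] * 3) * 2
revenue = [10, 20, 20, 30, 50, 90, 10, 40, 50, 120, 210, 150, 130, 100, 190]
df = pd.DataFrame({'ids': ids, 'year': year, 'run': run, 'revenue': revenue})

# Previous value of the same run type for the same id (as in the edit above).
prev_same_run = df.groupby(['ids', 'run'])['revenue'].shift(1)

# Fallback (assumption): the previous year's actual for the same id.
# Shifting the year forward by one lets us join on (ids, year).
prev_actual = df.loc[df['run'] == 'actual', ['ids', 'year', 'revenue']].copy()
prev_actual['year'] = (prev_actual['year'].astype(int) + 1).astype(str)
prev_actual = prev_actual.rename(columns={'revenue': 'prev_actual'})
merged = df.merge(prev_actual, on=['ids', 'year'], how='left')

# Use the same-run value where it exists, otherwise the previous actual.
# to_numpy() sidesteps index alignment; the left merge keeps df's row order.
df['prev_year'] = prev_same_run.fillna(
    pd.Series(merged['prev_actual'].to_numpy(), index=df.index)
)
df['chg_from_prev_year'] = df['revenue'] - df['prev_year']

# Forecast for the same id and year; this relies on the forecast row
# appearing before the actual row, as it does in the sample data.
curr_forecast = df.groupby(['ids', 'year'])['revenue'].shift(1)
df['chg_from_forecast'] = (df['revenue'] - curr_forecast).where(df['run'] == 'actual')

print(df.drop(columns='prev_year'))
```

If the rows are not guaranteed to be ordered with the forecast before the actual, sorting by ids, year and run first (or merging the forecast rows in, the same way as prev_actual) would probably be safer than relying on shift().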