python - Calculating differences from the previous year/forecast in a pandas DataFrame

Question

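I want to compare the output of multiple model runs by calculating these values:

Difference between current-period revenue and previous-period revenue
Difference between actual current-period revenue and forecast current-period revenue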

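I have experimented with multi-indexes, and I suspect the answer lies in that direction with some creative use of shift(). However, I'm afraid I have mangled the problem through haphazard application of various pivot/melt/groupby experiments. Perhaps you can help me figure out how to turn this: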

import pandas as pd

ids = [1,2,3] * 5
year = ['2013', '2013', '2013', '2014', '2014', '2014', '2014', '2014', '2014', '2015', '2015', '2015', '2015', '2015', '2015']
run = ['actual','actual','actual','forecast','forecast','forecast','actual','actual','actual','forecast','forecast','forecast','actual','actual','actual']

revenue = [10,20,20,30,50,90,10,40,50,120,210,150,130,100,190]

change_from_previous_year = ['NA','NA','NA',20,30,70,0,20,30,90,160,60,120,60,140]
change_from_forecast = ['NA','NA','NA','NA','NA','NA',-20,-10,-40,'NA','NA','NA',10,-110,40]

d = {'ids':ids, 'year':year, 'run':run, 'revenue':revenue}

df = pd.DataFrame(data=d, columns=['ids','year','run','revenue'])
print(df)

    ids  year       run  revenue
0     1  2013    actual       10
1     2  2013    actual       20
2     3  2013    actual       20
3     1  2014  forecast       30
4     2  2014  forecast       50
5     3  2014  forecast       90
6     1  2014    actual       10
7     2  2014    actual       40
8     3  2014    actual       50
9     1  2015  forecast      120
10    2  2015  forecast      210
11    3  2015  forecast      150
12    1  2015    actual      130
13    2  2015    actual      100
14    3  2015    actual      190

...into this:

    ids  year       run  revenue chg_from_prev_year chg_from_forecast
0     1  2013    actual       10                 NA                NA
1     2  2013    actual       20                 NA                NA
2     3  2013    actual       20                 NA                NA
3     1  2014  forecast       30                 20                NA
4     2  2014  forecast       50                 30                NA
5     3  2014  forecast       90                 70                NA
6     1  2014    actual       10                  0               -20
7     2  2014    actual       40                 20               -10
8     3  2014    actual       50                 30               -40
9     1  2015  forecast      120                 90                NA
10    2  2015  forecast      210                160                NA
11    3  2015  forecast      150                 60                NA
12    1  2015    actual      130                120                10
13    2  2015    actual      100                 60              -110
14    3  2015    actual      190                140                40

Edit: I get pretty close with this:

df['prev_year'] = df.groupby(['ids','run']).shift(1)['revenue']
df['chg_from_prev_year'] = df['revenue'] - df['prev_year']

df['curr_forecast'] = df.groupby(['ids','year']).shift(1)['revenue']
df['chg_from_forecast'] = df['revenue'] - df['curr_forecast']

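The only thing this misses (as expected) is the comparison between the 2014 forecast and the 2013 actual. I could duplicate the 2013 run in the dataset, calculate chg_from_prev_year for the 2014 forecast, and then hide/delete the unwanted rows from the final DataFrame.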

score 2 · Accepted Answer

First, to get the change from the previous year, do a shift within each group:

In [11]: g = df.groupby(['ids', 'run'])

In [12]: df['chg_from_prev_year'] = g['revenue'].apply(lambda x: x - x.shift())
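As a side note, the same per-group difference can probably be written with `diff()`, which avoids the `apply` call (in recent pandas versions, `apply` may add the group keys to the result's index, which can break the assignment back to `df`). This is only a sketch; it rebuilds the question's DataFrame so that it runs on its own, and it assumes the rows are already in year order within each (`ids`, `run`) group, as they are here.

```python
import pandas as pd

# Rebuild the DataFrame from the question.
df = pd.DataFrame({
    "ids": [1, 2, 3] * 5,
    "year": ["2013"] * 3 + ["2014"] * 6 + ["2015"] * 6,
    "run": ["actual"] * 3 + (["forecast"] * 3 + ["actual"] * 3) * 2,
    "revenue": [10, 20, 20, 30, 50, 90, 10, 40, 50,
                120, 210, 150, 130, 100, 190],
})

# diff() within each (ids, run) group is equivalent to x - x.shift().
# The first row of every group has no predecessor, so it becomes NaN.
df["chg_from_prev_year"] = df.groupby(["ids", "run"])["revenue"].diff()

print(df)
```

The result keeps the original index, so it can be assigned straight back as a column.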

The next part is more complicated; I think you need to do a pivot_table for it:

In [13]: df1 = df.pivot_table('revenue', ['ids', 'year'], 'run')

In [14]: df1
Out[14]:
run       actual  forecast
ids year
1   2013      10       NaN
    2014      10        30
    2015     130       120
2   2013      20       NaN
    2014      40        50
    2015     100       210
3   2013      20       NaN
    2014      50        90
    2015     190       150

In [15]: g1 = df1.groupby(level='ids', as_index=False)

In [16]: out_by = g1.apply(lambda x: x['actual'] - x['forecast'])

In [17]: out_by  # hello levels bug, fixed in 0.13/master... yesterday :)
Out[17]:
ids  ids  year
1    1    2013    NaN
          2014    -20
          2015     10
2    2    2013    NaN
          2014    -10
          2015   -110
3    3    2013    NaN
          2014    -40
          2015     40
dtype: float64
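Since the subtraction is row-wise, the groupby step is arguably not needed at all: subtracting the two pivoted columns directly should give the same numbers without the duplicated `ids` level. A minimal sketch, rebuilding the question's data so it is self-contained (the Series name `chg_from_forecast` is just a label chosen here):

```python
import pandas as pd

# Rebuild the DataFrame from the question.
df = pd.DataFrame({
    "ids": [1, 2, 3] * 5,
    "year": ["2013"] * 3 + ["2014"] * 6 + ["2015"] * 6,
    "run": ["actual"] * 3 + (["forecast"] * 3 + ["actual"] * 3) * 2,
    "revenue": [10, 20, 20, 30, 50, 90, 10, 40, 50,
                120, 210, 150, 130, 100, 190],
})

# One row per (ids, year), one column per run.
df1 = df.pivot_table(values="revenue", index=["ids", "year"], columns="run")

# Column-wise subtraction; 2013 has no forecast, so it stays NaN.
out_by = (df1["actual"] - df1["forecast"]).rename("chg_from_forecast")

print(out_by)
```

The index here is just (`ids`, `year`), so no `droplevel` is needed afterwards.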

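This is the result you want, but not in the correct format (see [31] below if you're not too fussed)... the following seems like a bit of a hack (to put it mildly), but here goes: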

In [21]: df2 = df.set_index(['ids', 'year', 'run'])

In [22]: out_by.index = out_by.index.droplevel(0)

In [23]: out_by_df = pd.DataFrame(out_by, columns=['revenue'])

In [24]: out_by_df['run'] = 'forecast'

In [25]: df2['chg_from_forecast'] = out_by_df.set_index('run', append=True)['revenue']

...and we're done:

In [26]: df2.reset_index()
Out[26]:
    ids  year       run  revenue  chg_from_prev_year  chg_from_forecast
0     1  2013    actual       10                 NaN                NaN
1     2  2013    actual       20                 NaN                NaN
2     3  2013    actual       20                 NaN                NaN
3     1  2014  forecast       30                 NaN                -20
4     2  2014  forecast       50                 NaN                -10
5     3  2014  forecast       90                 NaN                -40
6     1  2014    actual       10                   0                NaN
7     2  2014    actual       40                  20                NaN
8     3  2014    actual       50                  30                NaN
9     1  2015  forecast      120                  90                 10
10    2  2015  forecast      210                 160               -110
11    3  2015  forecast      150                  60                 40
12    1  2015    actual      130                 120                NaN
13    2  2015    actual      100                  60                NaN
14    3  2015    actual      190                 140                NaN
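Note that the table above carries chg_from_forecast on the forecast rows, whereas the question's desired output shows it on the actual rows. If the latter is what is needed, one possible alternative is a merge rather than a pivot. This is a sketch, not the method used above; `curr_forecast` is borrowed from the question's own attempt, and `is_actual` is a helper name introduced here.

```python
import pandas as pd

# Rebuild the DataFrame from the question.
df = pd.DataFrame({
    "ids": [1, 2, 3] * 5,
    "year": ["2013"] * 3 + ["2014"] * 6 + ["2015"] * 6,
    "run": ["actual"] * 3 + (["forecast"] * 3 + ["actual"] * 3) * 2,
    "revenue": [10, 20, 20, 30, 50, 90, 10, 40, 50,
                120, 210, 150, 130, 100, 190],
})

# Forecast revenue per (ids, year), renamed so it can sit beside revenue.
forecast = (
    df.loc[df["run"] == "forecast", ["ids", "year", "revenue"]]
      .rename(columns={"revenue": "curr_forecast"})
)

# A left merge keeps every original row, in the original order.
out = df.merge(forecast, on=["ids", "year"], how="left")

# Keep the difference on actual rows only; forecast rows become NaN.
is_actual = out["run"] == "actual"
out["chg_from_forecast"] = (out["revenue"] - out["curr_forecast"]).where(is_actual)

print(out.drop(columns="curr_forecast"))
```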

Note: I think the first 6 results of chg_from_prev_year should be NaN.
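If, as in the question's desired output, a forecast with no earlier forecast should instead be compared against the previous year's actual, a fallback can be layered on top of the per-group difference. The following is a rough sketch under that assumption; `prev_actual` and `fallback` are hypothetical helper names, and the year strings are assumed to convert cleanly to integers.

```python
import pandas as pd

# Rebuild the DataFrame from the question.
df = pd.DataFrame({
    "ids": [1, 2, 3] * 5,
    "year": ["2013"] * 3 + ["2014"] * 6 + ["2015"] * 6,
    "run": ["actual"] * 3 + (["forecast"] * 3 + ["actual"] * 3) * 2,
    "revenue": [10, 20, 20, 30, 50, 90, 10, 40, 50,
                120, 210, 150, 130, 100, 190],
})

# Change versus the previous row of the same (ids, run) group.
chg = df.groupby(["ids", "run"])["revenue"].diff()

# Actual revenue, relabelled with the following year so that merging on
# (ids, year) attaches the previous year's actual to every row.
actual = df.loc[df["run"] == "actual", ["ids", "year", "revenue"]].copy()
actual["year"] = (actual["year"].astype(int) + 1).astype(str)
actual = actual.rename(columns={"revenue": "prev_actual"})

out = df.merge(actual, on=["ids", "year"], how="left")

# Where the in-group difference is missing, fall back to the previous
# year's actual (still NaN for 2013, which has no previous year).
fallback = out["revenue"] - out["prev_actual"]
out["chg_from_prev_year"] = chg.fillna(fallback)

print(out.drop(columns="prev_actual"))
```

With the sample data this fills the 2014 forecast rows with 20, 30 and 70 and leaves only the 2013 rows as NaN.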

However, I think you may be better off keeping this as a pivot:

In [31]: df3 = df.pivot_table(['revenue', 'chg_from_prev_year'], ['ids', 'year'], 'run')

In [32]: df3['chg_from_forecast'] = g1.apply(lambda x: x['actual'] - x['forecast']).values

In [33]: df3
Out[33]:
          revenue            chg_from_prev_year            chg_from_forecast
run        actual  forecast              actual  forecast
ids year
1   2013       10       NaN                 NaN       NaN                NaN
    2014       10        30                   0       NaN                -20
    2015      130       120                 120        90                 10
2   2013       20       NaN                 NaN       NaN                NaN
    2014       40        50                  20       NaN                -10
    2015      100       210                  60       160               -110
3   2013       20       NaN                 NaN       NaN                NaN
    2014       50        90                  30       NaN                -40
    2015      190       150                 140        60                 40
