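I am fitting an ARIMAX model with the statsmodels package to generate time series forecasts. One reason I chose ARIMAX is that, in theory, its linear form makes it easy to explain how the model inputs and coefficient values produce a forecast. I am trying to isolate the effect of each exogenous variable and each AR and MA parameter on a given forecast by rebuilding that forecast from the following components:
- the coefficients of the exogenous variables
- the first-differenced values of the exogenous variables for the period I am forecasting
- the AR coefficients
- the lagged first-differenced endog variable (used with the AR coefficients)
- the MA coefficients
- the residuals from the prior periods (used with the MA coefficients)
Multiplying each coefficient by its value and adding the total to the previous Y value gets me close to the forecast the model generates, but the result is always slightly off (by too much to be rounding error). Am I missing some component of the forecast, misunderstanding how ARIMAX generates forecasts, or is there simply a mistake in my math or code?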
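In symbols, the reconstruction I am attempting is roughly the following. The notation is shorthand introduced here for illustration only and does not come from statsmodels: $\beta_k$ are the exogenous coefficients, $\phi_i$ the AR coefficients, $\theta_j$ the MA coefficients, $\Delta$ a first difference, $e_t$ the model residuals, and $T$ the last training period.

```latex
\hat{y}_{T+1} \;\overset{?}{=}\; y_T
  + \sum_{k=1}^{3} \beta_k \,\Delta x_{k,T+1}
  + \sum_{i=1}^{2} \phi_i \,\Delta y_{T+1-i}
  + \sum_{j=1}^{2} \theta_j \, e_{T+1-j}
```

The question mark is deliberate: this is the expression I expected to match the forecast, not a formula I have confirmed the library uses.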
import pandas as pd
import numpy as np
import statsmodels.api as sm
# initialize random df
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=['y','x1','x2','x3'])
# set aside last row as prediction set, everything else as train
df_train = df.iloc[0:-1]
df_pred = df.iloc[-1]
# train arima 2,1,2 on df_train
model = sm.tsa.statespace.SARIMAX(df_train['y'],
                                  df_train[['x1','x2','x3']],
                                  order=[2,1,2],
                                  enforce_stationarity=False,
                                  enforce_invertibility=False)
# fit the model once and reuse the results object
results = model.fit()
# generate prediction using exog values for prediction row
predicted_y = results.get_forecast(1, exog=df_pred[['x1','x2','x3']]).predicted_mean.to_list()[0]
# use model parameters to decompose linear contributors to prediction
# find the model coefficients
coef_frame = results.params.reset_index().set_axis(['parameter','coefficient'],axis=1)
# first difference the df to account for the d=1 differencing order
df_1st_diff = df - df.shift(1)
# reshape exog vars from last row of 1st diff - these are values prediction should be based on
pred_1st_diff_exog = df_1st_diff.iloc[-1][['x1','x2','x3']].transpose().reset_index().set_axis(['parameter','value'],axis=1)
# merge coefficients with exog variable values used to generate prediction
param_frame = coef_frame.merge(pred_1st_diff_exog,how='left')
# extract residuals for passing to MA
resids = results.resid.to_list()
# extract the ar and ma orders for the given model
ar_order = results.model_orders['ar']
ma_order = results.model_orders['ma']
# gather appropriate values for ar (lagged 1st diff ys) and ma (lagged residuals), plug into param frame values
# iterate through numbers up to ar model order and extract values
for i in range(1,ar_order+1):
    # the ar.L1 lag is the second-to-last row of the first-diff frame, because the last row
    # corresponds to the prediction row; each further lag steps back one more row
    arlag = df_1st_diff.iloc[0-(i+1)]['y']
    ar_string = 'ar.L'+str(i)
    param_frame.loc[param_frame['parameter']==ar_string,'value'] = arlag
for i in range(1,ma_order+1):
    # the ma.L1 lag is the last training residual, ma.L2 the one before it
    malag = resids[0-i]
    ma_string = 'ma.L'+str(i)
    param_frame.loc[param_frame['parameter']==ma_string,'value'] = malag
# multiply coefficient by value to get impact of each parameter on prediction
param_frame['impact'] = param_frame['coefficient']*param_frame['value']
# take the sum of all of the impacts used to generate prediction
impact_sum = param_frame['impact'].sum()
# store the last y and the last first differenced y
last_y = df_train['y'].iloc[-1]
last_y_1st_diff = df_1st_diff['y'].iloc[-2]
print('Predicted Y: {0}\nSum of impacts: {1}\nLast Y: {2}\n1st diff of last y: {3}'.format(
    predicted_y,impact_sum,last_y,last_y_1st_diff))
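The multiply-and-sum step at the end of the script can be isolated in a small, dependency-free sketch. The function name `reconstruct_forecast` and every number below are hypothetical, made up only to show the arithmetic; they are not output from the model above, and the sketch only mirrors my attempted reconstruction rather than whatever statsmodels does internally.

```python
def reconstruct_forecast(last_y, coefs, values):
    """Return (last_y + sum of coefficient * value, per-parameter impacts).

    coefs and values are dicts keyed by parameter name (e.g. 'x1', 'ar.L1').
    Parameters with no matching value, such as 'sigma2', are skipped, which
    mirrors how the NaN impacts are ignored by pandas' sum() in the script.
    """
    impacts = {name: coefs[name] * values[name] for name in coefs if name in values}
    return last_y + sum(impacts.values()), impacts


# Hypothetical numbers, chosen only to illustrate the arithmetic.
coefs = {'x1': 0.5, 'ar.L1': -0.25, 'ma.L1': 0.5, 'sigma2': 4.0}
values = {'x1': 2.0, 'ar.L1': 4.0, 'ma.L1': -2.0}

forecast, impacts = reconstruct_forecast(10.0, coefs, values)
print(forecast, impacts)
```

With real inputs, `forecast` is the quantity I am comparing against `predicted_y`.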