我有一个 2043 小时的时间序列。我的目标是通过高斯过程回归 (GPR) 模型训练和测试预测模型,包括其不确定性估计。我在应用 R 包caret + kernlab::train( )模型的联合的准时预测方面取得了很好的结果。我如何使用这些 R 包估计它的预测区间?
我的 .csv 数据可通过三个 Dropbox 链接获得:
目的是使用“ST3”和“ST4_lag1”回归量预测“ST4_lead1”。它们是我们在 data.frames 上的列。
从现在开始,我将展示我的代码。第一个窗口代码对我们双方都是通用的。您可以完全运行它。
# Forecasting dataseries: caret + kernlab packages
################ Load R packages
library(readr)
library(tidyverse)
library(caret)
library(kernlab)
library(hydroGOF)
library(lubridate)
################ Load data
setwd("C:/...") # Set the folder you saved the downloaded .csv data
data_train_emd <- read.csv("data_train_emd.csv", sep = ';')
data_test_emd <- read.csv("data_test_emd.csv", sep = ';')
final_df_emd <- read.csv("final_df_emd.csv", sep = ';')
################ GaussprRadial model Preprocess
set.seed(111)
cvCtrl <- trainControl(
method = "repeatedcv", # Cross-correlation preprocess
repeats = 1,
number = 3,
allowParallel = TRUE,
verboseIter = TRUE,
savePredictions = "final")
################ Set gaussprRadial grid for tuning parameters
grid_gaussprRadial <- expand.grid(sigma = c(1*1e-3, 5*1e-3))
################ Implement Gaussian Process
################ Train
set.seed(111)
attach(data_train_emd)
system.time(Fitting_model <- train(ST4_lead1 ~ ST3 + ST4_lag1, # Predict "ST4_lead1" based on regressors "ST3" and "ST4_lag1"
method ="gaussprRadial", # Radial Basis kernel function
trControl = cvCtrl,
data = data_train_emd,
metric = "MAE", # Using "MAE" as fitting parameters, since I'm focusing on low data
preProcess = c("center", "scale"), # Centering and scaling data Preprocess
tuneGrid = grid_gaussprRadial,
maxit = 1000,
linout = 1)) # 1 for regression model
#The training process lasts 14 seconds on my machine.
################ Test
attach(final_df_emd)
forecasted <- data_test_emd %>%
mutate(Qpred = predict(Fitting_model, newdata = data_test_emd),
Time = final_df_emd %>% filter(Type == "Test") %>% pull(Time))
################ Tidy training results
################ View metrics for training process
results_gaussprRadial <- Fitting_model$resample %>% mutate(Model = "")
colnames(results_gaussprRadial) = c("RMSE", "R2", "MAE", "Sigma", "Model")
results_gaussprRadial %>%
select(RMSE, R2, MAE, Model) %>%
gather(a, b, -Model) %>%
ggplot(aes(Model, b, color = Model, fill = Model)) +
geom_boxplot(alpha = 0.3, show.legend = FALSE) +
facet_wrap(~ a, scales = "free") +
labs(x = "Train simulation performances", y = NULL)
QObs_Tr = Fitting_model$pred$obs # Observed data to be trained
Qsim = Fitting_model$pred$pred # Simulated data in the training process
Residual_Tr = QObs_Tr - Qsim
train_obs_calc <- data.frame(QObs_Tr, Qsim, Residual_Tr)
ggplot(data = train_obs_calc, aes(x = QObs_Tr, y = Qsim)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, color = "black") +
labs(title = "Error Scatterplot", x = "Q(m³/s)", y = "Q(m³/s)")
我们已经完成了常见的培训和测试过程以及获得测试结果。
在第二个窗口代码中,我们将从预测对象中提取一些测试结果。顺便说一句,我希望预测会为我们带来所需的预测间隔,我需要帮助。
################ Tidy testing results
################ Set time period
N.hours <- as.numeric(ymd_h("2016-10-23 10") - ymd_h("2016-10-02 05"))*24 # Set date limits
Time = ymd_h("2016-10-02 05") + hours(0:N.hours)
################ View testing process results
QObs_Test = forecasted$ST4_lead1 # Observed data to be tested
QPred = forecasted$Qpred # Simulated data in the testing process
Residual_Test = QObs_Test - QPred
test_obs_calc <- data.frame(Time, QObs_Test, QPred, Residual_Test)
ggplot(data = test_obs_calc, aes(x = Time, y = QObs_Test)) +
geom_line() +
geom_point(aes(y = QPred), color = "coral") +
labs(title = "Testing Results", x = "", y = "Q(m³/s)")
它为我们带来了测试结果图,如下所示。实线是观察到的数据,而红点是预测的估计值。
最后,在第三个窗口代码中,我们只计算回归的指标:
################ Metrics performance
################ Train
gof(QObs_Tr, Qsim)
################ Test
gof(QObs_Test, QPred)
如我们所见,对于训练过程,R² = 0.76 和 MAE = 1.06;而对于测试对象,R² = 0.80 和 MAE = 1.36。这些对我们来说已经足够好了。
在 GPR 中,预测倒数基于每个预测点周围的高斯分布,由Rasmussen 和 Williams (2006)提供。我曾通过其他三种方式尝试过 GPR:仅使用kernlab::train( )、tgp::bgp( )和gplmr::train( )。这最后一个是由duckmayr亲切地建议的。对于所有这些,我找到了预测区间。然而,“ caret + kernlab::train() ”的联合使用(如我的代码所示)返回了更好的准时预测。我不知道为什么。
正如我在第一个和第二个窗口代码之间提到的,我可能期望从“预测”对象中,我们可以从每个预测点提取平均数据并添加(或减去)其标准偏差的 2 倍,允许y = average + - 2*标准差。但是,我不知道如何提取这个平均值,甚至是每个预测点的标准偏差。
简而言之,如何使用 caret + kernlab 包的联合使用来获得高斯过程回归的预测区间?