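Below is R code that does the following:

1. Generates data for a linear regression (4 predictors, multivariate normal data, based on a correlation matrix).
2. Runs 10-fold cross-validation using caret, which provides summary R2 results.
3. Correlates the predicted values with the actual values across all folds, then squares that correlation to get a cross-validated R2. This is the variable "ar2" in the code below.

So my question is about #3 above: why doesn't caret just compute this? Instead, it reports the R2 within each fold, accounts for the variability of R2 across folds, and so on. But if I want to know the overall out-of-sample prediction across the cross-validation folds, #3 seems more direct.
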
# cross-validated linear regression
library(MASS)
library(caret)
# first generate random normal data (seed fixed so the results are reproducible)
set.seed(123)
sigma <- matrix(c(  1, .35, .20, .10, .25,
                  .35,   1, .15, .30, .30,
                  .20, .15,   1, .40, .20,
                  .10, .30, .40,   1, .35,
                  .25, .30, .20, .35,   1), ncol = 5)
d <- mvrnorm(n = 100, rep(0, 5), sigma)
# label variables here
colnames(d) <- c(paste0("x", 1:4),"y")
# look at top of data set
head(d)
# generate means and correlations
colMeans(d)
cor(d)
d <- as.data.frame(d)
# what if we used the whole sample, no cross-validation?
full <- lm(y ~ ., data = d)
summary(full)
# now let's look at cross-validated prediction
# folds for cross-validation; savePredictions = "all" keeps every held-out prediction
data_ctrl <- trainControl(method = "cv", number = 10, savePredictions = "all")
model_caret <- train(y ~ .,                 # model to fit - the dot means include all x's
                     data = d,
                     trControl = data_ctrl, # include the folds above
                     method = "lm")         # specify linear regression
model_caret # results from cross-validation
# look at predictions for each fold
model_caret$resample
# summarized results
model_caret$results
# all data put into final model
summary(model_caret)
# what is the r2 between observed and predicted values?
# get the predicted values across folds
a <- model_caret$pred
# correlate actual and predicted values
# (columns selected by name rather than by position)
ar2 <- cor(a$obs, a$pred)^2
ar2
# ...we can compare this r2 (ar2) from cross-validation to the r2 from the full model
# and get a direct sense of how r2 goes down under cross validation...right?
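
To make the comparison in #3 concrete, here is a minimal base-R sketch of the quantities involved. It is not caret's internal code. It runs 10-fold cross-validation by hand and computes (a) the squared correlation within each fold, averaged across folds, which is my understanding of what caret's Rsquared column summarizes, (b) the pooled squared correlation corresponding to "ar2", and (c) a pooled 1 - SSE/SST. The simulated data (independent predictors, made-up coefficients) and all names such as `fold`, `fold_r2`, `pooled_r2`, and `pooled_r2_sse` are assumptions for illustration only.

```r
# Illustrative data: independent predictors with made-up coefficients (assumption)
set.seed(1)
n <- 100
x <- matrix(rnorm(n * 4), ncol = 4, dimnames = list(NULL, paste0("x", 1:4)))
d <- data.frame(x, y = as.numeric(x %*% c(0.3, 0.2, 0.1, 0.2) + rnorm(n)))

# Assign each row to one of k folds at random (10 rows per fold here)
k <- 10
fold <- sample(rep(seq_len(k), length.out = n))

# Fit on k - 1 folds, predict the held-out fold, and store those predictions
pred <- numeric(n)
for (i in seq_len(k)) {
  test <- fold == i
  fit <- lm(y ~ ., data = d[!test, ])
  pred[test] <- predict(fit, newdata = d[test, ])
}

# (a) squared correlation within each fold, then averaged across folds
fold_r2 <- sapply(seq_len(k), function(i) cor(d$y[fold == i], pred[fold == i])^2)
mean_fold_r2 <- mean(fold_r2)

# (b) pooled: one squared correlation over all held-out predictions (like ar2)
pooled_r2 <- cor(d$y, pred)^2

# (c) pooled 1 - SSE/SST: unlike (b), this penalizes bias and scale errors
#     in the predictions, and it can be negative
pooled_r2_sse <- 1 - sum((d$y - pred)^2) / sum((d$y - mean(d$y))^2)

c(mean_fold_r2 = mean_fold_r2, pooled_r2 = pooled_r2, pooled_r2_sse = pooled_r2_sse)
```

With only 10 observations per fold, the per-fold values in (a) are noisy, which is one reason (a) and (b) can differ noticeably. Note also that (b) can never be smaller than (c), because a squared correlation ignores any systematic offset or scaling error in the predictions.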