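Below is R code that does the following:

1. Generates data for a linear regression (4 predictors, multivariate normal data, based on a correlation matrix).
2. Runs 10-fold cross-validation with caret, which provides summary R2 results.
3. Correlates the predicted values from all folds with the actual values, then squares that correlation to get a cross-validated R2. This is the variable "ar2" in the code below.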

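My question is about #3 above: why doesn't caret just compute this? Instead, it reports the R2 within each fold, the variability of R2 across folds, and so on. But if I want to know the overall out-of-sample prediction based on the cross-validation folds, #3 above seems more direct.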


# cross-validated linear regression
library(MASS)
library(caret)

# first generate random normal data
set.seed(123)   # arbitrary seed so the simulated data and folds are reproducible
sigma <- matrix(c( 1,  .35, .20, .10, .25, 
                  .35, 1  , .15, .30, .30,
                  .20, .15,  1 , .40, .20,
                  .10, .30, .40, 1  , .35,
                  .25, .30, .20, .35,   1), ncol=5)

d <- mvrnorm(n = 100, rep(0, 5), sigma)

# label variables here
colnames(d) <- c(paste0("x", 1:4),"y")
# look at top of data set
head(d)

# generate means and correlations
apply(d,2,mean)
cor(d)
d <- as.data.frame(d)

# what if we used the whole sample, no cross-validation?
full <- lm(y ~ ., data = d)
summary(full)

# now let's look at cross-validated prediction

data_ctrl <- trainControl(method = "cv", number = 10, savePredictions="all")     # folds for cross-validation
model_caret <- train(y ~ .,   # model to fit - the dot means include all x's
                     data = d,                        
                     trControl = data_ctrl,              # include the folds above
                     method = "lm")                      # specify linear regression                
model_caret           # results from cross-validation
# look at performance metrics (RMSE, Rsquared, MAE) for each fold
model_caret$resample
# summarized results
model_caret$results
# all data put into final model
summary(model_caret) 

# what is the r2 between observed and predicted values?
# get the predicted values across folds
a <- model_caret$pred
# correlate actual and predicted values (columns referenced by name, not position)
ar2 <- cor(a$obs, a$pred)^2
ar2

# ...we can compare this r2 (ar2) from cross-validation to the r2 from the full model
# and get a direct sense of how r2 goes down under cross validation...right?
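
To make the comparison concrete, here is a minimal base-R sketch (not caret's own code) that computes both summaries on the same held-out predictions: the average of the per-fold R2 values, which is the kind of number caret's default regression summary reports, and the pooled R2 from #3. The simulated data, the coefficients, the seed, and the names `fold`, `pred`, `fold_r2`, `pooled_r2`, and `press_r2` are all assumptions made for illustration. The last quantity, 1 - SSE/SST, is a third option added only for comparison; it is not something the code above computes.

```r
# Sketch only: manual 10-fold CV in base R, with made-up data.
set.seed(42)  # arbitrary seed, for reproducibility only
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
# hypothetical coefficients, chosen only so that y depends on the x's
d$y <- 0.3 * d$x1 + 0.2 * d$x2 + 0.2 * d$x3 + 0.3 * d$x4 + rnorm(n)

k <- 10
fold <- sample(rep(seq_len(k), length.out = n))  # random fold label per row

# held-out prediction for every row: fit on the other folds, predict this one
pred <- numeric(n)
for (i in seq_len(k)) {
  test <- fold == i
  fit <- lm(y ~ ., data = d[!test, ])
  pred[test] <- predict(fit, newdata = d[test, ])
}

# (a) squared correlation within each fold, then averaged across folds
fold_r2 <- sapply(seq_len(k), function(i) cor(pred[fold == i], d$y[fold == i])^2)
mean_fold_r2 <- mean(fold_r2)

# (b) squared correlation over all held-out predictions pooled together (#3)
pooled_r2 <- cor(pred, d$y)^2

# (c) 1 - SSE/SST on the pooled held-out predictions; unlike (b), this
#     penalizes bias and miscalibration and can be negative
press_r2 <- 1 - sum((d$y - pred)^2) / sum((d$y - mean(d$y))^2)

round(c(mean_fold = mean_fold_r2, pooled = pooled_r2, press = press_r2), 3)
sd(fold_r2)  # spread of the per-fold values
```

With only 10 held-out rows per fold, each per-fold R2 is noisy, so (a) and (b) can differ noticeably, and (c) can never exceed (b) because a squared correlation ignores any shift or rescaling of the predictions.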