r - Which values to look at in cross-validated linear regression in the DAAG package

Question

I performed the following on a dataset with 151 variables and 161 observations:

> library(DAAG)
> fit <- lm(RT..seconds.~., data=cadets)
> cv.lm(df = cadets, fit, m = 10)

and got the following results:

fold 1 
Observations in test set: 16 
                  7     11     12      24     33    38      52     67     72
Predicted      49.6   44.1   26.4    39.8   53.3 40.33    47.8   56.7   58.5
cvpred        575.0 -113.2  640.7 -1045.8  876.7 -5.93  2183.0 -129.7  212.6
RT..seconds.   42.0   44.0   44.0    45.0   45.0 46.00    49.0   56.0   58.0
CV residual  -533.0  157.2 -596.7  1090.8 -831.7 51.93 -2134.0  185.7 -154.6

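What I want to do is compare the predicted results with the actual experimental results, so that I can plot the two against each other to show how similar they are. Am I right to assume that I should use the values in the Predicted row as my predictions, rather than cvpred?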

I only ask because when I do the same thing in the caret package, the predicted and observed values differ from each other far more:

> library(caret)
> ctrl <- trainControl(method = "cv", savePred = T, classProb = T)
> mod <- train(RT..seconds.~., data = cadets, method = "lm", trControl = ctrl)
> mod$pred

        pred obs rowIndex .parameter Resample
1      141.2  42        6       none   Fold01
2     -504.0  42        7       none   Fold01
3     1196.1  44       16       none   Fold01
4       45.0  45       27       none   Fold01
5      262.2  45       35       none   Fold01
6      570.9  52       58       none   Fold01
7     -166.3  53       61       none   Fold01
8    -1579.1  59       77       none   Fold01
9     2699.0  60       79       none   Fold01

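The model should not be this inaccurate, because I originally started with 1664 variables and reduced them using random forest, keeping only the variables with a variable importance greater than 1. This cut my dataset down considerably, from 162 * 1664 to 162 * 151.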

If someone could explain this to me I would be grateful, thanks.

score 5 · Accepted Answer

I think there are a few areas of confusion here; let me try to clear them up for you.

The "Predicted" section from cv.lm does not correspond to results from cross-validation. If you're interested in cross-validation, you need to look at your "cvpred" results. "Predicted" corresponds to predictions from the model fit using all of your data.

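The large gap between your "Predicted" and "cvpred" values is most likely because your final model is overfitting: with 151 predictors and only 161 observations, a linear model fit on all of the data has almost no residual degrees of freedom and can follow the training data very closely. This illustrates why cross-validation is so important.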
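To make the distinction concrete, here is a minimal hand-rolled sketch of 10-fold cross-validation in base R. It is not how cv.lm is implemented. It uses the built-in iris data rather than cadets, and the seed and the variable names (fold, predicted, cvpred, rmse) are my own choices for illustration.

```r
# Hand-rolled 10-fold cross-validation for lm() on the built-in iris data.
set.seed(29)                                     # arbitrary seed, for reproducibility only
k <- 10
n <- nrow(iris)
fold <- sample(rep(seq_len(k), length.out = n))  # random fold label for every row

# In-sample predictions: a single model fit on every row
# (the analogue of the "Predicted" row).
full_fit <- lm(Sepal.Length ~ ., data = iris)
predicted <- unname(fitted(full_fit))

# Out-of-fold predictions: each row is predicted by a model that never saw it
# (the analogue of the "cvpred" row).
cvpred <- numeric(n)
for (i in seq_len(k)) {
  test <- fold == i
  fold_fit <- lm(Sepal.Length ~ ., data = iris[!test, ])
  cvpred[test] <- predict(fold_fit, newdata = iris[test, ])
}

rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
rmse_in_sample <- rmse(predicted, iris$Sepal.Length)
rmse_cv <- rmse(cvpred, iris$Sepal.Length)
```

For ordinary least squares the out-of-fold error can never be smaller than the in-sample error, so rmse_cv is the honest number to report. To get the plot you describe, something like plot(iris$Sepal.Length, cvpred); abline(0, 1) puts the cross-validated predictions against the observed values.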

I believe that you are calling cv.lm incorrectly. I've never used the package, but I think you want to pass in the formula, something like cv.lm(df = cadets, RT..seconds.~., m = 10), rather than your fit object. I'm not sure why you see such a large difference between "cvpred" and "Predicted" in your example above, but the results below suggest that passing in a fitted model leads to a model that was fit on all of the data being used for every CV fold:

library(DAAG)
fit <- lm(Sepal.Length ~ ., data=iris)
mod1 <- cv.lm(df=iris,fit,m=10)
mod2 <- cv.lm(df=iris,Sepal.Length ~ .,m=10)
> sqrt(mean((mod1$cvpred - mod1$Sepal.Length)^2))
[1] 0.318
> sqrt(mean((mod2$cvpred - mod2$Sepal.Length)^2))
[1] 5.94
> sqrt(mean((mod1$cvpred - mod1$Predicted)^2))
[1] 0.0311
> sqrt(mean((mod2$cvpred - mod2$Predicted)^2))
[1] 5.94

The reason your caret results look so different is that you were comparing them with the "Predicted" section from cv.lm. "cvpred" should line up closely with caret's mod$pred (although make sure to match row indices between the two sets of CV results, and remember that the folds are drawn at random, so the numbers will not be identical). If you want caret output that lines up with "Predicted", you will need to get in-sample predictions using something like predict(mod, cadets).
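As a rough sketch of that comparison on the caret side, again using iris in place of cadets: the seed and the names cv, in_sample, cv_rmse and in_rmse are assumptions of mine, and I have spelled out savePredictions and number explicitly.

```r
library(caret)

set.seed(29)  # arbitrary seed so the folds are reproducible
ctrl <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
mod <- train(Sepal.Length ~ ., data = iris, method = "lm", trControl = ctrl)

# Hold-out predictions are stored fold by fold, so reorder them by rowIndex
# to line them up with the rows of the original data.
cv <- mod$pred[order(mod$pred$rowIndex), ]
cv_rmse <- sqrt(mean((cv$pred - cv$obs)^2))

# In-sample predictions from the final model, which caret refits on all rows.
in_sample <- predict(mod, newdata = iris)
in_rmse <- sqrt(mean((in_sample - iris$Sepal.Length)^2))
```

Here cv$pred plays the role of "cvpred" and in_sample plays the role of "Predicted", so compare like with like when putting the two packages side by side.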
