r - Caret's train and confusionMatrix functions

Question

I am trying to understand how caret works by following Max Kuhn's Applied Predictive Modeling book, but I cannot work out how caret's confusionMatrix function works.

I trained a glmnet model on the training data set (training[, fullSet]), which has 8190 rows and 1073 columns, as follows:

glmnGrid <- expand.grid(alpha = c(0, .1, .2, .4, .6, .8, 1),
                        lambda = seq(.01, .2, length = 40))

ctrl <- trainControl(method = "cv",
                     number = 10,
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE,
                     index = list(TrainSet = pre2008),
                     savePredictions = TRUE)

glmnFit <- train(x = training[, fullSet],
                 y = training$Class,
                 method = "glmnet",
                 tuneGrid = glmnGrid,
                 preProc = c("center", "scale"),
                 metric = "ROC",
                 trControl = ctrl)

Then I obtained the confusion matrix from the fit:

glmnetCM <- confusionMatrix(glmnFit, norm = "none")

When I look at the confusion matrix, I get the following result:

               Reference
Prediction     successful unsuccessful
  successful          507          208
  unsuccessful         63          779

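However, I do not understand why the confusion table has only 1557 observations (1557 = 507 + 208 + 63 + 779), because the documentation for caret's confusionMatrix.train says "When train is used for tuning a model, it tracks the confusion matrix cell entries for the hold-out samples." Since the training data set has 8190 rows and I used 10-fold CV, I thought the confusion matrix would be based on 819 data points (819 = 8190 / 10), which is not the case.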

Clearly I do not fully understand how caret's trainControl or train works. Can someone explain what I have misunderstood?

Thank you very much for your help.

李英进

score 3 · Accepted Answer

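The problem is with the control parameters. You are using method = "cv" and number = 10, but you are also specifying the exact resamples that will be used to fit the model (via the index argument). I assume this is the grant data from the book. In Chapter 12 we describe the data-splitting scheme, where the pre2008 vector indicates that 6,633 of the 8,190 samples will be used for training. That leaves 1,557 held out during model tuning: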

> dim(training)
[1] 8190 1785
> length(pre2008)
[1] 6633
> 8190-6633
[1] 1557
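The same count can be tied back to the table in the question. The sketch below uses only base R and a simulated stand-in for pre2008 (the name pre2008_demo and the random draw are assumptions for illustration; only the sizes match the book's data). It relies on the documented behaviour that, when indexOut is not supplied, the hold-out set is the set of rows not listed in index.

```r
# Stand-in for the book's data: only the sizes match the figures quoted above.
n_total <- 8190                               # rows in `training`
set.seed(1)
pre2008_demo <- sort(sample(n_total, 6633))   # hypothetical substitute for `pre2008`

# With index = list(TrainSet = pre2008) and no indexOut, the rows that are
# not listed in the index form the single hold-out set.
holdout <- setdiff(seq_len(n_total), pre2008_demo)
length(holdout)   # 1557

# Cell counts copied from the confusion matrix in the question
# (filled column by column, so each column is one Reference class).
cm <- matrix(c(507, 63, 208, 779), nrow = 2,
             dimnames = list(Prediction = c("successful", "unsuccessful"),
                             Reference  = c("successful", "unsuccessful")))
sum(cm)           # 1557
```

Both numbers agree, which is why the table covers 1,557 rows rather than 819.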

The predictions on the non-pre2008 samples are what you are seeing in the table. If you want to reproduce what we had, page 312 has the right syntax:

ctrl <- trainControl(method = "LGOCV",
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE,
                     index = list(TrainSet = pre2008))

If you just want to do 10-fold CV, get rid of the index argument.
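A minimal sketch of that variant, assuming the other settings from the question are kept unchanged (the name ctrl_cv is only illustrative):

```r
library(caret)

# Plain 10-fold CV: no `index`, so caret generates the folds itself.
ctrl_cv <- trainControl(method = "cv",
                        number = 10,
                        summaryFunction = twoClassSummary,
                        classProbs = TRUE,
                        savePredictions = TRUE)
```

Passing trControl = ctrl_cv to the train call from the question would then resample over ten folds. Since each row is held out exactly once across those folds, the counts from confusionMatrix(glmnFit, norm = "none") should then add up to all 8,190 rows rather than 819.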

tl;dr: the control function says 10-fold CV, but the index argument says that a single hold-out of 1,557 samples should be used.

Max
