r - How to build a custom model in caret to perform a two-step PLS-[Classifier] classification model?

Question

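This question is a continuation of the same thread here. Below is a minimal working example from this book:

Wehrens R. Chemometrics with R: Multivariate Data Analysis in the Natural Sciences and Life Sciences. 1st edition. Heidelberg; New York: Springer. 2011. (page 250).

The example is taken from this book and its package ChemometricsWithR. It highlights some pitfalls of modeling with cross-validation techniques.

Aim:
A cross-validated methodology, using the same set of repeated CV folds, that carries out the well-known strategy of PLS followed by LDA or one of its cousins such as logistic regression, SVM, C5.0 or CART, in the spirit of the caret package. PLS would therefore be needed before every call to the waiting classifier, so that the classifier works on the PLS score space rather than on the observations themselves. The closest approach in the caret package is to run PCA as a pre-processing step before modeling with any classifier. Below is a PLS-LDA procedure with only a single training/holdout split to test the performance of the classifier; there is no 10-fold CV and no repetition. The code below was taken from the book mentioned above, with a few corrections without which it throws errors: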

library(ChemometricsWithR)
data(prostate)
prostate.clmat <- classvec2classmat(prostate.type) # convert Y to a dummy var

odd <- seq(1, length(prostate.type), by = 2) # training
even <- seq(2, length(prostate.type), by = 2) # holdout test

prostate.pls <- plsr(prostate.clmat ~ prostate, ncomp = 16, validation = "CV", subset=odd)

Xtst <- scale(prostate[even,], center = colMeans(prostate[odd,]), scale = apply(prostate[odd,],2,sd))

tst.scores <- Xtst %*% prostate.pls$projection # scores for the waiting trained LDA to test

prostate.ldapls <- lda(scores(prostate.pls)[,1:16],prostate.type[odd]) # LDA for scores
table(predict(prostate.ldapls, new = tst.scores[,1:16])$class, prostate.type[even])

predictionTest <- predict(prostate.ldapls, new = tst.scores[,1:16])$class # predicted classes for the holdout set

library(caret)    
confusionMatrix(data = predictionTest, reference= prostate.type[even]) # from caret

Output:

Confusion Matrix and Statistics

          Reference
Prediction bph control pca
   bph       4       1   9
   control   1      35   7
   pca      34       4  68

Overall Statistics

               Accuracy : 0.6564          
                 95% CI : (0.5781, 0.7289)
    No Information Rate : 0.5153          
    P-Value [Acc > NIR] : 0.0001874       

                  Kappa : 0.4072          
 Mcnemar's Test P-Value : 0.0015385       

Statistics by Class:

                     Class: bph Class: control Class: pca
Sensitivity             0.10256         0.8750     0.8095
Specificity             0.91935         0.9350     0.5190
Pos Pred Value          0.28571         0.8140     0.6415
Neg Pred Value          0.76510         0.9583     0.7193
Prevalence              0.23926         0.2454     0.5153
Detection Rate          0.02454         0.2147     0.4172
Detection Prevalence    0.08589         0.2638     0.6503
Balanced Accuracy       0.51096         0.9050     0.6643
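
As a sanity check on the output above, the overall accuracy and kappa can be recomputed by hand from the printed confusion matrix. This is only a sketch in base R; the names tab, accuracy, expected and kappa_hat are illustrative and not part of caret.

```r
# Confusion matrix copied from the output above (rows = predictions, columns = reference)
classes <- c("bph", "control", "pca")
tab <- matrix(c( 4,  1,  9,
                 1, 35,  7,
                34,  4, 68),
              nrow = 3, byrow = TRUE,
              dimnames = list(Prediction = classes, Reference = classes))

n <- sum(tab)

# Observed agreement: share of samples on the diagonal
accuracy <- sum(diag(tab)) / n

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(tab) * colSums(tab)) / n^2

# Cohen's kappa: agreement beyond chance, rescaled to at most 1
kappa_hat <- (accuracy - expected) / (1 - expected)

round(c(Accuracy = accuracy, Kappa = kappa_hat), 4)
# Accuracy    Kappa
#   0.6564   0.4072
```

Both values agree with the Accuracy and Kappa lines reported by confusionMatrix.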

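However, the confusion matrix does not match the one in the book. In any case, the code in the book is broken, whereas the version here works for me!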

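Notes:
Although this is only a single split, the intention is to agree on the methodology first: the sd and mean of the training set are applied to the test set, PLUS the test set is transformed into PLS scores based on a specific number of components, ncomp. I want this to happen in every round of CV in caret. If the methodology as coded here is correct, it can serve as a good start for a minimal working example while modifying the code of the caret package.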
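
The first step in that note (applying the training-set mean and sd to the test set) can be sketched in base R as follows. The data are simulated and the names train, test, ctr and sds are illustrative assumptions; the sketch relies only on the "scaled:center" and "scaled:scale" attributes that scale() attaches to its result.

```r
set.seed(1)
# Simulated stand-ins for a training and a holdout matrix (2 predictors)
train <- matrix(rnorm(40, mean = 5, sd = 2), ncol = 2)
test  <- matrix(rnorm(20, mean = 5, sd = 2), ncol = 2)

# Center and scale the training set; scale() records the statistics it used
train_sc <- scale(train, center = TRUE, scale = TRUE)
ctr <- attr(train_sc, "scaled:center")
sds <- attr(train_sc, "scaled:scale")

# Apply the TRAINING statistics to the test set (never the test set's own)
test_sc <- scale(test, center = ctr, scale = sds)

# Equivalent manual computation, for comparison
manual <- sweep(sweep(test, 2, ctr, "-"), 2, sds, "/")
isTRUE(all.equal(as.vector(test_sc), as.vector(manual)))
```

The scaled test matrix would then be projected onto the PLS directions estimated from the training set, as in the prostate example above.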

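Side notes:
Scaling and centering can get very messy. I think some of the PLS functions in R scale internally, with or without centering (I am not sure), so a custom model in caret should be built with care to avoid missing or repeated scaling or centering (I am wary of these things).

Dangers of multiple centering/scaling
The code below just shows how repeated centering/scaling can change the data. Only centering is shown here, but the same concern applies to scaling.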

set.seed(1)
x <- rnorm(200, 2, 1)
xCentered1 <- scale(x, center=TRUE, scale=FALSE)
xCentered2 <- scale(xCentered1, center=TRUE, scale=FALSE)
xCentered3 <- scale(xCentered2, center=TRUE, scale=FALSE)
sapply (list(xNotCentered= x, xCentered1 = xCentered1, xCentered2 = xCentered2, xCentered3 = xCentered3), mean)

Output:

xNotCentered    xCentered1    xCentered2    xCentered3 
 2.035540e+00  1.897798e-16 -5.603699e-18 -5.332377e-18
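
The centered means printed above are all of the order of 1e-16 or smaller. To gauge how much the data themselves move, the singly and doubly centered vectors can be compared element by element. This is a minimal sketch; the names once, twice and max_shift are illustrative.

```r
set.seed(1)
x <- rnorm(200, 2, 1)

once  <- scale(x,    center = TRUE, scale = FALSE)
twice <- scale(once, center = TRUE, scale = FALSE)

# Largest element-wise change introduced by the second centering
max_shift <- max(abs(once - twice))

# Are the two versions equal up to numerical tolerance?
same_within_tolerance <- isTRUE(all.equal(as.vector(once), as.vector(twice)))

max_shift
same_within_tolerance
```

In this sketch the second centering should only move values at the level of floating-point rounding. The larger practical risk is probably centering or scaling with the wrong reference statistics, for example those of the test set instead of the training set.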

If I have missed something anywhere along the way, please comment. Thanks.

score 8 · Accepted Answer

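If you want to fit these types of models with caret, you will need to use the latest version on CRAN. The last update was created so that people can use non-standard models as they see fit.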

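My approach below is to fit the PLS and the other model jointly (I used random forest in the example below) and to tune them at the same time. So, for each fold, a 2D grid of ncomp and mtry is used.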
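
To make the 2D grid concrete, here is a small base R sketch of how such a grid can be built. The helper name make_grid and its arguments are hypothetical. The constraint mtry <= ncomp reflects that a forest fitted on ncomp score columns has only ncomp predictors to sample from.

```r
# Hypothetical helper: candidate (ncomp, mtry) pairs for a given tuneLength
make_grid <- function(n_predictors, len) {
  grid <- expand.grid(ncomp = seq(1, min(n_predictors - 1, len), by = 1),
                      mtry  = seq_len(len))
  # Keep only combinations where mtry does not exceed the number of scores
  subset(grid, mtry <= ncomp)
}

grid <- make_grid(n_predictors = 10, len = 3)
grid
nrow(grid)  # 6 combinations: 1 for ncomp = 1, 2 for ncomp = 2, 3 for ncomp = 3
```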

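The "trick" is to attach the PLS loadings to the random forest object so that they can be used at prediction time. Here is the code that defines the model (classification only):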

 modelInfo <- list(label = "PLS-RF",
              library = c("pls", "randomForest"),
              type = "Classification",
              parameters = data.frame(parameter = c('ncomp', 'mtry'),
                                      class = c("numeric", 'numeric'),
                                      label = c('#Components', 
                                                '#Randomly Selected Predictors')),
              grid = function(x, y, len = NULL) {
                grid <- expand.grid(ncomp = seq(1, min(ncol(x) - 1, len), by = 1),
                            mtry = 1:len)
                grid <- subset(grid, mtry <= ncomp)
                },
              loop = NULL,
              fit = function(x, y, wts, param, lev, last, classProbs, ...) { 
                     ## First fit the pls model, generate the training set scores,
                     ## then attach what is needed to the random forest object to 
                     ## be used later
                     pre <- plsda(x, y, ncomp = param$ncomp)
                     scores <- pls:::predict.mvr(pre, x, type = "scores")
                     mod <- randomForest(scores, y, mtry = param$mtry, ...)
                     mod$projection <- pre$projection
                     mod
                   },
                   predict = function(modelFit, newdata, submodels = NULL) {       
                     scores <- as.matrix(newdata)  %*% modelFit$projection
                     predict(modelFit, scores)
                   },
                   prob = NULL,
                   varImp = NULL,
                   predictors = function(x, ...) rownames(x$projection),
                   levels = function(x) x$obsLevels,
                   sort = function(x) x[order(x[,1]),])
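
The attach-and-reuse pattern used in the fit and predict functions above can be tried in isolation. The sketch below is not the model definition itself: it swaps PLS for PCA (prcomp) and the random forest for MASS::lda so that it runs with packages shipped with R, and the function names fit_projected_lda and predict_projected_lda are hypothetical. Unlike the code above, it also stores the centering vector on the model object instead of relying on caret's preProc.

```r
library(MASS)  # lda()

# Fit: estimate a projection, classify the scores, attach the projection
fit_projected_lda <- function(x, y, ncomp) {
  x <- as.matrix(x)
  pca <- prcomp(x, center = TRUE, scale. = FALSE)   # stand-in for the PLS step
  proj <- pca$rotation[, seq_len(ncomp), drop = FALSE]
  scores <- sweep(x, 2, pca$center, "-") %*% proj
  mod <- lda(scores, grouping = y)
  mod$projection <- proj        # carried along for prediction time
  mod$center <- pca$center
  mod
}

# Predict: project new data with the stored quantities, then classify
predict_projected_lda <- function(mod, newdata) {
  scores <- sweep(as.matrix(newdata), 2, mod$center, "-") %*% mod$projection
  predict(mod, scores)$class
}

mod <- fit_projected_lda(iris[, 1:4], iris$Species, ncomp = 2)
pred <- predict_projected_lda(mod, iris[, 1:4])
mean(pred == iris$Species)  # resubstitution accuracy, optimistic by construction
```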

Here is the call to train:

 library(caret)  # createDataPartition(), train(), trainControl()
 library(ChemometricsWithR)
 data(prostate)

 set.seed(1)
 inTrain <- createDataPartition(prostate.type, p = .90)
 trainX <-prostate[inTrain[[1]], ]
 trainY <- prostate.type[inTrain[[1]]]
 testX <-prostate[-inTrain[[1]], ]
 testY <- prostate.type[-inTrain[[1]]]

 ## These will take a while for these data
 set.seed(2)
 plsrf <- train(trainX, trainY, method = modelInfo,
                preProc = c("center", "scale"),
                tuneLength = 10,
                trControl = trainControl(method = "repeatedcv",
                                         repeats = 5))

 ## How does random forest do on its own?
 set.seed(2)
 rfOnly <- train(trainX, trainY, method = "rf",
                tuneLength = 10,
                trControl = trainControl(method = "repeatedcv",
                                         repeats = 5))

Just for kicks, I got:

 > getTrainPerf(plsrf)
   TrainAccuracy TrainKappa method
 1     0.7940423    0.65879 custom
 > getTrainPerf(rfOnly)
   TrainAccuracy TrainKappa method
 1     0.7794082  0.6205322     rf

and

 > postResample(predict(plsrf, testX), testY)
  Accuracy     Kappa 
 0.7741935 0.6226087 
 > postResample(predict(rfOnly, testX), testY)
  Accuracy     Kappa 
 0.9032258 0.8353982

Max

score 4 · Accepted Answer

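Based on Max's valuable comments, I felt it necessary to bring in the iris data set as a referee. It is famous for classification and, more importantly, its Species outcome has more than two classes, which makes it a good data set for testing the PLS-LDA custom model in caret: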

data(iris)
names(iris)
head(iris)
dim(iris) # 150x5
set.seed(1)
inTrain <- createDataPartition(y = iris$Species,
                               ## the outcome data are needed
                               p = .75,
                               ## The percentage of data in the
                               ## training set
                               list = FALSE)
## The format of the results
## The output is a set of integers for the rows of Iris
## that belong in the training set.
training <- iris[ inTrain,] # 114
testing <- iris[-inTrain,] # 36

ctrl <- trainControl(method = "repeatedcv",
                     repeats = 5,
                     classProbs = TRUE)
set.seed(2)
plsFitIris <- train(Species ~ .,
                   data = training,
                   method = "pls",
                   tuneLength = 4,
                   trControl = ctrl,
                   preProc = c("center", "scale"))
plsFitIris
plot(plsFitIris)


set.seed(2)
plsldaFitIris <- train(Species ~ .,
                      data = training,
                      method = modelInfo,
                      tuneLength = 4,
                      trControl = ctrl,
                      preProc = c("center", "scale"))

plsldaFitIris
plot(plsldaFitIris)

Now, comparing the two models:

getTrainPerf(plsFitIris)
  TrainAccuracy TrainKappa method
1     0.8574242  0.7852462    pls
getTrainPerf(plsldaFitIris)
  TrainAccuracy TrainKappa method
1      0.975303  0.9628179 custom
postResample(predict(plsFitIris, testing), testing$Species)
Accuracy    Kappa 
   0.750    0.625 
postResample(predict(plsldaFitIris, testing), testing$Species)
 Accuracy     Kappa 
0.9444444 0.9166667

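So, finally, there is the EXPECTED difference, along with an improvement in the metrics. This supports Max's point that, for two-class problems, both models lead to the same results because of the Bayesian probabilistic approach of the plsda function.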

score 3 · Accepted Answer

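You need to wrap the CV around both PLS and LDA.
Yes, both plsr and lda center the data in their own way.
I had a closer look at caret::preProcess(): as it is defined now, you will not be able to use PLS as a preprocessing method because PLS is supervised, whereas caret::preProcess() uses only unsupervised methods (there is no way to hand over the dependent variable). This would probably make patching rather difficult.
So, within the caret framework, you will need to go for a custom model.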
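
Wrapping the CV around both steps means that the supervised projection and the classifier are refitted on the training part of every fold. The sketch below shows this with a hand-written fold loop in base R plus MASS. It is not the pls package algorithm: the projection is a simplified supervised stand-in (leading left singular vectors of the cross-product between centered X and dummy-coded Y), and the names fit_two_step and predict_two_step are hypothetical.

```r
library(MASS)  # lda()

# Step 1 + step 2, both estimated from the training rows only
fit_two_step <- function(x, y, ncomp) {
  ctr <- colMeans(x)
  xc <- sweep(x, 2, ctr, "-")
  # Dummy-code the classes and center the columns
  ydummy <- scale(model.matrix(~ y - 1), center = TRUE, scale = FALSE)
  # Supervised directions: simplified stand-in for PLS weights
  w <- svd(crossprod(xc, ydummy))$u[, seq_len(ncomp), drop = FALSE]
  list(center = ctr, w = w, lda = lda(xc %*% w, grouping = y))
}

# Held-out rows are centered and projected with the training-fold quantities
predict_two_step <- function(fit, newx) {
  scores <- sweep(newx, 2, fit$center, "-") %*% fit$w
  predict(fit$lda, scores)$class
}

set.seed(1)
x <- as.matrix(iris[, 1:4])
y <- iris$Species
k <- 5
folds <- sample(rep(seq_len(k), length.out = nrow(x)))

acc <- sapply(seq_len(k), function(i) {
  held <- folds == i
  fit <- fit_two_step(x[!held, , drop = FALSE], y[!held], ncomp = 2)
  mean(predict_two_step(fit, x[held, , drop = FALSE]) == y[held])
})

mean(acc)  # cross-validated accuracy of the whole two-step pipeline
```

A custom caret model plays the same role: because the projection is estimated inside the fit function, train() repeats it for every resample automatically.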

score 0 · Accepted Answer

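If the scenario is to customize a PLS-LDA type of model based on the code provided by Max (the maintainer of caret), then something in this code is not correct, but I could not figure out what. I used the Sonar data set in the same way as the caret vignette, reproducing the result once with method="pls" and once with the PLS-LDA custom model below, and the results were exactly identical, down to the last digit, which makes no sense. For benchmarking, one needs a known data set (I think a cross-validated PLS-LDA on the iris data set fits here, since iris is famous for this type of analysis and a cross-validated treatment of it should exist somewhere). Everything except the code in question should be the same (set.seed(xxx) and the number of K-fold CV repetitions), so that the code below can be compared and judged properly: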

modelInfo <- list(label = "PLS-LDA",
                  library = c("pls", "MASS"),
                  type = "Classification",
                  parameters = data.frame(parameter = c("ncomp"),
                                          class = c("numeric"),
                                          label = c("#Components")),
                  grid = function(x, y, len = NULL) {
                    grid <- expand.grid(ncomp = seq(1, min(ncol(x) - 1, len), by = 1))
                  },
                  loop = NULL,
                  fit = function(x, y, wts, param, lev, last, classProbs, ...) { 
                    ## First fit the pls model, generate the training set scores,
                    ## then attach what is needed to the lda object to 
                    ## be used later
                    pre <- plsda(x, y, ncomp = param$ncomp)
                    scores <- pls:::predict.mvr(pre, x, type = "scores")
                    mod <- lda(scores, y, ...)
                    mod$projection <- pre$projection
                    mod
                  },
                  predict = function(modelFit, newdata, submodels = NULL) {       
                    scores <- as.matrix(newdata)  %*% modelFit$projection
                    predict(modelFit, scores)$class
                  },
                  prob = function(modelFit, newdata, submodels = NULL) {       
                    scores <- as.matrix(newdata)  %*% modelFit$projection
                    predict(modelFit, scores)$posterior
                  },
                  varImp = NULL,
                  predictors = function(x, ...) rownames(x$projection),
                  levels = function(x) x$obsLevels,
                  sort = function(x) x[order(x[,1]),])

As per Zach's request, the code below uses method="pls" in caret and is exactly the same as the concrete example in the caret vignette on CRAN:

library(mlbench) # data set from here
data(Sonar)
dim(Sonar) # 208x60
set.seed(107)
inTrain <- createDataPartition(y = Sonar$Class,
                               ## the outcome data are needed
                               p = .75,
                               ## The percentage of data in the
                               ## training set
                               list = FALSE)
## The format of the results
## The output is a set of integers for the rows of Sonar
## that belong in the training set.
training <- Sonar[ inTrain,] #157
testing <- Sonar[-inTrain,] # 51

ctrl <- trainControl(method = "repeatedcv",
                     repeats = 3,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)
set.seed(108)
plsFitSon <- train(Class ~ .,
                data = training,
                method = "pls",
                tuneLength = 15,
                trControl = ctrl,
                metric = "ROC",
                preProc = c("center", "scale"))
plsFitSon
plot(plsFitSon) # might be slightly different from what is in the vignette due to randomness

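Now, the code below is a pilot run that classifies the Sonar data with the custom model in question, PLS-LDA, which is expected to produce any numbers except the ones obtained with PLS alone: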

set.seed(108)
plsldaFitSon <- train(Class ~ .,
                   data = training,
                   method = modelInfo,
                   tuneLength = 15,
                   trControl = ctrl,
                   metric = "ROC",
                   preProc = c("center", "scale"))

Now, comparing the results between the two models:

getTrainPerf(plsFitSon)
   TrainROC TrainSens TrainSpec method
1 0.8741154 0.7638889 0.8452381    pls
getTrainPerf(plsldaFitSon)
   TrainROC TrainSens TrainSpec method
1 0.8741154 0.7638889 0.8452381 custom

postResample(predict(plsFitSon, testing), testing$Class)
Accuracy    Kappa 
0.745098 0.491954 
postResample(predict(plsldaFitSon, testing), testing$Class)
Accuracy    Kappa 
0.745098 0.491954

So, the results are exactly the same, which should not be possible. It is as if the lda model had not been added at all?
