r - sbf() 是否使用 metric 参数来优化模型？

Question

ROC作为metric参数值传递给caretSBF函数

我们的目标是使用 ROC 汇总指标进行模型选择，同时运行通过过滤选择sbf()功能进行特征选择。

该BreastCancer数据集被用作从mlbench包到运行train()以及sbf()使用metric = "Accuracy"和使用的可重复示例metric = "ROC"

我们要确保sbf()采用and函数metric应用的参数来优化模型。为此，我们计划用函数来使用函数。该函数调用，并传递给。train()rfe()train()sbf()caretSBF$fittrain()caretSBFsbfControl

从输出来看，该metric参数似乎仅用于该部分，inner resampling而不用于该sbf部分，即对于outer resampling输出的，该metric参数未应用于train()and rfe()。

正如我们所使用caretSBF的 which uses train()，看来metric参数的范围仅限于train()，因此不会传递给sbf。

我们希望澄清是否sbf()使用metric参数来优化模型，即outer resampling？

这是我们在可重现示例上的工作，使用and显示train()使用metric参数，但我们不确定。AccuracyROCsbf

一、数据部分

  ## Loading required packages   
  library(mlbench)
  library(caret)

  ## Loading `BreastCancer` Dataset from *mlbench* package   
  data("BreastCancer")

  ## Data cleaning for missing values
  # Remove rows/observation with NA Values in any of the columns
  BrC1 <- BreastCancer[complete.cases(BreastCancer),] 

  # Removing Class and Id Column and keeping just Numeric Predictors
  Num_Pred <- BrC1[,2:10]

二、自定义汇总功能

定义fiveStats汇总函数

  fiveStats <- function(...) c(twoClassSummary(...),
                         defaultSummary(...))

三、火车部分

定义 trControl

  trCtrl <- trainControl(method="repeatedcv", number=10,
  repeats=1, classProbs = TRUE, summaryFunction = fiveStats)

火车 + 公制 = “准确度”

   set.seed(1)
   TR_acc <- train(Num_Pred,BrC1$Class, method="rf",metric="Accuracy",
   trControl = trCtrl,tuneGrid=expand.grid(.mtry=c(2,3,4,5)))

   TR_acc
   # Random Forest 
   # 
   # 683 samples
   #   9 predictor
   #   2 classes: 'benign', 'malignant' 
   # 
   # No pre-processing
   # Resampling: Cross-Validated (10 fold, repeated 1 times) 
   # Summary of sample sizes: 615, 615, 614, 614, 614, 615, ... 
   # Resampling results across tuning parameters:
   # 
   #   mtry  ROC        Sens       Spec       Accuracy   Kappa    
   #   2     0.9936532  0.9729798  0.9833333  0.9765772  0.9490311
   #   3     0.9936544  0.9729293  0.9791667  0.9750853  0.9457534
   #   4     0.9929957  0.9684343  0.9750000  0.9706948  0.9361373
   #   5     0.9922907  0.9684343  0.9666667  0.9677536  0.9295782
   # 
   # Accuracy was used to select the optimal model using  the largest value.
   # The final value used for the model was mtry = 2.

火车+公制=“ROC”

   set.seed(1)
   TR_roc <- train(Num_Pred,BrC1$Class, method="rf",metric="ROC",
   trControl = trCtrl,tuneGrid=expand.grid(.mtry=c(2,3,4,5)))
   TR_roc
   # Random Forest 
   # 
   # 683 samples
   #   9 predictor
   #   2 classes: 'benign', 'malignant' 
   # 
   # No pre-processing
   # Resampling: Cross-Validated (10 fold, repeated 1 times) 
   # Summary of sample sizes: 615, 615, 614, 614, 614, 615, ... 
   # Resampling results across tuning parameters:
   # 
   #   mtry  ROC        Sens       Spec       Accuracy   Kappa    
   #   2     0.9936532  0.9729798  0.9833333  0.9765772  0.9490311
   #   3     0.9936544  0.9729293  0.9791667  0.9750853  0.9457534
   #   4     0.9929957  0.9684343  0.9750000  0.9706948  0.9361373
   #   5     0.9922907  0.9684343  0.9666667  0.9677536  0.9295782
   # 
   # ROC was used to select the optimal model using  the largest value.
   # The final value used for the model was mtry = 3.

四。编辑插入符号SBF

编辑 caretSBF 汇总函数

   caretSBF$summary <- fiveStats

五、SBF部分

定义 sbfControl

   sbfCtrl <- sbfControl(functions=caretSBF, 
   method="repeatedcv", number=10, repeats=1,
   verbose=T, saveDetails = T)

SBF + METRIC =“准确度”

   set.seed(1)
   sbf_acc <- sbf(Num_Pred, BrC1$Class,
   sbfControl = sbfCtrl,
   trControl = trCtrl, method="rf", metric="Accuracy")

   ## sbf_acc  
   sbf_acc

   # Selection By Filter
   # 
   # Outer resampling method: Cross-Validated (10 fold, repeated 1 times) 
   # 
   # Resampling performance:
   # 
   #     ROC  Sens   Spec Accuracy Kappa    ROCSD SensSD  SpecSD AccuracySD  KappaSD
   #  0.9931 0.973 0.9833   0.9766 0.949 0.006272 0.0231 0.02913    0.01226 0.02646
   # 
   # Using the training set, 9 variables were selected:
   #    Cl.thickness, Cell.size, Cell.shape, Marg.adhesion, Epith.c.size...
   # 
   # During resampling, the top 5 selected variables (out of a possible 9):
   #    Bare.nuclei (100%), Bl.cromatin (100%), Cell.shape (100%), Cell.size (100%), Cl.thickness (100%)
   # 
   # On average, 9 variables were selected (min = 9, max = 9)

   ## Class of sbf_acc
   class(sbf_acc)
   # [1] "sbf"

   ## Names of elements of sbf_acc
   names(sbf_acc)
   #  [1] "pred"         "variables"    "results"      "fit"          "optVariables"
   #  [6] "call"         "control"      "resample"     "metrics"      "times"       
   # [11] "resampledCM"  "obsLevels"    "dots"        

   ## sbf_acc fit element*  
   sbf_acc$fit
   # Random Forest 
   # 
   # 683 samples
   #   9 predictor
   #   2 classes: 'benign', 'malignant' 
   # 
   # No pre-processing
   # Resampling: Cross-Validated (10 fold, repeated 1 times) 
   # Summary of sample sizes: 615, 614, 614, 615, 615, 615, ... 
   # Resampling results across tuning parameters:
   # 
   #   mtry  ROC        Sens       Spec       Accuracy   Kappa    
   #   2     0.9933176  0.9706566  0.9833333  0.9751492  0.9460717
   #   5     0.9920034  0.9662121  0.9791667  0.9707801  0.9363708
   #   9     0.9914825  0.9684343  0.9708333  0.9693308  0.9327662
   # 
   # Accuracy was used to select the optimal model using  the largest value.
   # The final value used for the model was mtry = 2. 

   ##  Elements of sbf_acc fit  
   names(sbf_acc$fit)
   #  [1] "method"       "modelInfo"    "modelType"    "results"      "pred"        
   #  [6] "bestTune"     "call"         "dots"         "metric"       "control"     
   # [11] "finalModel"   "preProcess"   "trainingData" "resample"     "resampledCM" 
   # [16] "perfNames"    "maximize"     "yLimits"      "times"        "levels"      

   ## sbf_acc fit final Model
   sbf_acc$fit$finalModel

   # Call:
   #  randomForest(x = x, y = y, mtry = param$mtry) 
   #                Type of random forest: classification
   #                      Number of trees: 500
   # No. of variables tried at each split: 2
   # 
   #         OOB estimate of  error rate: 2.34%
   # Confusion matrix:
   #           benign malignant class.error
   # benign       431        13  0.02927928
   # malignant      3       236  0.01255230

   ## sbf_acc metric
   sbf_acc$fit$metric
   # [1] "Accuracy"

   ## sbf_acc fit best Tune*  
   sbf_acc$fit$bestTune
   #   mtry
   # 1    2

SBF + 公制 = "ROC"

   set.seed(1)
   sbf_roc <- sbf(Num_Pred, BrC1$Class,
   sbfControl = sbfCtrl,
   trControl = trCtrl, method="rf", metric="ROC")


   ## sbf_roc  
   sbf_roc

   # Selection By Filter
   # 
   # Outer resampling method: Cross-Validated (10 fold, repeated 1 times) 
   # 
   # Resampling performance:
   # 
   #     ROC  Sens   Spec Accuracy Kappa    ROCSD SensSD  SpecSD AccuracySD KappaSD
   #  0.9931 0.973 0.9833   0.9766 0.949 0.006272 0.0231 0.02913    0.01226 0.02646
   # 
   # Using the training set, 9 variables were selected:
   #    Cl.thickness, Cell.size, Cell.shape, Marg.adhesion, Epith.c.size...
   # 
   # During resampling, the top 5 selected variables (out of a possible 9):
   #    Bare.nuclei (100%), Bl.cromatin (100%), Cell.shape (100%), Cell.size (100%), Cl.thickness (100%)
   # 
   # On average, 9 variables were selected (min = 9, max = 9)

   ## Class of sbf_roc
   class(sbf_roc)
   # [1] "sbf"

   ## Names of elements of sbf_roc
   names(sbf_roc)
   #  [1] "pred"         "variables"    "results"      "fit"          "optVariables"
   #  [6] "call"         "control"      "resample"     "metrics"      "times"       
   # [11] "resampledCM"  "obsLevels"    "dots"        

   ## sbf_roc fit element*  
   sbf_roc$fit
   # Random Forest 
   # 
   # 683 samples
   #   9 predictor
   #   2 classes: 'benign', 'malignant' 
   # 
   # No pre-processing
   # Resampling: Cross-Validated (10 fold, repeated 1 times) 
   # Summary of sample sizes: 615, 614, 614, 615, 615, 615, ... 
   # Resampling results across tuning parameters:
   # 
   #   mtry  ROC        Sens       Spec       Accuracy   Kappa    
   #   2     0.9933176  0.9706566  0.9833333  0.9751492  0.9460717
   #   5     0.9920034  0.9662121  0.9791667  0.9707801  0.9363708
   #   9     0.9914825  0.9684343  0.9708333  0.9693308  0.9327662
   # 
   # ROC was used to select the optimal model using  the largest value.
   # The final value used for the model was mtry = 2. 

   ##  Elements of sbf_roc fit  
   names(sbf_roc$fit)
   #  [1] "method"       "modelInfo"    "modelType"    "results"      "pred"        
   #  [6] "bestTune"     "call"         "dots"         "metric"       "control"     
   # [11] "finalModel"   "preProcess"   "trainingData" "resample"      "resampledCM" 
   # [16] "perfNames"    "maximize"     "yLimits"      "times"        "levels"      

   ## sbf_roc fit final Model
   sbf_roc$fit$finalModel

   # Call:
   #  randomForest(x = x, y = y, mtry = param$mtry) 
   #                Type of random forest: classification
   #                      Number of trees: 500
   # No. of variables tried at each split: 2
   # 
   #         OOB estimate of  error rate: 2.34%
   # Confusion matrix:
   #           benign malignant class.error
   # benign       431        13  0.02927928
   # malignant      3       236  0.01255230

   ## sbf_roc metric
   sbf_roc$fit$metric
   # [1] "ROC"

   ## sbf_roc fit best Tune  
   sbf_roc$fit$bestTune
   #   mtry
   # 1    2

是否sbf()使用metric参数来优化模型？如果是，默认使用什么metric？sbf()如果sbf()使用metric参数，那么如何将其设置为ROC？

谢谢。

score 3 · Accepted Answer

sbf不使用度量来优化任何东西（不像rfe）；所做sbf的只是在调用模型之前做一个特征选择步骤。当然，您定义了过滤器，但无法使用它来调整过滤器，sbf因此不需要指标来指导该步骤。

Usingsbf(x, y, metric = "ROC")将传递metric = "ROC"给您正在使用的任何建模函数（并且它设计为在使用train时caretSBF使用。发生这种情况是因为没有metric参数sbf：

> names(formals(caret:::sbf.default))
[1] "x"          "y"          "sbfControl" "..."

r - sbf() 是否使用 metric 参数来优化模型？

1 回答 1

Related

Reference