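I am using the excellent R package caret, and I would like to run the `train` function over a list of several training data sets. I realize the documentation for `train` says the `data` argument must be a data frame, so what I am attempting may simply not be possible, and might be better suggested as an enhancement to caret, but I wanted to see whether anyone has tried this.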

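Using the Sonar data for illustration, I created a list (named `both`) of two data frames, each a separate training data set. I then used `mapply` to apply `train` to each element of the list. Unfortunately, I get strange results. Specifically, I expected the metrics in `pls1.3..A[[2]]` to be identical to those in `pls1.3..B2`. As you can see, they are not. Oddly, `pls1.3..A[[1]]` does match `pls1.3..B1`. Is there something obvious I am doing wrong, or is this perhaps not possible (yet)? (I am running R 3.1.1 on a 1.4 GHz Intel Core i5 Mac.)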

Reproducible code (with the output commented out) follows:

    require(doMC)
    registerDoMC(cores = 2) 

    library(caret) 
    library(mlbench) 
    data(Sonar) 
    set.seed(1234) 
    inTrain <- createDataPartition(y = Sonar$Class,
                                   p = .75,
                                   list = FALSE)

    training <- Sonar[ inTrain,] 
    training2  <- Sonar[-inTrain,] 

    both <- list(training, training2) 
    #both_test <- list(training[c(1:100),], training2[c(1:35),]) #SILLY test data for functionality testing only 

    set.seed(1234) 

    # Pull the outcome (Class) vector out of each training set
    labels <- lapply(both, function(x) x$Class)

    #NEW CODE -- ADDED BASED ON @Josh W's comment -- removing the label (Class) variable from the feature matrix
    both <- lapply(both, function(x) x[, 1:60])

    #NEW CODE -- changed from using the formula implementation of caret to the x (feature matrix), y (label/outcome vector)

    pls1.3..A <- mapply(function(x,y) train(x, y, method = "pls", preProc = c("center", "scale")), x = both, y = labels, SIMPLIFY = FALSE) 
    pls1.3..A 

    #[[1]]
    #Partial Least Squares 

    #157 samples
    # 60 predictor
    #  2 classes: 'M', 'R' 

    #Pre-processing: centered, scaled 
    #Resampling: Bootstrapped (25 reps) 

    #Summary of sample sizes: 157, 157, 157, 157, 157, 157, ... 

    #Resampling results across tuning parameters:

    #  ncomp  Accuracy   Kappa      Accuracy SD  Kappa SD  
    #  1      0.6889679  0.3756821  0.06015197   0.11605511
    #  2      0.7393776  0.4742204  0.04962609   0.09775688
    #  3      0.7410997  0.4793703  0.04856698   0.09412599

    #Accuracy was used to select the optimal model using  the largest value.
    #The final value used for the model was ncomp = 3. 

    #[[2]]
    #Partial Least Squares 

    #51 samples
    #60 predictors
    # 2 classes: 'M', 'R' 

    #Pre-processing: centered, scaled 
    #Resampling: Bootstrapped (25 reps) 

    #Summary of sample sizes: 51, 51, 51, 51, 51, 51, ... 

    #Resampling results across tuning parameters:

    #  ncomp  Accuracy   Kappa      Accuracy SD  Kappa SD 
    #  1      0.6452693  0.2929118  0.08076455   0.1525176
    #  2      0.6468405  0.2902136  0.09686340   0.1790924
    #  3      0.6559113  0.3087227  0.08025215   0.1547317

    #Accuracy was used to select the optimal model using  the largest value.
    #The final value used for the model was ncomp = 3.          

    set.seed(1234)
    pls1.3..B1 <- train(both[[1]],
                    labels[[1]],
                    method = "pls",
                    preProc = c("center", "scale"))
    pls1.3..B1
    #Partial Least Squares 

    #157 samples
    # 60 predictor
    #  2 classes: 'M', 'R' 

    #Pre-processing: centered, scaled 
    #Resampling: Bootstrapped (25 reps) 

    #Summary of sample sizes: 157, 157, 157, 157, 157, 157, ... 

    #Resampling results across tuning parameters:

    #  ncomp  Accuracy   Kappa      Accuracy SD  Kappa SD  
    #  1      0.6889679  0.3756821  0.06015197   0.11605511
    #  2      0.7393776  0.4742204  0.04962609   0.09775688
    #  3      0.7410997  0.4793703  0.04856698   0.09412599

    #Accuracy was used to select the optimal model using  the largest value.
    #The final value used for the model was ncomp = 3. 

    set.seed(1234)
    pls1.3..B2 <- train(both[[2]],
                    labels[[2]],
                    method = "pls",
                    preProc = c("center", "scale"))
    pls1.3..B2

    #Partial Least Squares 

    #51 samples
    #60 predictors
    # 2 classes: 'M', 'R' 

    #Pre-processing: centered, scaled 
    #Resampling: Bootstrapped (25 reps) 

    #Summary of sample sizes: 51, 51, 51, 51, 51, 51, ... 

    #Resampling results across tuning parameters:

    #  ncomp  Accuracy   Kappa      Accuracy SD  Kappa SD 
    #  1      0.6127279  0.2518488  0.11925682   0.1959400
    #  2      0.6792163  0.3618657  0.09386771   0.1776549
    #  3      0.6673662  0.3343716  0.07524373   0.1476405

    #Accuracy was used to select the optimal model using  the largest value.
    #The final value used for the model was ncomp = 2.  

1 Answer

If you use the following, you will get (close to) the results you expect:

    set.seed(1234)
    pls1.3..B <- train(labels[[2]] ~ .,
                       data = both[[2]],
                       method = "pls",
                       preProc = c("center", "scale"))
    pls1.3..B

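I believe this is because of the way you specified the formula. `object ~ .` makes the formula use everything in `data` that is not the column `object`. As specified in the `mapply` call, it is basically `external object ~ entire data.frame`, which includes the `Class` label. So I believe it is like training with your response variable still in the data set.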
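Separately from the formula issue, the seeding in the question may also contribute to the mismatch: `set.seed(1234)` is called once before `mapply`, so only the first `train` call starts from that seed, while each standalone call is reseeded. That would be consistent with `[[1]]` matching and `[[2]]` not matching. Below is a minimal sketch of reseeding inside the mapped function. It assumes caret, mlbench and pls are installed and that no parallel backend is registered; `sets`, `predictors` and `fit_one` are illustrative names, not from the original code.

```r
library(caret)    # provides train() and createDataPartition()
library(mlbench)  # provides the Sonar data set
# method = "pls" additionally needs the pls package to be installed

data(Sonar)
set.seed(1234)
inTrain <- createDataPartition(y = Sonar$Class, p = .75, list = FALSE)

# Two separate training sets, split as in the question
sets <- list(Sonar[inTrain, ], Sonar[-inTrain, ])
labels <- lapply(sets, function(d) d$Class)        # outcome vectors
predictors <- lapply(sets, function(d) d[, 1:60])  # feature columns only

# Hypothetical helper: reseed immediately before every fit, so each model
# draws its bootstrap resamples from the same RNG state as a standalone call
fit_one <- function(x, y, seed = 1234) {
  set.seed(seed)
  train(x, y, method = "pls", preProc = c("center", "scale"))
}

fits <- mapply(fit_one, x = predictors, y = labels, SIMPLIFY = FALSE)

# Standalone fit on the second data set, seeded the same way
set.seed(1234)
standalone <- train(predictors[[2]], labels[[2]],
                    method = "pls", preProc = c("center", "scale"))

# Compare the resampling tables of the mapped and standalone fits
all.equal(fits[[2]]$results, standalone$results)
```

If the comparison still reports differences with a parallel backend such as doMC registered, the `seeds` argument of `trainControl` is the documented way to fix the per-resample seeds.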

answered 2015-05-31T02:06:56.883