I am using the excellent R package caret, and I would like to run the train function over a list of training data sets. I realize the documentation for train says the data argument must be a data frame, so what I'm attempting may simply not be possible (and might be better raised as an enhancement request for caret), but I wanted to see whether anyone has tried this.
Using the Sonar data for illustration, I created a list (named both) consisting of two data frames, each a separate training data set, and then used mapply to apply the train function to each element of the list. Unfortunately, the results are not what I expected. Specifically, I would expect the metrics in pls1.3..A[[2]] to match those in pls1.3..B2; as you can see, they do not. Oddly, pls1.3..A[[1]] does match pls1.3..B1. Is there something obvious I'm doing wrong, or is this simply not possible (yet)? (I'm running R 3.1.1 on a 1.4 GHz Intel Core i5 Mac.)
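One thing I suspect (this is an assumption on my part, not something I have confirmed for caret): the two train calls inside mapply share a single RNG stream, so the second call starts from wherever the first call left the random-number state, whereas the standalone pls1.3..B2 call gets a fresh set.seed(1234) immediately beforehand. A minimal sketch of that mechanism, with plain runif standing in for train's bootstrap resampling:

```r
set.seed(1234)
draw1 <- runif(3)  # first mapply iteration consumes part of the RNG stream
draw2 <- runif(3)  # second iteration starts from the *advanced* state

set.seed(1234)
fresh <- runif(3)  # a standalone call made right after a fresh seed

identical(fresh, draw1)  # TRUE  -- first iteration matches the standalone call...
identical(fresh, draw2)  # FALSE -- ...but the second iteration does not
```

If that's the mechanism at work here, it would explain exactly the pattern I see: pls1.3..A[[1]] matches pls1.3..B1, but pls1.3..A[[2]] does not match pls1.3..B2.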
Reproducible code (with the output included as comments) follows:
require(doMC)
registerDoMC(cores = 2)
library(caret)
library(mlbench)
data(Sonar)
set.seed(1234)
inTrain <- createDataPartition(y = Sonar$Class,
                               p = .75,
                               list = FALSE)
training <- Sonar[ inTrain,]
training2 <- Sonar[-inTrain,]
both <- list(training, training2)
#both_test <- list(training[c(1:100),], training2[c(1:35),]) #SILLY test data for functionality testing only
set.seed(1234)
labels <- list()
for (i in 1:length(both)) {
  labels[i] <- list(both[[i]]$Class)
}
#NEW CODE -- ADDED BASED ON @Josh W's comment -- removing the label (Class) variable from the feature matrix
both <- lapply(both, function(x) x[, 1:60])
#NEW CODE -- changed from using the formula implementation of caret to the x (feature matrix), y (label/outcome vector)
pls1.3..A <- mapply(function(x, y) train(x, y, method = "pls", preProc = c("center", "scale")),
                    x = both, y = labels, SIMPLIFY = FALSE)
pls1.3..A
#[[1]]
#Partial Least Squares
#157 samples
# 60 predictor
# 2 classes: 'M', 'R'
#Pre-processing: centered, scaled
#Resampling: Bootstrapped (25 reps)
#Summary of sample sizes: 157, 157, 157, 157, 157, 157, ...
#Resampling results across tuning parameters:
# ncomp Accuracy Kappa Accuracy SD Kappa SD
# 1 0.6889679 0.3756821 0.06015197 0.11605511
# 2 0.7393776 0.4742204 0.04962609 0.09775688
# 3 0.7410997 0.4793703 0.04856698 0.09412599
#Accuracy was used to select the optimal model using the largest value.
#The final value used for the model was ncomp = 3.
#[[2]]
#Partial Least Squares
#51 samples
#60 predictors
# 2 classes: 'M', 'R'
#Pre-processing: centered, scaled
#Resampling: Bootstrapped (25 reps)
#Summary of sample sizes: 51, 51, 51, 51, 51, 51, ...
#Resampling results across tuning parameters:
# ncomp Accuracy Kappa Accuracy SD Kappa SD
# 1 0.6452693 0.2929118 0.08076455 0.1525176
# 2 0.6468405 0.2902136 0.09686340 0.1790924
# 3 0.6559113 0.3087227 0.08025215 0.1547317
#Accuracy was used to select the optimal model using the largest value.
#The final value used for the model was ncomp = 3.
set.seed(1234)
pls1.3..B1 <- train(both[[1]],
                    labels[[1]],
                    method = "pls",
                    preProc = c("center", "scale"))
pls1.3..B1
#Partial Least Squares
#157 samples
# 60 predictor
# 2 classes: 'M', 'R'
#Pre-processing: centered, scaled
#Resampling: Bootstrapped (25 reps)
#Summary of sample sizes: 157, 157, 157, 157, 157, 157, ...
#Resampling results across tuning parameters:
# ncomp Accuracy Kappa Accuracy SD Kappa SD
# 1 0.6889679 0.3756821 0.06015197 0.11605511
# 2 0.7393776 0.4742204 0.04962609 0.09775688
# 3 0.7410997 0.4793703 0.04856698 0.09412599
#Accuracy was used to select the optimal model using the largest value.
#The final value used for the model was ncomp = 3.
set.seed(1234)
pls1.3..B2 <- train(both[[2]],
                    labels[[2]],
                    method = "pls",
                    preProc = c("center", "scale"))
pls1.3..B2
#Partial Least Squares
#51 samples
#60 predictors
# 2 classes: 'M', 'R'
#Pre-processing: centered, scaled
#Resampling: Bootstrapped (25 reps)
#Summary of sample sizes: 51, 51, 51, 51, 51, 51, ...
#Resampling results across tuning parameters:
# ncomp Accuracy Kappa Accuracy SD Kappa SD
# 1 0.6127279 0.2518488 0.11925682 0.1959400
# 2 0.6792163 0.3618657 0.09386771 0.1776549
# 3 0.6673662 0.3343716 0.07524373 0.1476405
#Accuracy was used to select the optimal model using the largest value.
#The final value used for the model was ncomp = 2.
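For completeness, here is a workaround I am considering, though I have not yet verified that it fixes the caret case (and the doMC parallel backend may complicate things): re-seed inside the function passed to mapply, so every train call starts from an identical RNG state. To keep the pattern testable without caret, the sketch below uses a stand-in fit function (fit_stub is a hypothetical name of mine, not part of any package):

```r
# Stand-in for train(): any fit whose result depends on the RNG state.
fit_stub <- function(x, y) mean(x) + runif(1)

xs <- list(1:10, 1:5)
ys <- list(rep(0, 10), rep(1, 5))

# Original pattern: one seed before mapply, a single stream shared
# across iterations.
set.seed(1234)
shared <- mapply(fit_stub, x = xs, y = ys, SIMPLIFY = FALSE)

# Workaround: re-seed inside the wrapper, one fresh stream per data set.
reseeded <- mapply(function(x, y) { set.seed(1234); fit_stub(x, y) },
                   x = xs, y = ys, SIMPLIFY = FALSE)

# Standalone call for the second data set, analogous to pls1.3..B2.
set.seed(1234)
standalone2 <- fit_stub(xs[[2]], ys[[2]])

identical(reseeded[[2]], standalone2)  # TRUE
identical(shared[[2]], standalone2)    # FALSE
```

Applied to the code above, that would mean wrapping the call as mapply(function(x, y) { set.seed(1234); train(x, y, method = "pls", preProc = c("center", "scale")) }, ...), but I'd welcome confirmation that this is the right fix.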