I am currently trying to reproduce the SVM recursive feature elimination (RFE) algorithm with parallel processing, but I am running into problems with the parallel backend.
When the RFE SVM run does complete in parallel, it takes roughly 250 seconds. Most of the time, however, it never finishes and has to be shut down manually after 30 minutes. When that happens, Activity Monitor shows that the worker processes are still running even though RStudio has killed the job, and the cores have to be terminated from the terminal with killall R.
The code, taken from the AppliedPredictiveModeling package with the superfluous parts removed, is as follows:
library(AppliedPredictiveModeling)
data(AlzheimerDisease)
## The baseline set of predictors
bl <- c("Genotype", "age", "tau", "p_tau", "Ab_42", "male")
## The set of new assays
newAssays <- colnames(predictors)
newAssays <- newAssays[!(newAssays %in% c("Class", bl))]
## Decompose the genotype factor into binary dummy variables
predictors$E2 <- predictors$E3 <- predictors$E4 <- 0
predictors$E2[grepl("2", predictors$Genotype)] <- 1
predictors$E3[grepl("3", predictors$Genotype)] <- 1
predictors$E4[grepl("4", predictors$Genotype)] <- 1
genotype <- predictors$Genotype
## Partition the data
library(caret)
set.seed(730)
split <- createDataPartition(diagnosis, p = .8, list = FALSE)
adData <- predictors
adData$Class <- diagnosis
training <- adData[ split, ]
testing <- adData[-split, ]
predVars <- names(adData)[!(names(adData) %in% c("Class", "Genotype"))]
## This summary function is used to evaluate the models.
fiveStats <- function(...) c(twoClassSummary(...), defaultSummary(...))
## We create the cross-validation files as a list to use with different
## functions
set.seed(104)
index <- createMultiFolds(training$Class, times = 5)
## The candidate set of the number of predictors to evaluate
varSeq <- seq(1, length(predVars)-1, by = 2)
# Beginning parallelization
library(doParallel)
cl <- makeCluster(7)
registerDoParallel(cl)
getDoParWorkers()
# Rfe and train control objects created
ctrl <- rfeControl(method = "repeatedcv", repeats = 5,
                   saveDetails = TRUE,
                   index = index,
                   returnResamp = "final")
fullCtrl <- trainControl(method = "repeatedcv",
                         repeats = 5,
                         summaryFunction = fiveStats,
                         classProbs = TRUE,
                         index = index)
## Here, the caretFuncs list allows for a model to be tuned at each iteration
## of feature selection.
ctrl$functions <- caretFuncs
ctrl$functions$summary <- fiveStats
## This option tells train() to run its model tuning
## sequentially. Otherwise, there would be parallel processing at two
## levels, which is possible but requires W^2 workers. On our machine,
## it was more efficient to only run the RFE process in parallel.
cvCtrl <- trainControl(method = "cv",
                       verboseIter = FALSE,
                       classProbs = TRUE,
                       allowParallel = FALSE)
set.seed(721)
svmRFE <- rfe(training[, predVars],
              training$Class,
              sizes = varSeq,
              rfeControl = ctrl,
              metric = "ROC",
              ## Now arguments to train() are used.
              method = "svmRadial",
              tuneLength = 12,
              preProc = c("center", "scale"),
              trControl = cvCtrl)
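For what it is worth, I assume the cluster is supposed to be shut down explicitly once rfe() returns; the lines below are my understanding of the standard teardown (they are not part of the book's script), using stopCluster() from parallel and registerDoSEQ() from foreach, both loaded via doParallel.
## Assumed cleanup after rfe() finishes (not in the original script)
stopCluster(cl)   # shut down the 7 worker processes
registerDoSEQ()   # re-register the sequential backend so later foreach calls
                  # do not look for workers that no longer exist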
This is not the only model that gives me trouble; random forests with RFE sometimes cause the same problem. The original code uses the doMC package, but inspection of Activity Monitor shows several rsession processes being used for the parallelization, and my guess is that they are running under the GUI, because when the computation does not stop, shutting it down requires aborting the entire R session and restarting it rather than simply abandoning the computation. The former, of course, has the unfortunate consequence of wiping my environment.
I am using an 8-core mid-2013 MacBook Pro.
Any idea what might be causing this? Is there a way to fix it, and if so, how? And is there a way to make sure the parallelization runs without the GUI, without having to run the script from the terminal (I would like to control which models are run and when)?
Edit: It seems that after quitting the failed execution, R fails on all subsequent tasks parallelized through caret, even ones that previously ran fine. This implies that the cluster is no longer running.
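On the assumption that this stale registration is what breaks the later runs, the only workaround I can think of is to reset the foreach backend explicitly before launching anything else; this is a sketch rather than a confirmed fix:
## Sketch of resetting the backend after a failed run (assumption, not a verified fix)
registerDoSEQ()          # drop the registration pointing at the dead workers
cl <- makeCluster(7)     # start a fresh set of workers
registerDoParallel(cl)
getDoParWorkers()        # should report 7 again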