I would do something like this:
library(pROC)
#' Select the N top variables with ROC analysis
#' @param response the class variable name
#' @param predictors the variable names from which to select
#' @param data must contain the response and the predictors as columns
#' @param n the number of variables to select
select.top.N.ROC <- function(response, predictors, data, n) {
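  n <- min(n, length(predictors))
  # AUC of each predictor taken on its own against the response
  aucs <- sapply(predictors, function(predictor) {
    as.numeric(auc(data[[response]], data[[predictor]]))
  })
  # head() also behaves correctly when n is 0, unlike [1:n]
  head(predictors[order(aucs, decreasing = TRUE)], n)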
}
top.variables <- select.top.N.ROC("class", paste("V", 1:593, sep=""), myDataFrame, 7)
cat(paste("Variables", paste(top.variables, collapse=", "), "were selected based on the highest AUC.\n"))
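As with any univariate feature selection method, you may end up selecting 7 perfectly correlated variables that give you no additional information, in which case selecting V23 alone would have been enough. With a multivariate dataset, you should consider using a multivariate feature selection method instead.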
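
If you want to stay with the univariate ranking, one cheap way to soften the redundancy problem is to skip variables that are strongly correlated with ones already kept. The sketch below is only an illustration: `drop.correlated` and its `max.cor` threshold are hypothetical names, not part of pROC, and this greedy filter is not a true multivariate selection method. It assumes the predictors are numeric and that `ranked` is already sorted from best to worst.

```r
# Hypothetical helper (not part of pROC): walk the variables in ranked order
# and keep one only if its absolute correlation with every variable kept so
# far is below max.cor (an assumed, arbitrary threshold).
drop.correlated <- function(ranked, data, n, max.cor = 0.9) {
  kept <- character(0)
  for (candidate in ranked) {
    if (length(kept) >= n) break
    cors <- vapply(kept, function(k) {
      abs(cor(data[[candidate]], data[[k]], use = "pairwise.complete.obs"))
    }, numeric(1))
    # isTRUE() also rejects a candidate whose correlation with a kept
    # variable is NA (for example a constant column)
    if (isTRUE(all(cors < max.cor))) kept <- c(kept, candidate)
  }
  kept
}

# Toy data: b is an exact multiple of a, c is nearly uncorrelated with a
toy <- data.frame(a = 1:10, b = 2 * (1:10), c = rep(c(1, -1), 5))
drop.correlated(c("a", "b", "c"), toy, 2)  # "a" "c"
```

With the function above, you could rank all predictors first (call `select.top.N.ROC` with `n` equal to the number of predictors) and pass the result as `ranked`. This only removes pairwise linear redundancy, so it is a stopgap rather than a replacement for a proper multivariate method.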