r - Selecting variables in R using kNN

Question

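I have a data frame (df) with 72 observations and 592 variables plus one factor class variable (593 variables in total, i.e. dim(df) = 72 593). I am looking for a way to select 7 variables (including the class variable), using the receiver operating characteristic (ROC) to choose the best value of k. I want to analyse these seven variables with a graphical model, but I do not want to pick them at random; I want the selection to be statistically justified.

The result I would like to see is something like:

Variables V23, V120, V230, V333, V496, V585, V593 were selected based on the highest value of ROC.

That is, I want to classify and select the "best" predictors with high accuracy, so that I can use those variables for graphical modelling.

I have tried the caret package, but I do not know how to use it to select high-accuracy variables (columns) that can be used in further analysis.

Thank you all. I trust someone understands me.

Thanks.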

库特克斯。

score 0 · Accepted Answer

I would do something like this:

library(pROC)

#' Select the N top variables with ROC analysis
#' @param response the class variable name
#' @param predictors the variables names from which to select
#' @param data must contain the response and the predictors as columns
#' @param n the number of variables to select
select.top.N.ROC <- function(response, predictors, data, n) {
    n <- min(n, length(predictors))
    # Univariate AUC of each predictor against the class variable
    aucs <- sapply(predictors, function(predictor) {
        as.numeric(auc(data[[response]], data[[predictor]]))
    })
    # Predictor names ordered by decreasing AUC, truncated to the top n
    return(predictors[order(aucs, decreasing=TRUE)][seq_len(n)])
}

# Adjust the response name and the predictor names to match your own data frame
top.variables <- select.top.N.ROC("class", paste("V", 1:593, sep=""), myDataFrame, 7)
cat(paste("Variables", paste(top.variables, collapse=", "), "were selected based on the highest value of ROC.\n"))

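As with any univariate feature selection method, you may end up selecting 7 perfectly correlated variables that give you no additional information, in which case selecting V23 alone would be sufficient. With a multivariate dataset, you should consider using a multivariate feature selection method instead.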
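
One simple way to reduce the redundancy described above, short of a full multivariate method, is a greedy filter: walk the predictors in ranked order and skip any that is strongly correlated with one already kept. The sketch below is only an illustration under assumptions. The helper name select.top.N.uncorrelated, the max.cor threshold of 0.9 and the toy data frame are all hypothetical, and none of them come from pROC or caret. It uses base R only and expects the ranked names as input (for example, the output of select.top.N.ROC called with n = length(predictors)).

```r
# Hypothetical helper (not part of pROC or caret): keep a ranked predictor
# only if its absolute correlation with every predictor kept so far is
# at most max.cor. The 0.9 default is an arbitrary assumption.
select.top.N.uncorrelated <- function(ranked, data, n, max.cor = 0.9) {
    kept <- character(0)
    for (candidate in ranked) {
        if (length(kept) >= n) break
        # Absolute Pearson correlation with each variable already kept
        cors <- vapply(kept, function(k) {
            abs(cor(data[[candidate]], data[[k]], use = "pairwise.complete.obs"))
        }, numeric(1))
        # isTRUE() also skips candidates whose correlation is NA
        if (isTRUE(all(cors <= max.cor))) kept <- c(kept, candidate)
    }
    kept
}

# Toy data (made up for illustration): V2 is a linear function of V1
set.seed(1)
x <- rnorm(72)
demo <- data.frame(V1 = x, V2 = 2 * x + 1, V3 = rnorm(72))
selected <- select.top.N.uncorrelated(c("V1", "V2", "V3"), demo, n = 2)
print(selected)
```

In this toy example V2 is skipped because it is perfectly correlated with V1, so V1 and V3 are kept. This is still a filter on pairwise correlations, not a multivariate selection method, and the threshold has to be chosen for the data at hand.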
