I would recommend one of the following two strategies:

1. Sample a subset of the majority classes down to a number more in line with the smaller class. Repeat this several times, recording the important features each time. Then check whether some features are consistently among the most important across all subsets.
2. Resample the smaller class to get a synthetically inflated number of samples. Essentially, estimate its covariance, draw random samples from that, fit the model on this data (and drop the synthetic samples before estimating performance). So in a sense you are only borrowing the synthetic data to stabilize the model-fitting process.

The first one is probably the less complicated.
Here is a simple demonstration of approach 1:
## Using the `mpg` dataset, pretending the 'drv' column is of particular interest to us.
##
## 'drv' is a column with three levels, that are not very balanced:
##
## table( mpg$drv )
## 4 f r
## 103 106 25
## Let's sub-sample 25 of each class, which makes sense given the table above
n.per.class <- 25
## let's do the sampling 10 times
n.times <- 10
library(ggplot2) ## for the mpg data
library(foreach) ## for parallel work
library(doMC)
registerDoMC()
unique.classes <- unique( mpg$drv ) ## or just use levels( mpg$drv ) if you have a factor
variable.importances <- foreach( i=1:n.times ) %dopar% {
j <- sapply(
unique.classes,
function(cl.name) {
sample( which( mpg$drv == cl.name ), size=n.per.class )
},
simplify=FALSE
)
## 'j' is now a named list; unlist() turns it into a plain vector of row indices:
sub.data <- mpg[ unlist(j), ]
## table( sub.data$drv )
## 4 f r
## 25 25 25
##
## 25 of each!
fit <- train.your.model( sub.data ) ## placeholder for your model-fitting call
variable.importance( fit )          ## placeholder; the last expression is what foreach collects
## I don't know what the importances will look like in your case, is it a numeric vector perhaps?
}
## variable.importances is now a list with the result of each
## iteration. If each result is a numeric vector, for example, the
## following collects them into a matrix:
matrix.of.variable.importances <- Reduce( rbind, variable.importances )
colnames( matrix.of.variable.importances ) <- colnames( your.data )
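Once the importances are in such a matrix, a quick way to check the "consistently important" part of approach 1 is to rank the features within each iteration and look at the average rank. A small self-contained sketch (the toy matrix and its column names are made up here, standing in for `matrix.of.variable.importances`):

```r
## Toy stand-in: 10 iterations (rows) x 4 features (columns), higher = more important
set.seed(1)
imp <- matrix( runif(40), nrow=10 )
colnames(imp) <- c("displ", "cyl", "cty", "hwy")
## make one feature dominate in every iteration, for demonstration
imp[, "displ"] <- imp[, "displ"] + 1

## rank the features within each iteration (1 = most important)
ranks <- t( apply( imp, 1, function(x) rank(-x) ) )
## features with a low average rank are consistently important
sort( colMeans(ranks) )
```

Features that float to the top of this sorted vector in every run are the ones approach 1 is looking for.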
If you are interested in approach 2, I would suggest looking at the caret package, which makes this kind of thing easy to do, but I don't know whether they support your particular method.
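For completeness, here is a minimal sketch of approach 2 on the same data. The use of `MASS::mvrnorm` and the choice of numeric columns are my own assumptions for illustration, not a specific caret recipe:

```r
library(MASS)    ## for mvrnorm()
library(ggplot2) ## for the mpg data

## use only numeric columns; 'r' is the minority class in mpg$drv
num.cols <- c("displ", "cty", "hwy")
minority <- mpg[ mpg$drv == "r", num.cols ]

## estimate the mean and covariance of the minority class
mu    <- colMeans( minority )
sigma <- cov( as.matrix( minority ) )

## draw enough synthetic rows to match the largest class
n.needed  <- max( table( mpg$drv ) ) - nrow( minority )
synthetic <- as.data.frame( mvrnorm( n.needed, mu=mu, Sigma=sigma ) )
synthetic$drv <- "r"

## rbind the synthetic rows to the real data, fit your model on the
## union, but evaluate performance on the real rows only
```

Keeping the synthetic rows out of the performance estimate is the important part; they are only there to stabilize the fit.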