I would recommend one of the following two strategies:

1. Sample a subset of the majority classes down to a number more in line with the smaller class. Repeat this several times, recording the important features each time. Then check whether some features are consistently among the most important across all subsets.
2. Resample the smaller class to get a synthetically inflated number of samples. Essentially, estimate its covariance, draw random samples from that, fit the model on this data (and drop the synthetic samples before estimating performance). So in a sense you are only borrowing the synthetic data to stabilize the model-fitting process.

The first one is probably the less complicated.
Here is a simple demonstration of approach 1:
## Using the `mpg` dataset, pretending the 'drv' column is of particular interest to us.
##
## 'drv' is a column with three levels, that are not very balanced:
##
## table( mpg$drv )
## 4 f r
## 103 106 25
## Let's sub-sample 25 of each class, which makes sense given the table above
n.per.class <- 25
## let's do the sampling 10 times
n.times <- 10
library(ggplot2) ## for the mpg data
library(foreach) ## for parallel work
library(doMC)
registerDoMC()
unique.classes <- unique( mpg$drv ) ## or just use levels( mpg$drv ) if you have a factor
variable.importances <- foreach( i=1:n.times ) %dopar% {
j <- sapply(
unique.classes,
function(cl.name) {
sample( which( mpg$drv == cl.name ), size=n.per.class )
},
simplify=FALSE
)
## 'j' is now a named list; unlist() turns it into a plain vector of row indices:
sub.data <- mpg[ unlist(j), ]
## table( sub.data$drv )
## 4 f r
## 25 25 25
##
## 25 of each!
fit <- train.your.model( sub.data ) ## placeholder for your model-fitting call
variable.importance( fit )          ## placeholder; the last expression is what foreach collects
## I don't know what the importances will look like in your case, is it a numeric vector perhaps?
}
## variable.importances is now a list with the result of each
## iteration. If each result is a numeric vector, for example, the
## following collects them into a matrix:
matrix.of.variable.importances <- Reduce( rbind, variable.importances )
colnames( matrix.of.variable.importances ) <- colnames( your.data )
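Once the importances are in such a matrix, a quick way to check the "consistently important" part of approach 1 is to rank the features within each iteration and look at the average rank. A small self-contained sketch (the toy matrix and its column names are made up here, standing in for `matrix.of.variable.importances`):

```r
## Toy stand-in: 10 iterations (rows) x 4 features (columns), higher = more important
set.seed(1)
imp <- matrix( runif(40), nrow=10 )
colnames(imp) <- c("displ", "cyl", "cty", "hwy")
## make one feature dominate in every iteration, for demonstration
imp[, "displ"] <- imp[, "displ"] + 1

## rank the features within each iteration (1 = most important)
ranks <- t( apply( imp, 1, function(x) rank(-x) ) )
## features with a low average rank are consistently important
sort( colMeans(ranks) )
```

Features that float to the top of this sorted vector in every run are the ones approach 1 is looking for.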
If you are interested in approach 2, I would suggest looking at the caret package, which makes this kind of thing easy to do, but I don't know whether they support your particular method.
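For completeness, here is a minimal sketch of approach 2 on the same data. The use of `MASS::mvrnorm` and the choice of numeric columns are my own assumptions for illustration, not a specific caret recipe:

```r
library(MASS)    ## for mvrnorm()
library(ggplot2) ## for the mpg data

## use only numeric columns; 'r' is the minority class in mpg$drv
num.cols <- c("displ", "cty", "hwy")
minority <- mpg[ mpg$drv == "r", num.cols ]

## estimate the mean and covariance of the minority class
mu    <- colMeans( minority )
sigma <- cov( as.matrix( minority ) )

## draw enough synthetic rows to match the largest class
n.needed  <- max( table( mpg$drv ) ) - nrow( minority )
synthetic <- as.data.frame( mvrnorm( n.needed, mu=mu, Sigma=sigma ) )
synthetic$drv <- "r"

## rbind the synthetic rows to the real data, fit your model on the
## union, but evaluate performance on the real rows only
```

Keeping the synthetic rows out of the performance estimate is the important part; they are only there to stabilize the fit.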