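I am trying to select the best predictors of a variable y.
In the simulation below, x1 and x3 are the true predictors of y, x2 is correlated with x1, and x4 is a dummy variable that is unrelated to y.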
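library(randomForest)
library(caret)
set.seed(123)
# x1 and x3 drive y; x2 is x1 plus uniform noise; x4 is independent of y
x1 <- rnorm(1000, sd = .3, mean = -2)
x3 <- rnorm(1000, sd = 1, mean = .3)
x2 <- jitter(x1, amount = 1)
x4 <- rnorm(1000, sd = 4, mean = 3)
y <- jitter(3 * x1 + jitter(x3, amount = 2), amount = 2)
# variable importance from a single random forest fit
varImpPlot(randomForest(y ~ x1 + x2 + x3 + x4, importance = TRUE))
# recursive feature elimination with random forest helper functions, 3 resamples
ctrl <- rfeControl(functions = rfFuncs, number = 3)
x <- data.frame(x1, x2, x3, x4)
rfe(x, y, rfeControl = ctrl, sizes = 1:4, method = "rf")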
#...
#The top 4 variables (out of 4):
#x3, x1, x2, x4
cor(x)
#             x1          x2         x3          x4
# x1  1.00000000  0.45351111 0.08647944 -0.02470308
# x2  0.45351111  1.00000000 0.03927750 -0.08157149
# x3  0.08647944  0.03927750 1.00000000  0.04357772
# x4 -0.02470308 -0.08157149 0.04357772  1.00000000
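Why does the recursive feature elimination procedure tell me to keep all of the predictors, even though the variable importance plot makes it very clear that x2 and x4 are useless?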