I have a classification problem, and one of the predictors is a categorical variable X with four levels (A, B, C, D) that was converted into three dummy variables (A, B, C). I am trying to use Recursive Feature Elimination (RFE) from the caret package for feature selection. How do I tell the rfe function to treat the dummy variables for X as a group, so that if, say, A is excluded, B and C are excluded too?
After fighting with this all day, I'm still getting nowhere. Feeding rfe through the formula interface doesn't work either; I think rfe automatically converts any factors to dummy variables.
Below is my example code:
# rfe settings
lrFuncs$summary <- twoClassSummary
trainctrl <- trainControl(classProbs = TRUE,
                          summaryFunction = twoClassSummary)
ctrl <- rfeControl(functions = lrFuncs, method = "cv", number = 3)
# Pre-process the data to exclude near-zero-variance and highly correlated variables
x <- training[, c(1, 4:25, 27:39)]
x2 <- model.matrix(~ ., data = x)[, -1]
nzv <- nearZeroVar(x2, freqCut = 300/1)
x3 <- x2[, -nzv]
corr_mat <- cor(x3)
too_high <- findCorrelation(corr_mat, cutoff = .9)
x4 <- x3[, -too_high]
# nzv indexes the columns of x2; too_high indexes the columns of x3
excludes <- c(colnames(x2)[nzv], colnames(x3)[too_high])
# Exclude the variables identified above
x_frame <- x[, -which(names(x) %in% excludes)]
# Run rfe
# This fails with the error below
glmProfile <- rfe(x_frame, y, sizes = subsets, rfeControl = ctrl, trControl = trainctrl, metric = "ROC")
Error in { : task 1 failed - "undefined columns selected"
In addition: Warning messages:
1: glm.fit: fitted probabilities numerically 0 or 1 occurred
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
3: glm.fit: fitted probabilities numerically 0 or 1 occurred
# It works if I convert x_frame to a matrix and then back to a data frame, but then rfe may remove individual dummy variables (e.g. remove A but leave B and C)
glmProfile <- rfe(data.frame(model.matrix(~ ., data = x_frame)[, -1]), y, sizes = subsets, rfeControl = ctrl, trControl = trainctrl, metric = "ROC")
Here, x_frame contains categorical variables (factors) that have multiple levels.
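To illustrate what the model.matrix conversion does to a factor, here is a minimal sketch in base R. The data frame `toy` and its columns are made up for illustration and are not my real data. It shows that each dummy column can be traced back to the variable it came from through the "assign" attribute of the design matrix, which is the grouping I would like rfe to respect. This is only a sketch of the grouping, not something rfe does on its own as far as I can tell.

```r
# Toy data (hypothetical, not the real training set):
# one 4-level factor and one numeric predictor
toy <- data.frame(
  X   = factor(c("A", "B", "C", "D", "A", "B")),
  num = c(1.2, 0.7, 3.1, 2.4, 1.9, 0.3)
)

# Full design matrix, intercept still included so that the
# "assign" attribute is kept (subsetting columns would drop it)
mm <- model.matrix(~ ., data = toy)

# "assign" gives, for each column, the index of the term it came from
# (0 = intercept); term labels come from the expanded formula
term_labels <- attr(terms(~ ., data = toy), "term.labels")
assign_idx  <- attr(mm, "assign")

# Drop the intercept and group dummy column names by originating variable
keep   <- assign_idx != 0
groups <- split(colnames(mm)[keep], term_labels[assign_idx[keep]])
print(groups)
```

With R's default treatment contrasts, the first level of the factor is the reference level, so the three dummy columns for X all end up in one group that could be kept or dropped together.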
Any help is highly appreciated!