
I am using sklearn to carry out recursive feature elimination with cross-validation, using the RFECV class. RFE involves repeatedly training an estimator on the current set of features, then removing the least informative features, until it arrives at the optimal number of features.

To get the best performance from the estimator, I want to select the best hyperparameters for it at each number of features. The estimator is a linear SVM, so I am only tuning the C parameter.

Initially, my code was as follows. However, this just did one grid search for C at the beginning, and then used the same C for each iteration.

from sklearn.cross_validation import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn import svm, grid_search

def get_best_feats(data,labels,c_values):

    parameters = {'C':c_values}

    # svm1 is passed to clf, which grid-searches for the best C
    svm1 = svm.SVC(kernel='linear')
    clf = grid_search.GridSearchCV(svm1, parameters, refit=True)
    clf.fit(data,labels)
    #print 'best C',clf.best_params_['C']

    # svm2 uses the optimal hyperparameters found for svm1
    svm2 = svm.SVC(C=clf.best_params_['C'], kernel='linear')
    # svm2 is then passed to RFECV as the estimator for recursive feature elimination
    rfecv = RFECV(estimator=svm2, step=1, cv=StratifiedKFold(labels, 5))
    rfecv.fit(data,labels)

    print "support:",rfecv.support_
    return data[:,rfecv.support_]

The documentation for RFECV gives the parameter "estimator_params : Parameters for the external estimator. Useful for doing grid searches when an RFE object is passed as an argument to, e.g., a sklearn.grid_search.GridSearchCV object."

Therefore I want to try to pass my object 'rfecv' to the grid search object, as follows:

def get_best_feats2(data,labels,c_values):

    parameters = {'C':c_values}
    svm1 = svm.SVC(kernel='linear')
    rfecv = RFECV(estimator=svm1, step=1, cv=StratifiedKFold(labels, 5), estimator_params=parameters)
    rfecv.fit(data, labels)

    print "Kept {} out of {} features".format((data[:,rfecv.support_]).shape[1], data.shape[1])


    print "support:",rfecv.support_
    return data[:,rfecv.support_]

X,y = get_heart_data()


c_values = [0.1,1.,10.]
get_best_feats2(X,y,c_values)

But this returns the error:

max_iter=self.max_iter, random_seed=random_seed)
File "libsvm.pyx", line 59, in sklearn.svm.libsvm.fit (sklearn/svm/libsvm.c:1674)
TypeError: a float is required

So my question is: how can I pass the rfe object to the grid search in order to do cross-validation for each iteration of recursive feature elimination?

Thanks


1 Answer


So you want to grid-search the C of the SVM for each number of features in the RFE? Or for each CV iteration in the RFECV? From your last sentence, I guess it is the former.

You can do RFE(GridSearchCV(SVC(), param_grid)) to achieve that, though I'm not sure it is actually a helpful thing to do.

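I don't think the second is possible right now (but it will be soon). You could do GridSearchCV(RFECV(), param_grid={'estimator__C': Cs_to_try}), but that nests two sets of cross-validation inside each other.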
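
To make the nested variant concrete, here is a minimal sketch. It assumes a recent scikit-learn in which GridSearchCV and StratifiedKFold live in sklearn.model_selection (the question's code uses the older sklearn.grid_search and sklearn.cross_validation modules), and it uses a synthetic dataset as a stand-in for the asker's data. The fold counts and dataset sizes are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in for the asker's data (assumption: not get_heart_data()).
X, y = make_classification(n_samples=120, n_features=8, n_informative=3,
                           random_state=0)

Cs_to_try = [0.1, 1.0, 10.0]

# Inner cross-validation: for a fixed C, RFECV chooses how many features to keep.
rfecv = RFECV(estimator=SVC(kernel="linear"), step=1,
              cv=StratifiedKFold(n_splits=3))

# Outer cross-validation: GridSearchCV refits the whole RFECV for every C.
# 'estimator__C' addresses the C parameter of the SVC wrapped by RFECV.
search = GridSearchCV(rfecv, param_grid={"estimator__C": Cs_to_try},
                      cv=StratifiedKFold(n_splits=3))
search.fit(X, y)

best_C = search.best_params_["estimator__C"]
support = search.best_estimator_.support_  # boolean mask of the kept features
X_selected = X[:, support]
```

Note that this picks a single C for the whole elimination run rather than a separate C per number of features, and the cost grows quickly because every outer fold and every C value runs a complete RFECV.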

Update: GridSearchCV has no coef_, so the first approach fails. A simple fix:

class GridSeachWithCoef(GridSearchCV):
    @property
    def coef_(self):
        return self.best_estimator_.coef_

Then use that instead.
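
A minimal sketch of how the wrapper might be used inside RFE, so that C is re-tuned each time the feature set shrinks. It assumes a recent scikit-learn (GridSearchCV imported from sklearn.model_selection) and a synthetic dataset in place of the asker's data. The target of 3 features and the 3-fold inner search are arbitrary illustration values.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


class GridSeachWithCoef(GridSearchCV):
    """GridSearchCV that exposes the coef_ of its best refitted estimator."""

    @property
    def coef_(self):
        return self.best_estimator_.coef_


# Synthetic stand-in for the asker's data (assumption: not get_heart_data()).
X, y = make_classification(n_samples=120, n_features=8, n_informative=3,
                           random_state=0)

c_values = [0.1, 1.0, 10.0]

# Each RFE step fits a fresh clone of this search on the remaining features,
# so C is chosen again for every feature-set size.
inner_search = GridSeachWithCoef(SVC(kernel="linear"),
                                 param_grid={"C": c_values}, cv=3)

rfe = RFE(estimator=inner_search, n_features_to_select=3, step=1)
rfe.fit(X, y)

kept_mask = rfe.support_                     # boolean mask of the kept features
final_C = rfe.estimator_.best_params_["C"]   # C chosen on the final feature set
X_selected = rfe.transform(X)
```

RFE alone stops at a fixed number of features, so choosing that number still needs a separate comparison. Recent scikit-learn versions also document an importance_getter parameter on RFE that may make the subclass unnecessary; that route is not tested here.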

Answered 2015-04-09T13:16:28.917