python - ValueError when using recursive feature elimination with an rbf-kernel SVM in scikit-learn

Question

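I am trying to use recursive feature elimination (RFE) in scikit-learn, but I keep getting the error ValueError: coef_ is only available when using a linear kernel. I want to perform feature selection for a support vector classifier (SVC) with an rbf kernel. This example from the website runs fine: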

print(__doc__)

from sklearn.svm import SVC
from sklearn.cross_validation import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_classification
from sklearn.metrics import zero_one_loss

# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
                       n_redundant=2, n_repeated=0, n_classes=8,
                       n_clusters_per_class=1, random_state=0)

# Create the RFE object and compute a cross-validated score.
svc = SVC(kernel="linear")
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(y, 2),
          scoring='accuracy')
rfecv.fit(X, y)

print("Optimal number of features : %d" % rfecv.n_features_)

# Plot number of features VS. cross-validation scores
import pylab as pl
pl.figure()
pl.xlabel("Number of features selected")
pl.ylabel("Cross validation score (nb of misclassifications)")
pl.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
pl.show()

However, simply changing the kernel from linear to rbf, as shown below, produces the error:

print(__doc__)

from sklearn.svm import SVC
from sklearn.cross_validation import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_classification
from sklearn.metrics import zero_one_loss

# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
                       n_redundant=2, n_repeated=0, n_classes=8,
                       n_clusters_per_class=1, random_state=0)

# Create the RFE object and compute a cross-validated score.
svc = SVC(kernel="rbf")
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(y, 2),
          scoring='accuracy')
rfecv.fit(X, y)

print("Optimal number of features : %d" % rfecv.n_features_)

# Plot number of features VS. cross-validation scores
import pylab as pl
pl.figure()
pl.xlabel("Number of features selected")
pl.ylabel("Cross validation score (nb of misclassifications)")
pl.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
pl.show()

This looks like a bug, but if anyone can spot what I am doing wrong, that would be great. I am running Python 2.7.6 with scikit-learn version 0.14.1.

Thanks for your help!

score 11 · Accepted Answer

This seems to be the expected behavior. RFECV requires the estimator to have a coef_ attribute that indicates feature importance:

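estimator : object

A supervised learning estimator with a fit method that updates a coef_ attribute that holds the fitted parameters. Important features must correspond to high absolute values in the coef_ array.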
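To illustrate why the estimator needs coef_, here is a minimal pure-Python sketch of a single elimination step. It is not scikit-learn's implementation: the function name weakest_feature, the example weights, and the choice to score a feature by its summed squared weights across rows are all assumptions made for illustration.

```python
def weakest_feature(coef):
    """Return the index of the feature with the smallest overall weight.

    `coef` is a list of rows (one per class or class pair), each a list of
    per-feature weights, mirroring the 2-D shape of a coef_ array.
    """
    n_features = len(coef[0])
    # Score each feature by the sum of its squared weights over all rows,
    # so large positive and large negative weights both count as important.
    scores = [sum(row[j] ** 2 for row in coef) for j in range(n_features)]
    # The feature with the lowest score is the candidate for elimination.
    return min(range(n_features), key=scores.__getitem__)


# Hypothetical weights for 2 rows and 4 features (not real model output).
example_coef = [[0.9, -0.05, 0.3, -1.2],
                [-0.4, 0.02, 0.1, 0.8]]
weakest = weakest_feature(example_coef)
```

Without a coef_ array there is nothing to rank, so a step like this cannot run.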

According to the documentation, once the kernel is changed to RBF the SVC is no longer linear and the coef_ attribute becomes unavailable:

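coef_

array, shape = [n_class-1, n_features]

Weights assigned to the features (coefficients in the primal problem). This is only available in the case of a linear kernel.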

SVC (source) raises that error when RFECV tries to access coef_ and the kernel is not linear.
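The behavior can be mimicked with a small stand-in class. This is a sketch, not scikit-learn's source: the class ToySVC, its _weights attribute, and its placeholder fit method are hypothetical, and only the error message is taken from the question.

```python
class ToySVC:
    """Hypothetical stand-in that exposes coef_ only for a linear kernel."""

    def __init__(self, kernel="rbf"):
        self.kernel = kernel
        self._weights = None

    def fit(self, X, y):
        # Placeholder "training": store one zero weight per feature.
        self._weights = [[0.0] * len(X[0])]
        return self

    @property
    def coef_(self):
        # With a non-linear kernel there are no per-feature weights in the
        # input space, so reading the attribute fails.
        if self.kernel != "linear":
            raise ValueError(
                "coef_ is only available when using a linear kernel")
        return self._weights


X = [[0.0, 1.0], [1.0, 0.0]]
y = [0, 1]

# The linear kernel exposes per-feature weights.
linear_coef = ToySVC(kernel="linear").fit(X, y).coef_

# The rbf kernel raises as soon as coef_ is read, which is what a
# feature-ranking routine would do right after fitting.
try:
    ToySVC(kernel="rbf").fit(X, y).coef_
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

In this sketch the fit itself succeeds with either kernel; the failure only appears when something reads coef_, which matches the traceback appearing inside rfecv.fit rather than in the SVC constructor.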
