
I am using RFECV from scikit-learn for feature selection. I want to compare the results of a simple linear model (X, y) with those of a log-transformed model (X, log(y)).

Simple model: RFECV and cross_val_score give the same result (the sanity check is to compare the mean cross-validation score over all folds with the RFECV score using all features: 0.66 = 0.66, so the result is reliable).

Log model problem: RFECV does not seem to offer a way to transform y. The scores in this case are 0.55 vs 0.53. That is expected, though, because I have to apply np.log manually before fitting: log_selector = log_selector.fit(X, np.log(y)). The r2 score is then computed for y = log(y) with no inverse_func, whereas what I need is a way to fit the model on log(y_train) and compute the score against y_test on the original scale (i.e. back-transform the predictions with exp). Alternatively, if I try TransformedTargetRegressor, I get the error shown in the code: The classifier does not expose "coef_" or "feature_importances_" attributes.
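
To make the intended scoring concrete, here is a rough manual sketch of what I mean (not a proposed solution; it re-creates the same data so it can be run on its own):

# A rough, manual version of the scoring I am after: fit on log(y_train),
# back-transform the predictions with exp, and compute r2 against the untouched y_test.
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
for train_idx, test_idx in KFold(n_splits=5).split(X):
    model = LinearRegression().fit(X[train_idx], np.log(y[train_idx]))
    pred = np.exp(model.predict(X[test_idx]))       # predictions back on the original scale
    print(round(r2_score(y[test_idx], pred), 2))    # scored against y_test, not log(y_test)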

How can I fix this and make sure the feature selection process is reliable?

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
from sklearn.compose import TransformedTargetRegressor
import numpy as np

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = linear_model.LinearRegression()
log_estimator = TransformedTargetRegressor(regressor=linear_model.LinearRegression(),
                                                func=np.log,
                                                inverse_func=np.exp)
selector = RFECV(estimator, step=1, cv=5, scoring='r2')
selector = selector.fit(X, y)
###
# Attempt with TransformedTargetRegressor (fails inside RFECV):
# log_selector = RFECV(log_estimator, step=1, cv=5, scoring='r2')
# log_selector = log_selector.fit(X, y)
# RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes
###
# Manual workaround: fit on log(y) directly, but then r2 is computed on the log scale
log_selector = RFECV(estimator, step=1, cv=5, scoring='r2')
log_selector = log_selector.fit(X, np.log(y))

print("**Simple Model**")
print("RFECV, r2 scores: ", np.round(selector.grid_scores_,2))
scores = cross_val_score(estimator, X, y, cv=5)
print("cross_val, mean r2 score: ", round(np.mean(scores),2), ", same as RFECV score with all features") 
print("no of feat: ", selector.n_features_ )

print("**Log Model**")
log_scores = cross_val_score(log_estimator, X, y, cv=5)
print("RFECV, r2 scores: ", np.round(log_selector.grid_scores_,2))
print("cross_val, mean r2 score: ", round(np.mean(log_scores),2)) 
print("no of feat: ", log_selector.n_features_ )

Output:

**Simple Model**
RFECV, r2 scores:  [0.45 0.6  0.63 0.68 0.68 0.69 0.68 0.67 0.66 0.66]
cross_val, mean r2 score:  0.66 , same as RFECV score with all features
no of feat:  6

**Log Model**
RFECV, r2 scores:  [0.39 0.5  0.59 0.56 0.55 0.54 0.53 0.53 0.53 0.53]
cross_val, mean r2 score:  0.55
no of feat:  3

2 Answers


All you need to do is add those attributes to TransformedTargetRegressor:

class MyTransformedTargetRegressor(TransformedTargetRegressor):
    # Delegate the attributes that RFECV looks for to the fitted inner regressor
    @property
    def feature_importances_(self):
        return self.regressor_.feature_importances_

    @property
    def coef_(self):
        return self.regressor_.coef_

Then use it in your code:

log_estimator = MyTransformedTargetRegressor(regressor=linear_model.LinearRegression(),
                                             func=np.log,
                                             inverse_func=np.exp)
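
With the subclass in place, the RFECV call that failed in the question should go through; a sketch reusing X, y and the RFECV import from the question's code:

# The wrapper now exposes coef_ (taken from the fitted inner regressor), so RFECV can
# rank features, and scoring happens on the original y scale because
# TransformedTargetRegressor.predict applies inverse_func (np.exp here).
log_selector = RFECV(log_estimator, step=1, cv=5, scoring='r2')
log_selector = log_selector.fit(X, y)
print("no of feat: ", log_selector.n_features_)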
Answered 2019-10-08T17:00:59.580

One workaround for this problem is to make sure the coef_ attribute is exposed to the feature selection module RFECV. So basically you need to extend TransformedTargetRegressor and make sure it exposes the coef_ attribute. I created a subclass that extends TransformedTargetRegressor and exposes coef_, as shown below.

from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
from sklearn.compose import TransformedTargetRegressor
import numpy as np

class myestimator(TransformedTargetRegressor):

    def __init__(self, **kwargs):
        # Fixed configuration: a LinearRegression fitted on log(y), with predictions
        # mapped back to the original scale via exp; extra keyword arguments are ignored
        super().__init__(regressor=LinearRegression(), func=np.log, inverse_func=np.exp)

    def fit(self, X, y, **kwargs):
        super().fit(X, y, **kwargs)
        # Expose the fitted inner regressor's coefficients so RFECV can rank features
        self.coef_ = self.regressor_.coef_
        return self

Then you can build your code with myestimator as follows:

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = linear_model.LinearRegression()
log_estimator = myestimator(regressor=LinearRegression(),func=np.log,inverse_func=np.exp)

selector = RFECV(estimator, step=1, cv=5, scoring='r2')
selector = selector.fit(X, y)
log_selector = RFECV(log_estimator, step=1, cv=5, scoring='r2')
log_selector = log_selector.fit(X, y)

I have run your sample code and the results are shown below.

Sample output:

print("**Simple Model**")
print("RFECV, r2 scores: ", np.round(selector.grid_scores_,2))
scores = cross_val_score(estimator, X, y, cv=5)
print("cross_val, mean r2 score: ", round(np.mean(scores),2), ", same as RFECV score with all features") 
print("no of feat: ", selector.n_features_ )

print("**Log Model**")
log_scores = cross_val_score(log_estimator, X, y, cv=5)
print("RFECV, r2 scores: ", np.round(log_selector.grid_scores_,2))
print("cross_val, mean r2 score: ", round(np.mean(log_scores),2)) 
print("no of feat: ", log_selector.n_features_ )


**Simple Model**
RFECV, r2 scores:  [0.45 0.6  0.63 0.68 0.68 0.69 0.68 0.67 0.66 0.66]
cross_val, mean r2 score:  0.66 , same as RFECV score with all features
no of feat:  6
**Log Model**
RFECV, r2 scores:  [0.41 0.51 0.59 0.59 0.58 0.56 0.54 0.53 0.55 0.55]
cross_val, mean r2 score:  0.55
no of feat:  4
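
As a quick consistency check in the spirit of the question (a sketch reusing the variables above): since TransformedTargetRegressor back-transforms its predictions with inverse_func, both of the following are r2 scores on the original y scale, so the RFECV score with all 10 features can be compared directly with the mean cross_val_score of the log model.

print("cross_val, mean r2 score: ", round(np.mean(log_scores), 2))                        # 0.55
print("RFECV r2 score with all features: ", np.round(log_selector.grid_scores_[-1], 2))   # 0.55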

Hope this helps!

Answered 2019-10-07T18:06:46.170