
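I am using GridSearchCV from sklearn in Python, and I want to use it with a custom scoring function. The custom scoring function needs access to variables that are not in the model. The problem is that I cannot access the unscaled/unaltered variables from the training set, because they are not included in the model (which uses scaled data) and because the grid search randomly selects rows for each batch. Do you know how I could handle this?

I tried creating a scoring function that takes the original (unscaled, unaltered) training set as an argument. It works, but because the grid search only takes subsets of the training set and the rows are shuffled, I cannot "link" each row to its corresponding values in the original training set. I tried to unscale the data contained in the training set, but that did not work. I thought about adding the unscaled column I want to the scaled training set, but then how do I exclude it from the model?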

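# building pipelines
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler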

num_pipeline = Pipeline([
    ('std_scaler', StandardScaler()),
],verbose=True)
cat_pipeline = Pipeline([
    ('one_hot_enc',OneHotEncoder(sparse=False,handle_unknown='ignore')),
],verbose=True)

from sklearn.compose import ColumnTransformer
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, df_num_reg_attributes),
    ("cat", cat_pipeline, df_cat_attributes)
])

# fitting pipelines
X_train_prepared_reg = full_pipeline.fit_transform(X_res_df)
listColPrepared=np.concatenate((df_num_reg_attributes,full_pipeline.named_transformers_['cat'].named_steps['one_hot_enc'].get_feature_names()))
scalerX_train = full_pipeline.named_transformers_['num'].named_steps['std_scaler']
X_test_prepared_reg = full_pipeline.transform(X_test)
y_train = y_balanced

# scorer
def my_scorer(clf, X, y_true):
    DCWorkCost = 5.00
    OPWorkCost = 2.50
    mergedDataset = pd.DataFrame(data=X,index=np.arange(0,len(X)),columns=listColPrepared)
### this is the column I want -- I tried to unscale the data to access the column but it did not work    
    mergedDataset['Margin'] = scalerX_train.inverse_transform(mergedDataset['Margin'])
    mergedDataset['True'] = y_true
    mergedDataset['Pred'] = clf.predict(X)
 # rest of the scorer.........
    return revenue

# grid search
sgd_clf_cv = SGDClassifier(max_iter=5,tol=-np.infty, random_state=42)
parameters = {'class_weight':({0:.1,1:.9},{0:.2,1:.8},{0:.3,1:.7},{0:.25,1:.75},{0:.15,1:.85},{0:.35,1:.65},{0:.4,1:.6})}
grid = GridSearchCV(estimator=sgd_clf_cv, param_grid=parameters, scoring=my_scorer,verbose=10)
grid.fit(X_train_prepared_reg, y_train)
grid.best_estimator_

When I try to unscale the data as shown in the code, I get an error message about mismatched shapes.
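
One likely cause of the shape error, sketched here on made-up data: a fitted StandardScaler keeps one mean and one scale per column it was fitted on, so inverse_transform expects all of those columns rather than a single one. A single column can instead be unscaled by hand from the scaler's mean_ and scale_ attributes. The column names and values below are hypothetical stand-ins, not the question's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the numeric columns; "Margin" is column 1 here.
num_cols = ["Price", "Margin", "Quantity"]
X_num = np.array([[10.0, 2.0, 1.0],
                  [20.0, 4.0, 3.0],
                  [30.0, 9.0, 5.0]])

scaler = StandardScaler().fit(X_num)
X_scaled = scaler.transform(X_num)

# The scaler was fitted on three columns, so a single column is unscaled
# by hand with that column's own mean and scale.
i = num_cols.index("Margin")
margin_unscaled = X_scaled[:, i] * scaler.scale_[i] + scaler.mean_[i]
print(margin_unscaled)
```

This assumes the default StandardScaler settings (centering and scaling both enabled) and that the position of the column in the fitted data is known.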


1 Answer


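Having your own custom scoring function that can also access another constant object takes two steps.

  1. Your custom score function needs to be passed to make_scorer. The score function must have the form def f(y_true, y_predicted).
  2. Your score function needs a third, named argument, through which you can pass in additional objects.

In your case, the code should look something like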
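
As a minimal, self-contained sketch of these two steps: the score function below takes (y_true, y_pred) plus named arguments, and make_scorer forwards the extra keyword arguments on every call. The name cost_score, the fp_cost/fn_cost parameters, and the toy data are assumptions made for illustration; only the numeric values 5.00 and 2.50 are borrowed from the question's constants.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV


def cost_score(y_true, y_pred, fp_cost=1.0, fn_cost=1.0):
    """Negative total cost of the mistakes, so that higher is better."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    false_pos = np.sum((y_pred == 1) & (y_true == 0))
    false_neg = np.sum((y_pred == 0) & (y_true == 1))
    return -float(fp_cost * false_pos + fn_cost * false_neg)


# Keyword arguments given to make_scorer are passed to cost_score unchanged
# on every call; they are not sliced to match the rows of the current fold.
custom_score = make_scorer(cost_score, fp_cost=5.00, fn_cost=2.50)

# Hypothetical toy data, used only to make the sketch runnable.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
grid = GridSearchCV(
    SGDClassifier(random_state=42),
    param_grid={"class_weight": [{0: 0.3, 1: 0.7}, {0: 0.5, 1: 0.5}]},
    scoring=custom_score,
    cv=3,
)
grid.fit(X_demo, y_demo)
print(grid.best_params_, grid.best_score_)
```

Because the extra object is handed over whole, this pattern suits constants such as cost parameters; per-row data would still need some way of being matched to the rows of each fold.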

def my_scorer(y_true, y_pred, scaler=None):
    DCWorkCost = 5.00
    OPWorkCost = 2.50
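    # NOTE: `X` is not an argument of this function. make_scorer calls it with
    # y_true, y_pred and the extra keyword arguments only, so the matching
    # feature rows have to be supplied some other way.
    mergedDataset = pd.DataFrame(data=X, index=np.arange(0, len(y_true)), columns=listColPrepared)
    ### this is the column I want -- I tried to unscale the data to access the column but it did not work
    mergedDataset['Margin'] = scaler.inverse_transform(mergedDataset['Margin'])
    mergedDataset['True'] = y_true
    mergedDataset['Pred'] = y_pred
    # rest of the scorer.........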

    return revenue

...
scalerX_train = full_pipeline.named_transformers_['num'].named_steps['std_scaler']
...
sgd_clf_cv = SGDClassifier(max_iter=5,tol=-np.infty, random_state=42)
...
custom_score = make_scorer(my_scorer, scaler=scalerX_train)
...
grid = GridSearchCV(estimator=sgd_clf_cv, param_grid=parameters, scoring=custom_score, verbose=10)
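
A different route, not part of the code above and offered only as a hedged sketch: GridSearchCV also accepts a callable with the signature scorer(estimator, X, y). If the scaling is moved inside a Pipeline and the raw DataFrame is passed to fit, the scorer receives the unscaled rows of each fold, already aligned with y, so no unscaling or row matching is needed. The toy data, the column names other than Margin, and the revenue formula are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the question's training set.
rng = np.random.RandomState(42)
n = 120
X_raw = pd.DataFrame({
    "Margin": rng.uniform(1.0, 20.0, size=n),
    "Quantity": rng.randint(1, 10, size=n).astype(float),
    "Channel": rng.choice(["web", "shop"], size=n),
})
y = (X_raw["Margin"] + rng.normal(0, 3, size=n) > 10).astype(int)

# Scaling and encoding happen inside the pipeline, so they are refit per fold.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Margin", "Quantity"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Channel"]),
])
model = Pipeline([
    ("prep", preprocess),
    ("clf", SGDClassifier(random_state=42)),
])


def revenue_scorer(estimator, X, y_true):
    """Called as scorer(estimator, X, y); X holds the fold's raw rows."""
    pred = estimator.predict(X)
    margin = X["Margin"].to_numpy()  # still unscaled at this point
    hits = (pred == 1) & (np.asarray(y_true) == 1)
    # Illustrative revenue only: margin earned on correctly flagged rows.
    return float(margin[hits].sum())


grid = GridSearchCV(
    model,
    param_grid={"clf__class_weight": [{0: 0.3, 1: 0.7}, {0: 0.4, 1: 0.6}]},
    scoring=revenue_scorer,
    cv=3,
)
grid.fit(X_raw, y)
print(grid.best_params_, grid.best_score_)
```

With this layout, a column that should be visible to the scorer but not to the model can simply be left out of the ColumnTransformer lists, since unlisted columns are dropped by default.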
Answered 2019-07-08T19:16:50.743