
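For an imbalanced classification problem, I am using an imblearn pipeline together with sklearn's GridSearchCV (to find the best hyperparameters). The steps in the pipeline are:

  1. Standardize each feature
  2. Correct the class imbalance using ADASYN sampling
  3. Train a random forest classifier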

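The hyperparameter search is run over this pipeline using GridSearchCV (with stratified CV). The search space includes hyperparameters from both ADASYN and the random forest.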

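While this approach works well for selecting the best hyperparameters during the train/validation splits, I think it is wrong to apply the same pipeline when predicting on the test data set.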

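The reason is that ADASYN sampling should not be used when predicting on the test data set. The test data should be predicted as-is, without any resampling. The pipeline for prediction should therefore be:

  1. Standardize each feature
  2. ADASYN sampling (skipped: no resampling at prediction time)
  3. Predict using the trained random forest classifier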

How can I skip a specific transform in the pipeline in this way using the sklearn/imblearn API?

My code (expressing the same problem as above):

import pandas as pd
from imblearn.pipeline import Pipeline as imbPipeline
from imblearn.over_sampling import ADASYN
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# Get data
df = pd.read_csv('train.csv')
y_col = 'output'
x_cols = [c for c in df.columns if c != y_col]

# Train and Test data sets
train, test = train_test_split(df, shuffle=True, stratify=df[y_col])

# Define pipeline of transforms and model
pl = imbPipeline([('std', StandardScaler()),
                  ('sample', ADASYN()),
                  ('rf', RandomForestClassifier())])

# Additional code to define params for grid-search omitted.
# params will contain hyper-parameters for ADASYN as well as random forest

# grid search cv
cv = GridSearchCV(pl, params, scoring='f1', n_jobs=-1)
cv.fit(train[x_cols], train[y_col])

# Now that the grid search has been done and the object cv contains the
# best hyper-parameters, I would like to test on test data set:

test_pred = cv.predict(test[x_cols])  # WRONG! No need to do ADASYN sampling!