python - Skipping some transformation steps (related to over- and under-sampling) in an imbalanced-learn pipeline when predicting on the test dataset

Question

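For an imbalanced classification problem, I am using an imblearn pipeline together with sklearn's GridSearchCV (to find the best hyperparameters). The steps in the pipeline are:

1. Standardize each feature
2. Correct the class imbalance using ADASYN sampling
3. Train a random forest classifier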

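The hyperparameter search is run over this pipeline with GridSearchCV (using stratified cross-validation). The search space includes hyperparameters of both ADASYN and the random forest.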

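While the approach above works well for selecting the best hyperparameters during the train/validation splits, I believe it is wrong to apply the same pipeline when predicting on the test dataset.

The reason is that ADASYN sampling should not be used when predicting on the test dataset. The test dataset should be predicted as is, without any sampling. The prediction pipeline should therefore be:

1. Standardize each feature
2. ~~ADASYN sampling~~
3. Predict using the trained random forest classifier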

How can I use the sklearn/imblearn API to skip a specific transform in the pipeline in this way?

My code (expressing the same problem as described above):

import pandas as pd
from imblearn.pipeline import Pipeline as imbPipeline
from imblearn.over_sampling import ADASYN
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# Get data
df = pd.read_csv('train.csv')
y_col = 'output'
x_cols = [c for c in df.columns if c != y_col]

# Train and Test data sets
train, test = train_test_split(df, shuffle=True, stratify=df[y_col])

# Define pipeline of transforms and model
pl = imbPipeline([('std', StandardScaler()),
                  ('sample', ADASYN()),
                  ('rf', RandomForestClassifier())])

# Additional code to define params for grid-search omitted.
# params will contain hyper-parameters for ADASYN as well as random forest

# grid search cv
cv = GridSearchCV(pl, params, scoring='f1', n_jobs=-1)
cv.fit(train[x_cols], train[y_col])

# Now that the grid search has been done and the object cv contains the
# best hyper-parameters, I would like to test on test data set:

test_pred = cv.predict(test[x_cols])  # WRONG! No need to do ADASYN sampling!
