python - How do I use oversampling and the cross_validate function together?

Question

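I am trying to use the cross_validate function together with SMOTE in a classification problem, and I would like to know how to do this correctly.

This is the simple function I use to run cross-validation with a machine learning classification algorithm: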

from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB

def bayes(dataIn, dataOut, cv, statistic):
    # training method
    naive_bayes = GaussianNB()

    # applying the method
    outputBayes = cross_validate(estimator=naive_bayes,
                                 X=dataIn, y=dataOut,
                                 cv=cv, scoring=statistic)

    return outputBayes

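I went through the cross_validate documentation to see whether I could define the training and test sets before calling cross_validate, instead of passing in the complete dataInput and dataOutput. I want to use SMOTE, and to do that I need to split the dataset before the cross-validation runs. If I apply SMOTE to the whole dataset, the results will be biased.

How can I solve this? Should I write my own cross-validation function? I would rather not, because the return value of cross_validate is very convenient and I do not know how to reproduce exactly the same return.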

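I have seen other questions about this, but none of them covers this specific problem:

SMOTE oversampling and cross-validation

Function for cross validation and oversampling (SMOTE)

Does oversampling happen before or after cross-validation using imblearn pipelines?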

score 0 · Accepted Answer

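The third link actually describes what you want. Given the results of that post, oversampling should be done inside each fold of the cross-validation process. This is what happens when you use the imblearn package and its Pipeline: the sampler is applied only to the training portion of each fold, so the held-out data is never resampled. You only need to specify your oversampling technique (SMOTE) and your model (GaussianNB()). A quick adaptation of the code from the third link shows roughly what you want.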

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB

model = Pipeline([
        ('sampling', SMOTE()),  # this is the oversampling process
        ('classification', GaussianNB())  # this is where to specify the model
    ])


param_dist = {...}  # review the documentation for the correct set of params

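from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# scorer_cv_cost_savings, X_train and y_train are not defined in this snippet;
# supply your own scorer and training data
random_search = RandomizedSearchCV(model,
                                   param_dist,
                                   cv=StratifiedKFold(n_splits=5),
                                   n_iter=10,
                                   scoring=scorer_cv_cost_savings)
random_search.fit(X_train.values, y_train)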
