python - Splitting a dataset for K-fold cross-validation in Sci-Kit Learn

Question

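I have been assigned a task that requires creating a decision tree classifier and determining its accuracy using a training set and 10-fold cross-validation. I went over the documentation for cross_val_predict, as I believe this is the function I need.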

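The part I am having trouble with is splitting the dataset. As far as I know, in the usual case the train_test_split() function is used to split the dataset into two parts, train and test. From my understanding, for K-fold validation you need to further split the training set into K parts.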

My question is: do I need to split the dataset into train and test at the start at all?

score 4 · Accepted Answer

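It depends. My personal opinion is that yes, you should split your dataset into a training set and a test set, and then run K-fold cross-validation on the training set only. Why? Because it is useful to check the model, after training and tuning, on examples it has never seen.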

That said, some people only do cross-validation. Here is the workflow I often use:

from sklearn.model_selection import cross_val_score, train_test_split

# X, Y, model, metric, my_param and the model_with_* names are placeholders for your own objects

# Data partition: hold out 20% of the data as a test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=21)

# Cross-validate multiple models to see which one gives the best results
print('Start cross val')
cv_score = cross_val_score(model, X_train, Y_train, scoring=metric, cv=5)
# Then inspect the scores you just obtained using the mean, std or a plot
print('Mean CV-score : ' + str(cv_score.mean()))

# Then tune the hyperparameters of the best (or top-n best) model using another cross-val
for param in my_param:
    model = model_with_param  # placeholder: the model configured with `param`
    cv_score = cross_val_score(model, X_train, Y_train, scoring=metric, cv=5)
    print('Mean CV-score with param: ' + str(cv_score.mean()))

# Now that the best parameters are known, train the final model on the whole training set
model = model_with_best_parameters
model.fit(X_train, Y_train)

# And finally test the tuned model on the held-out test set
Y_pred = model.predict(X_test)
plot_or_print_metric(Y_pred, Y_test)
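
For a concrete, runnable take on the same workflow, the manual tuning loop can be handed to scikit-learn's GridSearchCV. This is only a sketch under assumptions: the iris dataset stands in for the asker's data, and the parameter grid is an illustrative choice, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris stands in for the real dataset.
X, y = load_iris(return_X_y=True)

# Hold out 20% as a test set that the tuning step never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=21, stratify=y
)

# Assumption: a small illustrative grid; pick values that suit your data.
param_grid = {"max_depth": [2, 3, 5, None], "min_samples_leaf": [1, 5]}

# 10-fold cross-validation on the training set for every parameter combination.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=21),
    param_grid,
    scoring="accuracy",
    cv=10,
)
search.fit(X_train, y_train)  # by default, refits the best model on all of X_train

print("Best parameters:", search.best_params_)
print("Mean CV accuracy of best model:", search.best_score_)

# Final check on the untouched test set.
test_accuracy = search.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

The cross-validated score guides the choice of parameters, while the test accuracy is computed once at the end on data that played no part in that choice.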

score 0 · Accepted Answer

Short answer: no.

Long answer: if you want to use K-fold validation, you usually do not split into train/test first.

There are many ways to evaluate a model. The simplest is a train/test split: fit the model on the train set and evaluate it on the test set.
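
As a minimal sketch of that simplest option, assuming the iris dataset as stand-in data and a default decision tree as the model:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris stands in for the real dataset.
X, y = load_iris(return_X_y=True)

# Single split: 80% for fitting, 20% for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Accuracy is measured only on samples the model did not see while fitting.
holdout_accuracy = accuracy_score(y_test, clf.predict(X_test))
print("Hold-out accuracy:", holdout_accuracy)
```

The estimate depends on which samples happen to land in the test set, which is the main reason to consider cross-validation instead.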

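If you go with cross-validation, the fitting and evaluation happen directly within each fold/iteration, so no separate up-front split is needed.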
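
For the task in the question (a decision tree scored with 10-fold cross-validation), a sketch could look like the following. The iris dataset is an assumption standing in for the real data; cross_val_score and cross_val_predict are the scikit-learn helpers that run the per-fold fitting and evaluation.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris stands in for the real dataset; no train/test split is made first.
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# One accuracy per fold: the model is refit on 9 folds and scored on the remaining one.
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean(), "+/-", scores.std())

# cross_val_predict returns, for every sample, the prediction made
# while that sample was in the held-out fold.
y_pred = cross_val_predict(clf, X, y, cv=10)
print("Accuracy from pooled predictions:", accuracy_score(y, y_pred))
```

cross_val_score is usually the more direct route to an accuracy figure; cross_val_predict is handy when the out-of-fold predictions themselves are needed, for example to build a confusion matrix.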

It is up to you which one to choose, but I would go with K-Folds or LOOCV (leave-one-out cross-validation).
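
LOOCV is the extreme case where every fold holds exactly one sample. A minimal sketch, again assuming iris as stand-in data, passes scikit-learn's LeaveOneOut splitter as the cv argument:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris stands in for the real dataset.
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# One fit per sample: train on n-1 samples, test on the single one left out.
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut(), scoring="accuracy")

# Each individual score is 0.0 or 1.0; the mean is the LOOCV accuracy.
print("Number of fits:", len(loo_scores))
print("LOOCV accuracy:", loo_scores.mean())
```

Because it fits the model once per sample, LOOCV gets expensive on large datasets, where K-Folds with K=5 or K=10 is the more common choice.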

The figure summarizes the K-Folds procedure (for K=5):

[Figure: K-Folds procedure for K=5]
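
The fold layout for K=5 can also be printed directly. This is a small sketch on ten toy samples (an assumption made purely for illustration), using scikit-learn's KFold without shuffling:

```python
import numpy as np
from sklearn.model_selection import KFold

# Assumption: ten toy samples, indexed 0..9, purely for illustration.
X = np.arange(10).reshape(-1, 1)

# Without shuffling, KFold cuts the data into 5 consecutive blocks.
kf = KFold(n_splits=5)

folds = []
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    folds.append((train_idx, test_idx))
    print(f"Fold {i}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

In each iteration one block of two samples serves as the test fold and the other eight samples are used for fitting, so every sample is used for testing exactly once.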
