performance - XGBoost has high AUC (>0.9) on the training dataset but low AUC (<0.7) on the test/validation dataset

Question

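I recently used xgboost to predict a binary target. The dataset has 42k rows and about 300 features. The target rate is 1%. I split it into training (70%) and test (30%) datasets. I tuned only 10 parameters (some of which were not actually used), including: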

learning_rate: 0.01, 0.1
n_estimators: 60, 80, 100
booster: 'gbtree'
max_depth: 4, 6, 8
min_child_weight: 5, 15, 25
gamma: 3, 4, 5
max_delta_step: 0
subsample: 0.8, 1
reg_lambda: 0.8, 0.9
reg_alpha: 0.1, 0.2
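
To make the size of this search concrete, here is a minimal sketch that enumerates every combination of the values listed above. It assumes an exhaustive grid search, which the question does not state explicitly, and `expand_grid` is a hypothetical helper written for this illustration, not part of xgboost.

```python
from itertools import product

# Candidate values copied from the parameter list above.
param_grid = {
    "learning_rate": [0.01, 0.1],
    "n_estimators": [60, 80, 100],
    "booster": ["gbtree"],
    "max_depth": [4, 6, 8],
    "min_child_weight": [5, 15, 25],
    "gamma": [3, 4, 5],
    "max_delta_step": [0],
    "subsample": [0.8, 1],
    "reg_lambda": [0.8, 0.9],
    "reg_alpha": [0.1, 0.2],
}


def expand_grid(grid):
    """Yield one dict per combination of candidate values (hypothetical helper)."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


combos = list(expand_grid(param_grid))

# Parameters with a single candidate value are fixed, not tuned.
tuned = [name for name, values in param_grid.items() if len(values) > 1]

print(len(combos))  # 1296 combinations
print(tuned)        # the 8 parameters that actually vary
```

Under that assumption, only 8 of the 10 parameters vary; booster and max_delta_step each have a single value.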

I also used early_stopping_rounds = 20 and eval_metric = 'auc'. All other XGBoost parameters were left at their defaults.
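
For readers unfamiliar with early stopping, the following sketch shows the general idea in plain Python: training stops once the validation metric has gone a fixed number of rounds without improving. It is an illustration only, not xgboost's implementation; `early_stop` is a hypothetical helper and the score curve is made up.

```python
def early_stop(val_scores, patience=20):
    """Return (best_round, stop_round) for a higher-is-better metric.

    Stops once `patience` rounds have passed without beating the best
    score seen so far. Hypothetical helper for illustration.
    """
    best_round, best_score = 0, float("-inf")
    for rnd, score in enumerate(val_scores):
        if score > best_score:
            best_round, best_score = rnd, score
        elif rnd - best_round >= patience:
            return best_round, rnd
    # Never triggered: training ran through every round.
    return best_round, len(val_scores) - 1


# Made-up validation AUC curve: improves for 10 rounds, then slowly degrades.
scores = [0.60 + 0.01 * i for i in range(10)] + [0.68 - 0.001 * i for i in range(50)]

best, stopped = early_stop(scores, patience=20)
print(best, stopped)  # 9 29
```

Here the best round is 9 and training halts at round 29, 20 rounds after the last improvement.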

The AUC on the training dataframe is much higher than the AUC on the testing dataframe. An example is given below:

------ train_ks_xgb is:  0.7261393336193693
------ AUC of training dataframe is:  0.9324993491708145
------ test_ks_xgb is:  0.3024282337048294 
------ AUC of testing dataframe is:  0.6908696386355961
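
For reference, here is one way the two reported metrics can be computed from labels and predicted scores. This sketch assumes "ks" means the Kolmogorov-Smirnov statistic, taken as the largest gap between the true positive rate and the false positive rate; the question does not define it. The function names and the four-row dataset are made up for illustration.

```python
def roc_points(y_true, y_score):
    """Return cumulative (fpr, tpr) pairs, sweeping the threshold from high to low."""
    pairs = sorted(zip(y_score, y_true), key=lambda p: -p[0])
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        # Consume all rows sharing this score so ties form one ROC step.
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_from_points(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def ks_from_points(points):
    """Largest absolute gap between TPR and FPR over all thresholds."""
    return max(abs(tpr - fpr) for fpr, tpr in points)


# Made-up toy data: 1 = positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, y_score)
print(auc_from_points(points))  # 0.75
print(ks_from_points(points))   # 0.5
```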

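This indicates severe overfitting, even though I set several parameters intended to prevent it. I don't know how to control the overfitting further, or which parameters I should consider tuning.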

By the way, I also ran a random forest model, which gave a better AUC on the testing dataframe.

------ train_ks_rf is:  0.6561640721752957
------ AUC of the training dataframe is:  0.9102790944806666
------ test_ks_rf is:  0.4510942249240122
------ AUC of the testing dataframe is:  0.7995744680851065

The parameters of the random forest model are:

n_estimators = 50
cri = 'entropy'
max_depth = 10
min_samples_split = 500
min_samples_leaf = 200
max_features = 'auto'
random_state = 0
max_leaf_nodes = 100
class_weight = None

In my experience, xgboost should perform better than random forest. Can anyone give me some advice on how to improve the xgboost model?

Many thanks.
