
We are trying to use XGBoost-Spark in our project and are running into an issue when training the model on large data; the same pipeline works fine on small data. The training stage runs for about 2 hours and almost all tasks complete around the same time. After roughly 1200 tasks have finished, all the remaining executors start failing, and we see the same error on every executor. Note: we are data engineers who are not familiar with machine learning and are trying to build a production version of a prototype created by our data scientists, so our understanding of ML concepts is quite limited.

Jars used - xgboost4j-spark-0.72-criteo-20180518_2.11.jar & xgboost4j-0.72-criteo-20180518_2.10-linux.jar

Error from the logs of one of the executors:

Container id: container_e109_1529510504264_41133_01_000223
Exit code: 255
Shell output: main : command provided 1
main : run as user is svccaddv
main : requested yarn user is svccaddv
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /u/applic/data/hdfs1/hadoop/yarn/local/nmPrivate/application_1529510504264_41133/container_e109_1529510504264_41133_01_000223/container_e109_1529510504264_41133_01_000223.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
Getting exit code file...
Creating script paths...


Container exited with a non-zero exit code 255. Last 4096 bytes of stderr 
:ter_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:45] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:47] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:47] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:47] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:47] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:50] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:50] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:50] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:50] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:52] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:52] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:52] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:52] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:55] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:55] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:55] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:55] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:57] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:57] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:57] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:57] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:59] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:59] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:59] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:59] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:50:02] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:50:02] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:50:02] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:50:02] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
Socket RecvAll Error:Connection reset by peer
Socket RecvAll Error:Connection reset by peer

Code snippet we are using:

import org.apache.spark.mllib.util.MLUtils
import ml.dmlc.xgboost4j.scala.spark.XGBoost

// Write the training data out in LibSVM format, then read it back as a DataFrame
MLUtils.saveAsLibSVMFile(newtrainingData.rdd, inputTrainPath)
val trainSess = spark.sqlContext.read.format("libsvm").option("numFeatures", "10").load(inputTrainPath)

val paramMap = List(
  "eta" -> 0.003,
  "max_depth" -> 6,
  "subsample" -> 0.8,
  "colsample_bytree" -> 0.8,
  "silent" -> 0,
  "numEarlyStoppingRounds" -> 100,
  "objective" -> "reg:linear").toMap

val numRound = 1500
// One XGBoost worker per input partition (~4044 partitions in our case)
val xgboostModel = XGBoost.trainWithDataFrame(trainSess, paramMap, numRound, nWorkers = trainSess.rdd.getNumPartitions, useExternalMemory = false)

Size of the table ~ 21 GB (stored as ORC with SNAPPY compression)
Size of the SVM files ~ 160 GB
Input size of the Spark stage used for training ~ 460 GB
Tasks spawned by the training stage - 4044
Executors - 515 (approximate - we use dynamic allocation)
Executor-cores - 4
Executor-mem - 4G
Executor-mem-overhead - 1200 MB
Driver-mem - 10G
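For reference, a minimal sketch of how the resource settings above could be expressed as Spark configuration in code (in practice we pass them on spark-submit/YARN; the app name is hypothetical, only the values come from our job):

import org.apache.spark.sql.SparkSession

// Sketch only: mirrors the resource settings listed above
val spark = SparkSession.builder()
  .appName("xgboost-spark-training")                     // hypothetical name
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.executor.cores", "4")
  .config("spark.executor.memory", "4g")
  .config("spark.yarn.executor.memoryOverhead", "1200")  // MB
  .config("spark.driver.memory", "10g")
  .getOrCreate()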


1 Answer


We found a workaround. We reduced the number of partitions using coalesce(), which reduced the number of tasks. Earlier we had used repartition() to reduce the partitions, but the job still failed. Even with coalesce, the job fails if the number of partitions goes above 1000. For a few medium-sized datasets the job did run fine with 1200 and 1500 partitions, but we settled on 1000 partitions and the job runs fine. Previously we had increased the partitions to 3k or 4k to raise parallelism and hence performance; with 1k partitions the performance is still acceptable. A minimal sketch of the workaround is shown below.
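The sketch assumes trainSess, paramMap and numRound are defined as in the question; the cap of 1000 partitions is simply the value that worked for us, not a general recommendation:

// Cap the number of partitions (and hence XGBoost workers) at 1000
val maxPartitions = 1000
val coalescedTrain = trainSess.coalesce(maxPartitions)

val xgboostModel = XGBoost.trainWithDataFrame(
  coalescedTrain,
  paramMap,
  numRound,
  nWorkers = coalescedTrain.rdd.getNumPartitions,  // now at most 1000
  useExternalMemory = false)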

For anyone looking for another workaround, refer to the suggestion given by the XGBoost team - https://github.com/dmlc/xgboost/issues/3462 (I have not tried it).

answered 2018-07-12T14:13:31.683