
I am running a Spark job (from a Spark notebook) with dynamic allocation enabled, using the options

"spark.master": "yarn-client",
"spark.shuffle.service.enabled": "true",
"spark.dynamicAllocation.enabled": "true",
"spark.dynamicAllocation.executorIdleTimeout": "30s",
"spark.dynamicAllocation.cachedExecutorIdleTimeout": "1h",
"spark.dynamicAllocation.minExecutors": "0",
"spark.dynamicAllocation.maxExecutors": "20",
"spark.executor.cores": "2"

(Note: I am not sure whether the problem is actually caused by dynamicAllocation.)

I am using Spark version 1.6.1.

If I cancel the running job/application (either by pressing the cancel button on the notebook cell, or by shutting down the notebook server and the application) and restart the same application shortly afterwards (within a few minutes), I often get the following exception:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage 2.0 (TID 38, i89810.sbb.ch): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_3_piece0 of broadcast_3
         at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1222)
         at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
         at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
         at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
         at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
         at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
         at org.apache.spark.scheduler.Task.run(Task.scala:89)
         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
         at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to get broadcast_3_piece0 of broadcast_3
         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
         at scala.Option.getOrElse(Option.scala:120)
         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:137)
         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
         at scala.collection.immutable.List.foreach(List.scala:318)
         at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:120)
         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:175)
         at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1219)
         ... 11 more

Using the Yarn ResourceManager, I verified before resubmitting the job that the old job was no longer running. Still, I suspect the problem arises because the killed job has not yet been fully cleaned up and interferes with the newly launched one?

Has anyone run into the same problem and knows how to resolve it?


1 Answer


When dynamic allocation is enabled, you need to set up the external shuffle service. Otherwise, shuffle files are deleted when an executor is removed, which is why the `Failed to get broadcast_3_piece0 of broadcast_3` exception is thrown.

For more on this, see the Dynamic Resource Allocation section of the official Spark documentation.
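As a sketch of what the documentation describes for YARN: setting `spark.shuffle.service.enabled` on the application side is not enough; the external shuffle service must also be registered as an auxiliary service on every NodeManager (the `mapreduce_shuffle` value below is an assumption about an existing Hadoop setup and should match your cluster's current configuration):

```xml
<!-- yarn-site.xml on each NodeManager (sketch, per the Spark 1.6 docs) -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <!-- append spark_shuffle to whatever aux-services are already configured -->
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
```

In addition, the `spark-<version>-yarn-shuffle.jar` shipped with Spark has to be on the NodeManager classpath, and the NodeManagers must be restarted for the change to take effect.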

Answered at 2016-10-25T20:03:13.183