
I'm running long jobs (3+ hours) on a large Spark cluster in YARN mode. The worker VMs running Spark are hosted on Google Cloud Dataproc, and most of them can be destroyed during execution (preemptible VMs, which cost less).

When this happens, the job fails because tasks fail on the destroyed worker, with this error in the container logs of the failing worker:

Executor is not registered

I've tried setting spark.task.maxFailures to 1000, but this doesn't seem to be very effective: even though the job finishes, the tasks don't appear to be automatically redistributed, and the computation for the tasks assigned to this specific worker seems to roll back to the initial stage.
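(For reference, a property like this can be passed at submit time, for example with gcloud dataproc jobs submit spark --properties; the cluster, class, and jar names below are just placeholders:)

gcloud dataproc jobs submit spark --cluster my-cluster \
  --properties spark.task.maxFailures=1000 \
  --class com.example.MyJob \
  --jars gs://my-bucket/my-job.jar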

Would there be a way of having a more fault-tolerant configuration that simply excludes unresponsive executors and re-assigns their tasks?

I could include the ResourceManager, NodeManager, and container logs if asked, but I don't think they would be relevant.


1 Answer


This appears to be a regression in how preemptible workers leave the cluster.

The issue is not just the intolerance of failures. Preemptible workers are continuously created and destroyed over the lifetime of the cluster. Each time a worker leaves, YARN waits 15 minutes of missed heartbeats before detecting the failure and re-creating the containers. This can make your job run significantly longer.
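(As a rough sketch only: if you wanted to shorten that wait, my assumption is that the relevant knob is yarn.nm.liveness-monitor.expiry-interval-ms, which could be lowered via cluster properties at creation time; this is not part of the fix described below.)

gcloud beta dataproc clusters create my-cluster \
  --properties yarn:yarn.nm.liveness-monitor.expiry-interval-ms=60000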

We will fix this in the next release.

Workaround:

The following forces the workers to leave the cluster cleanly on shutdown.

Create the following script and upload it to a GCS bucket:

#!/bin/sh
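# Rewrite the 'Stopping google-dataproc-agent' step of the init script so the
# agent is sent SIGUSR2 (and given 5s) to deregister the node before shutdown.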
sudo sed -i "s/.*Stopping google-dataproc-agent.*/start-stop-daemon --stop --signal USR2 --quiet --oknodo --pidfile \${AGENT_PID_FILE}; sleep 5s/" \
   /etc/init.d/google-dataproc-agent

Let's say you uploaded it to gs://my-bucket/fix.sh.

Now re-create your cluster with this initialization action:

gcloud beta dataproc clusters create my-cluster ... \
  --initialization-actions gs://my-bucket/fix.sh

You can verify this by SSHing into the master node and setting a watch on the YARN node list:

gcloud compute ssh my-cluster-m
watch yarn node -list

In another terminal, issue a cluster update command to decrease the number of workers, and verify that the number of YARN nodes changes accordingly, e.g. along the lines of the sketch below.
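(A sketch only; the cluster name and worker count are placeholders, and flag names can vary across gcloud releases:)

gcloud dataproc clusters update my-cluster --num-workers 2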

answered 2015-11-19T20:01:22.877