I'm running long jobs (3+ hours) on a large Spark cluster in YARN mode. The worker VMs running Spark are hosted on Google Cloud Dataproc, and most of them can be destroyed during execution (preemptible VMs, which cost less).
When this happens, the job fails because tasks fail on the destroyed worker, with this error in the container logs of the failing worker:
Executor is not registered
I've tried setting spark.task.maxFailures to 1000, but this doesn't seem to be very effective: even though the job finishes, the tasks don't appear to be automatically redistributed, and the computation for the tasks assigned to that specific worker seems to roll back to the initial stage.
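For reference, here is roughly how I'm setting the property (a simplified sketch; the app name and context setup are placeholders, not my actual job):

    import org.apache.spark.{SparkConf, SparkContext}

    // Simplified sketch: allow many task failures before the stage is aborted
    val conf = new SparkConf()
      .setAppName("long-running-job")          // placeholder name
      .set("spark.task.maxFailures", "1000")   // the setting mentioned above
    val sc = new SparkContext(conf)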
Is there a way to get a more fault-tolerant configuration that simply excludes unresponsive executors and re-assigns their tasks?
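To be concrete about what I'm hoping for, I'm imagining something along the lines of the sketch below. I'm not sure the blacklist feature from newer Spark versions is the right mechanism here, or whether my setup supports it, so treat these property values as an assumption rather than something I've validated:

    // Hypothetical sketch of the "exclude bad executors" behaviour I'm after;
    // I haven't confirmed these blacklist settings apply to my Spark version.
    val conf = new SparkConf()
      .set("spark.blacklist.enabled", "true")                       // exclude executors with repeated failures
      .set("spark.blacklist.stage.maxFailedTasksPerExecutor", "2")  // threshold before exclusion within a stage
      .set("spark.task.maxFailures", "1000")                        // keep retrying tasks elsewhere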
I can include the ResourceManager, NodeManager, and container logs if needed, but I don't think they would be relevant.