I am using Dataproc and running Spark commands on the cluster through spark-shell. I frequently get error/warning messages indicating that I have lost contact with executors. The messages look like this:
[Stage 6:> (0 + 2) / 2]16/01/20 10:10:24 ERROR org.apache.spark.scheduler.cluster.YarnScheduler: Lost executor 5 on spark-cluster-femibyte-w-0.c.gcebook-1039.internal: remote Rpc client disassociated
16/01/20 10:10:24 WARN akka.remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark-cluster-femibyte-w-0.c.gcebook-1039.internal:60599] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
16/01/20 10:10:24 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.2 in stage 6.0 (TID 17, spark-cluster-femibyte-w-0.c.gcebook-1039.internal): ExecutorLostFailure (executor 5 lost)
16/01/20 10:10:24 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.2 in stage 6.0 (TID 16, spark-cluster-femibyte-w-0.c.gcebook-1039.internal): ExecutorLostFailure (executor 5 lost)
...
Here is another example:
16/01/20 10:51:43 ERROR org.apache.spark.scheduler.cluster.YarnScheduler: Lost executor 2 on spark-cluster-femibyte-w-1.c.gcebook-1039.internal: remote Rpc client disassociated
16/01/20 10:51:43 WARN akka.remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark-cluster-femibyte-w-1.c.gcebook-1039.internal:58745] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
16/01/20 10:51:43 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in stage 4.0 (TID 5, spark-cluster-femibyte-w-1.c.gcebook-1039.internal): ExecutorLostFailure (executor 2 lost)
16/01/20 10:51:43 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 4.0 (TID 4, spark-cluster-femibyte-w-1.c.gcebook-1039.internal): ExecutorLostFailure (executor 2 lost)
16/01/20 10:51:43 WARN org.apache.spark.ExecutorAllocationManager: Attempted to mark unknown executor 2 idle
Is this normal? Is there anything I can do to prevent this from happening?
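
For reference, the setup is roughly the following sketch (the zone and any options shown are placeholders rather than my exact commands; spark-cluster-femibyte is the cluster name visible in the hostnames above):

# Create a Dataproc cluster with default settings (zone is a placeholder)
gcloud dataproc clusters create spark-cluster-femibyte --zone us-central1-a
# SSH into the master node and launch spark-shell there
gcloud compute ssh spark-cluster-femibyte-m --zone us-central1-a
spark-shell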