我们在具有 8 个内核和 50GB 内存(单工作者)的单个节点上运行 spark 2.1.0 独立集群。
我们使用以下内存设置在集群模式下运行 spark 应用程序 -
--driver-memory = 7GB (default - 1core is used)
--worker-memory = 43GB (all remaining cores - 7 cores)
最近,我们观察到 executor 经常被 driver/master 杀死并重新启动。我发现下面的驱动程序日志 -
17/12/14 03:29:39 WARN HeartbeatReceiver: Removing executor 2 with no recent heartbeats: 3658237 ms exceeds timeout 3600000 ms
17/12/14 03:29:39 ERROR TaskSchedulerImpl: Lost executor 2 on 10.150.143.81: Executor heartbeat timed out after 3658237 ms
17/12/14 03:29:39 WARN TaskSetManager: Lost task 23.0 in stage 316.0 (TID 9449, 10.150.143.81, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 3658237 ms
17/12/14 03:29:39 WARN TaskSetManager: Lost task 9.0 in stage 318.0 (TID 9459, 10.150.143.81, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 3658237 ms
17/12/14 03:29:39 WARN TaskSetManager: Lost task 8.0 in stage 318.0 (TID 9458, 10.150.143.81, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 3658237 ms
17/12/14 03:29:39 WARN TaskSetManager: Lost task 5.0 in stage 318.0 (TID 9455, 10.150.143.81, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 3658237 ms
17/12/14 03:29:39 WARN TaskSetManager: Lost task 7.0 in stage 318.0 (TID 9457, 10.150.143.81, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 3658237 ms
应用程序不是那么占用内存,有几个连接并将数据集写入目录。相同的代码在 spark-shell 上运行,没有任何故障。
寻找集群调整或任何可以减少执行者被杀死的配置设置。