apache-spark - Spark jobs on Kubernetes - executors are being terminated

Question

We are running Spark jobs on Kubernetes (EKS, not EMR) using the Spark operator. After some time, some of the executors receive a SIGTERM. Here is a sample log from one executor (newest entries first):

Feb 27 19:44:10.447 s3a-file-system metrics system stopped.
Feb 27 19:44:10.446 Stopping s3a-file-system metrics system...
Feb 27 19:44:10.329 Deleting directory /var/data/spark-05983610-6e9c-4159-a224-0d75fef2dafc/spark-8a21ea7e-bdca-4ade-9fb6-d4fe7ef5530f
Feb 27 19:44:10.328 Shutdown hook called
Feb 27 19:44:10.321 BlockManager stopped
Feb 27 19:44:10.319 MemoryStore cleared
Feb 27 19:44:10.284 RECEIVED SIGNAL TERM
Feb 27 19:44:10.169 block read in memory in 306 ms. row count = 113970
Feb 27 19:44:09.863 at row 0. reading next block
Feb 27 19:44:09.860 RecordReader initialized will read a total of 113970 records.

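On the driver side, about two minutes later, the driver reports that it has received no heartbeats for longer than the 120000 ms timeout and decides to kill the executors: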

Feb 27 19:46:12.155 Asked to remove non-existent executor 37
Feb 27 19:46:12.155 Removal of executor 37 requested
Feb 27 19:46:12.155 Trying to remove executor 37 from BlockManagerMaster.
Feb 27 19:46:12.154 task 2463.0 in stage 0.0 (TID 2463) failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
Feb 27 19:46:12.154 Executor 37 on 172.16.52.23 killed by driver.
Feb 27 19:46:12.153 Trying to remove executor 44 from BlockManagerMaster.
Feb 27 19:46:12.153 Asked to remove non-existent executor 44
Feb 27 19:46:12.153 Removal of executor 44 requested
Feb 27 19:46:12.153 Actual list of executor(s) to be killed is 37
Feb 27 19:46:12.152 task 2595.0 in stage 0.0 (TID 2595) failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
Feb 27 19:46:12.152 Executor 44 on 172.16.55.46 killed by driver.
Feb 27 19:46:12.152 Requesting to kill executor(s) 37
Feb 27 19:46:12.151 Actual list of executor(s) to be killed is 44
Feb 27 19:46:12.151 Requesting to kill executor(s) 44
Feb 27 19:46:12.151 Removing executor 37 with no recent heartbeats: 160277 ms exceeds timeout 120000 ms
Feb 27 19:46:12.151 Removing executor 44 with no recent heartbeats: 122513 ms exceeds timeout 120000 ms
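
The two "no recent heartbeats" lines above carry enough information to estimate when the driver last heard from each executor: subtract the reported silence from the timestamp of the log line. The following is a minimal sketch of that arithmetic, not part of Spark. The function name last_heartbeats and the placeholder year are assumptions made for illustration (the log does not include a year), and the result is only meaningful if the driver and executor clocks are roughly in sync.

```python
import re
from datetime import datetime, timedelta

# Matches the driver's heartbeat-timeout lines as they appear in the log above.
PATTERN = re.compile(
    r"(\w{3} \d{2} \d{2}:\d{2}:\d{2}\.\d{3}) Removing executor (\d+) "
    r"with no recent heartbeats: (\d+) ms exceeds timeout \d+ ms"
)

# Assumption: the log omits the year, so a placeholder is used only for parsing.
ASSUMED_YEAR = 2000


def last_heartbeats(lines):
    """Map executor id -> estimated time of its last heartbeat (HH:MM:SS.mmm)."""
    result = {}
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # not a heartbeat-timeout line
        stamp, executor_id, silent_ms = match.groups()
        logged_at = datetime.strptime(
            f"{ASSUMED_YEAR} {stamp}", "%Y %b %d %H:%M:%S.%f"
        )
        # Last heartbeat = time of the log line minus the reported silence.
        last_seen = logged_at - timedelta(milliseconds=int(silent_ms))
        result[executor_id] = last_seen.strftime("%H:%M:%S.%f")[:-3]
    return result


driver_log = [
    "Feb 27 19:46:12.151 Removing executor 37 with no recent heartbeats: "
    "160277 ms exceeds timeout 120000 ms",
    "Feb 27 19:46:12.151 Removing executor 44 with no recent heartbeats: "
    "122513 ms exceeds timeout 120000 ms",
]

print(last_heartbeats(driver_log))
```

For the two lines shown, this gives 19:43:31.874 for executor 37 and 19:44:09.638 for executor 44. The second value falls less than a second before the RECEIVED SIGNAL TERM entry at 19:44:10.284 in the sample executor log, which suggests the heartbeats stopped because the executor was already shutting down, rather than the driver's kill being the cause of the termination.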

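I tried to work out whether we are exceeding some resource limit at the Kubernetes level, but could not find anything of that kind. What can I look for to understand why Kubernetes is killing the executors?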

Follow-up:

I had missed a log message on the driver side:

Mar 01 21:04:23.471 Disabling executor 50.

And then on the executor side:

Mar 01 21:04:23.348 RECEIVED SIGNAL TERM

I looked for the class that writes the "Disabling executor" log message and found KubernetesDriverEndpoint. It seems that its onDisconnected method is called for all of these executors, and that method calls disableExecutor in DriverEndpoint. So the question now is why these executors are considered disconnected. The explanation at https://books.japila.pl/apache-spark-internals/scheduler/DriverEndpoint/#ondisconnected-callback says:

Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

But I could not find any WARN logs on the driver side. Any suggestions?
