Sometimes our GKE-based training jobs that use TPUEstimator on TPUs fail with the following error:
Error recorded from infeed: Socket closed
An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: Socket closed
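I have two questions about this:
- What is actually happening here? I checked the pod's memory usage and it did not spike. The TPU assigned to the pod also still exists.
- The job does not always surface the error to the pod. It keeps showing as Running until someone manually checks the status and takes action to restart it. Is there any way to make it always restart automatically?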
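
To make the second question concrete, here is a minimal sketch of the kind of stall detection that could turn a silently hung job into a visible failure. It is only an illustration under assumptions, not something TPUEstimator provides: the `StallWatchdog` class, its `timeout_secs` parameter, and the idea of calling `report_progress()` from the training loop are all hypothetical names introduced here.

```python
import time


class StallWatchdog:
    """Tracks when progress was last reported and flags a stall (hypothetical helper)."""

    def __init__(self, timeout_secs, clock=time.monotonic):
        self.timeout_secs = timeout_secs
        self._clock = clock
        self._last_progress = clock()

    def report_progress(self):
        # Call whenever training advances, e.g. after a checkpoint is written.
        self._last_progress = self._clock()

    def is_stalled(self):
        # True once no progress has been reported for longer than the timeout.
        return self._clock() - self._last_progress > self.timeout_secs
```

If something like this ran in a background thread and exited the process with a non-zero status once `is_stalled()` returned true, the container would terminate and Kubernetes could restart it according to the pod's `restartPolicy`. Whether that is the right approach for a TPUEstimator job is part of what I am asking.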