I have been using Ray to parallelize my code on a remote Linux server. After running for a while, the job stops with the following error:

ray.exceptions.WorkerCrashedError: The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
2021-08-19 08:39:21,246 WARNING worker.py:1189 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. Task ID: c2ac2060eccbb2f78749315d34dda4c52ed9dbf9f1b576b3 Worker ID: a8bbd438a7e16a4de793a02757a8f917668b6cdc2a69a4573b0b9544 Node ID: 4c0199e71ecf4dc12e46ac03e26a9919308ec31899a0fd1dcc93c063 Worker IP address: 134.58.41.155 Worker port: 38967 Worker PID: 4038515

Digging a little deeper, I found this in the log file of one of the workers:

*** SIGFPE received at time=1629355161 on cpu 6 ***
(pid=4038515) PC: @     0x7f7570f6e5d4  (unknown)  mpz_manager<>::machine_div()
(pid=4038515)     @     0x7f7f09f77420  (unknown)  (unknown)
(pid=4038515)     @     0x7ffc84f0c350  (unknown)  (unknown)
(pid=4038515)     @ ... and at least 1 more frames

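I run into the same problem with other parallelization libraries such as Dask or Scoop. I also tried a Google Cloud server, and the problem is the same there.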

Interestingly, when I run the same code with exactly the same parallelization on my local Mac, it runs fine.
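Since the crash also shows up under Dask and Scoop, and the top frame in the worker log (machine_div) is in native code, the fault is probably in the task itself rather than in Ray. Here is a minimal sketch of how one failing call could be isolated, assuming a POSIX system: it runs a snippet in a fresh interpreter with Python's faulthandler enabled, so a SIGFPE produces a Python-level traceback on stderr. The names run_isolated and died_from_sigfpe are hypothetical helpers, and the crashing snippet is only a stand-in that sends SIGFPE to itself, not my real task.

```python
import signal
import subprocess
import sys


def run_isolated(source: str, timeout: float = 60.0):
    """Run `source` in a fresh interpreter with faulthandler enabled.

    Returns (returncode, stderr). On POSIX, a negative returncode -N
    means the child process was terminated by signal N.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stderr


def died_from_sigfpe(returncode: int) -> bool:
    """True if the child was killed by SIGFPE (POSIX convention)."""
    return returncode == -signal.SIGFPE


if __name__ == "__main__":
    # Stand-in for the real task (assumption): the child sends SIGFPE to
    # itself so the sketch can be tried without the actual workload.
    crashing_task = "import os, signal; os.kill(os.getpid(), signal.SIGFPE)"
    code, err = run_isolated(crashing_task)
    print("killed by SIGFPE:", died_from_sigfpe(code))
    print(err)  # faulthandler's traceback dump for the child
```

Replacing the stand-in snippet with a call to the real task on a single input should show which Python line was executing when the signal arrived, which may help explain why the Linux build behaves differently from the Mac one.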

Any pointers would be greatly appreciated!

Thanks