python-2.7 - Distributed TensorFlow example does not work with TensorFlow 0.9

Question

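I am trying this distributed TensorFlow tutorial on my own computer, with the same operating system and Python version. I created the first script and ran it in a terminal, then I opened another terminal, ran the second script, and got the following error: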

E0629 10:11:01.979187251   15265 tcp_server_posix.c:284]     bind addr=[::]:2222: Address already in use
E0629 10:11:01.979243221   15265 server_chttp2.c:119]        No address added out of total 1 resolved
Traceback (most recent call last):
  File "worker0.py", line 7, in <module>
    task_index=0)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/server_lib.py", line 142, in __init__
    server_def.SerializeToString(), status)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors.py", line 450, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors.InternalError: Could not start gRPC server

I get a similar error when I try the official distributed tutorial.

Edit: I tried this on another machine with the same packages installed, and now I get the following error log:

E0629 11:17:44.500224628   18393 tcp_server_posix.c:284]     bind addr=[::]:2222: Address already in use
E0629 11:17:44.500268362   18393 server_chttp2.c:119]        No address added out of total 1 resolved
Segmentation fault (core dumped)

What could be the problem?

score 3 · Accepted Answer

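The problem is probably that you are using the same port number (2222) for both workers. On any given host, a port number can be used by only one process at a time. That is what the error "bind addr=[::]:2222: Address already in use" means.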
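To see the same failure without TensorFlow, here is a minimal sketch using only the standard `socket` module. The helper name `try_bind` is made up for this illustration, and the sketch assumes the operating system's default behaviour (no `SO_REUSEADDR` or similar options set). It binds one listening socket, then tries to bind a second one to the same port, which is roughly what happens when two servers are both told to listen on 2222.

```python
import errno
import socket


def try_bind(port, host="127.0.0.1"):
    """Bind and listen on (host, port).

    Returns the listening socket, or None if the address is already in use.
    (Hypothetical helper for illustration; not part of TensorFlow.)
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
        return sock
    except socket.error as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            return None
        raise


# Port 0 asks the OS for any free port, so the sketch does not depend on 2222 being free.
first = try_bind(0)
port = first.getsockname()[1]

# A second bind to the same port fails while the first socket is still open.
second = try_bind(port)
print("second bind succeeded:", second is not None)

first.close()
```

If the first process is still running, the only fixes are to stop it or to give the second process a different port.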

My guess is that either "localhost:2222" appears twice in your cluster specification, or you have specified the same task_index for both tasks.
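As a rough way to check for the first cause, the sketch below scans a cluster dictionary for repeated addresses. The job name "worker", the second port 2223, and the helper `find_duplicate_addresses` are assumptions made for this example, since the question does not show the actual scripts.

```python
def find_duplicate_addresses(cluster):
    """Return a sorted list of host:port strings listed more than once.

    `cluster` maps job names to lists of addresses, the same shape that is
    passed to tf.train.ClusterSpec. (Hypothetical helper for illustration.)
    """
    seen = set()
    duplicates = set()
    for addresses in cluster.values():
        for address in addresses:
            if address in seen:
                duplicates.add(address)
            seen.add(address)
    return sorted(duplicates)


# Both tasks share one port: the second server cannot bind to it.
broken = {"worker": ["localhost:2222", "localhost:2222"]}

# Each task gets its own port (2223 is an arbitrary choice).
fixed = {"worker": ["localhost:2222", "localhost:2223"]}

print(find_duplicate_addresses(broken))
print(find_duplicate_addresses(fixed))

# With a dictionary like `fixed`, each script would then start its own task,
# assuming the job is named "worker":
#   cluster = tf.train.ClusterSpec(fixed)
#   server = tf.train.Server(cluster, job_name="worker", task_index=0)  # first script
#   server = tf.train.Server(cluster, job_name="worker", task_index=1)  # second script
```

The second cause is not visible in the dictionary itself: if both scripts pass task_index=0, they both pick the first address in the list and collide even when the addresses are distinct.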

I hope this helps!
