I am trying to run a distributed TensorFlow example on CPU:

https://github.com/tmulc18/Distributed-TensorFlow-Guide/blob/master/Distributed-Setup/dist_setup.py

The commands to run the example can be found here:

https://github.com/tmulc18/Distributed-TensorFlow-Guide/blob/master/Distributed-Setup/run.sh

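It runs fine on a single platform (PC-PC, laptop-laptop, or RP (Raspberry Pi 3)-RP) and across multiple platforms with the same architecture (PC-laptop, both x86, or RP-RP, both arm64). However, a combination of arm64 and x86 fails on the arm64 side with the following error: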

2019-06-15 01:20:35.179745: F tensorflow/core/common_runtime/renamed_device.cc:27] Check failed: DeviceNameUtils::ParseFullName(new_base, &parsed_name) 

Note that the IPs in the code need to be set accordingly. The command on the PC is:

python dist_setup.py --job_name "worker" --task_index 0

Output:

2019-06-14 18:20:35.040413: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-06-14 18:20:35.070714: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3593265000 Hz
2019-06-14 18:20:35.071281: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x4c9ce60 executing computations on platform Host. Devices:
2019-06-14 18:20:35.071303: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-06-14 18:20:35.072829: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:252] Initialize GrpcChannelCache for job ps -> {0 -> 10.1.1.2:2222}
2019-06-14 18:20:35.072861: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:252] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2223}
2019-06-14 18:20:35.074703: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:391] Started server with target: grpc://localhost:2223
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-06-14 18:20:35.178858: I tensorflow/core/distributed_runtime/master_session.cc:1192] Start master session 3634afcffbd6cc2d with config: 
2019-06-14 18:20:45.214939: W tensorflow/core/distributed_runtime/master_session.cc:1363] Timeout for closing worker session
2019-06-14 18:20:55.218267: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2019-06-14 18:21:05.218392: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2019-06-14 18:21:15.218519: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0

The command on the RP is:

python dist_setup.py --job_name "ps" --task_index 0

Output:

/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/tensor_util.py:33: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  from tensorflow.python.framework import fast_tensor_util
2019-06-15 01:19:54.226102: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2019-06-15 01:19:54.226278: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 10.1.1.1:2223}
2019-06-15 01:19:54.227740: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222
2019-06-15 01:20:35.179745: F tensorflow/core/common_runtime/renamed_device.cc:27] Check failed: DeviceNameUtils::ParseFullName(new_base, &parsed_name) 
Aborted

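Any idea why I am getting this error? It seems to be raised immediately after the two servers connect: the RP aborts at almost the same moment the PC logs "Start master session".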
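For reference, the cluster layout implied by the logs above can be sketched as follows. This is a reconstruction from the `GrpcChannelCache` lines, not the exact contents of `dist_setup.py`; the names `CLUSTER` and `local_view` are hypothetical and only illustrate how each process reports its own task as `localhost`.

```python
# Addresses copied from the "Initialize GrpcChannelCache" log lines.
# Assumption: the same cluster definition is used on both machines.
CLUSTER = {
    "ps": ["10.1.1.2:2222"],      # Raspberry Pi 3 (arm64)
    "worker": ["10.1.1.1:2223"],  # PC (x86)
}


def local_view(cluster, job_name, task_index):
    """Return the cluster as one task logs it: its own entry shown as localhost:<port>."""
    view = {job: list(addrs) for job, addrs in cluster.items()}
    port = view[job_name][task_index].rsplit(":", 1)[1]
    view[job_name][task_index] = "localhost:" + port
    return view


if __name__ == "__main__":
    # In TensorFlow 1.x a dict like CLUSTER would typically be passed to
    # tf.train.ClusterSpec and then to tf.train.Server(..., job_name=..., task_index=...).
    print(local_view(CLUSTER, "worker", 0))  # view on the PC
    print(local_view(CLUSTER, "ps", 0))      # view on the RP
```

The two views match the channel lists in the PC and RP logs, so the addressing itself looks consistent on both sides.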
