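I can open a ctpu session and pull the code I need from my git repository, but when I run my TensorFlow code from Cloud Shell, I get a message saying that no TPU is available, and my program crashes. Here is the error message I get: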
adrien_doerig@adrien-doerig:~/capser$ python TPU_playground.py
(unset)
INFO:tensorflow:Querying Tensorflow master () for TPU system metadata.
2018-07-16 09:45:49.951310: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
INFO:tensorflow:Failed to find TPU: _TPUSystemMetadata(num_cores=0, num_hosts=0, num_of_cores_per_host=0, topology=None, devices=[_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456)])
Traceback (most recent call last):
File "TPU_playground.py", line 79, in <module>
capser.train(input_fn=train_input_fn_tpu, steps=n_steps)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 363, in train
hooks.extend(self._convert_train_steps_to_hooks(steps, max_steps))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2068, in _convert_train_steps_to_hooks
if ctx.is_running_on_cpu():
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_context.py", line 339, in is_running_on_cpu
self._validate_tpu_configuration()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_context.py", line 525, in _validate_tpu_configuration
'are {}.'.format(tpu_system_metadata.devices))
RuntimeError: Cannot find any TPU cores in the system. Please double check Tensorflow master address and TPU worker(s). Available devices are [_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU
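The empty parentheses in `Querying Tensorflow master ()` in the log above suggest that no master address reached the estimator. Below is a minimal sketch of a pre-flight check for that case. It assumes the TPU name is supplied through a `TPU_NAME` environment variable and that the TPU worker listens on gRPC port 8470. Both are assumptions, since the part of `TPU_playground.py` that configures the TPU is not shown. The helper names `tpu_name_from_env` and `grpc_master` are hypothetical and are not part of TensorFlow.

```python
import os


def tpu_name_from_env(environ=None, var="TPU_NAME"):
    """Return the TPU name from the environment, or None if it is unset or blank.

    `var` defaults to TPU_NAME, which is an assumption about how the name is
    passed to the script.
    """
    env = os.environ if environ is None else environ
    name = env.get(var, "").strip()
    return name or None


def grpc_master(ip, port=8470):
    """Format a gRPC master address from a TPU IP (port 8470 is assumed)."""
    return "grpc://{}:{}".format(ip, port)


if __name__ == "__main__":
    name = tpu_name_from_env()
    if name is None:
        # With no TPU name the master address stays empty, so TensorFlow
        # only sees the local CPU device.
        print("TPU_NAME is not set; the master address would be empty.")
    else:
        print("Using TPU name: {}".format(name))
```

Running a check like this before `capser.train(...)` would show whether the script ever received a TPU name, which separates a configuration problem from a problem with the TPU itself.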
When I open another shell and type "ctpu status", I can see that my TPU cluster is running, but I also get the following panic:
adrien_doerig@capser-210106:~$ ctpu status
Your cluster is running!
Compute Engine VM: RUNNING
Cloud TPU: RUNNING
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x671b7e]
goroutine 1 [running]:
github.com/tensorflow/tpu/tools/ctpu/commands.(*statusCmd).Execute(0xc4200639e0, 0x770040, 0xc4200160d0, 0xc4200568a0, 0x0, 0x0, 0x0, 0x6dddc0)
/tmp/ctpu-release/src/github.com/tensorflow/tpu/tools/ctpu/commands/status.go:214 +0x5ce
github.com/google/subcommands.(*Commander).Execute(0xc420070000, 0x770040, 0xc4200160d0, 0x0, 0x0, 0x0, 0x5)
/tmp/ctpu-release/src/github.com/google/subcommands/subcommands.go:141 +0x29f
github.com/google/subcommands.Execute(0x770040, 0xc4200160d0, 0x0, 0x0, 0x0, 0xc420052700)
/tmp/ctpu-release/src/github.com/google/subcommands/subcommands.go:385 +0x5f
main.main()
/tmp/ctpu-release/src/github.com/tensorflow/tpu/tools/ctpu/main.go:87 +0xd5e
I tried the troubleshooting steps suggested at https://cloud.google.com/tpu/docs/troubleshooting but they did not help, because everything looks normal when I run:
gcloud compute tpus list
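I also tried creating a brand-new project, and even using a different Google account, but that did not solve the problem. I have not found any reports of similar errors with Cloud TPU. Am I missing something obvious?
Thanks for your help!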