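I've successfully trained models with Tensorflow's Object Detection API both locally on a GPU (using model_main.py) and on Google's ML Engine (GPU and TPU). However, I can't seem to train a model with model_tpu_main.py when running on Google's cloud with a manually configured VM and TPU.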

When I launch model_tpu_main.py with something like

python -m object_detection.model_tpu_main --model_dir=gs://bucket/training --tpu_zone us-central1-b --pipeline_config_path=gs://bucket/training/pipeline.config --job-dir gs://bucket/training --tpu_name mytpu_name

it gets stuck at:

...
W1113 03:05:16.628712 139998232708864 variables_helper.py:144] Variable [resnet_v1_50/fpn/smoothing_2/BatchNorm/moving_mean] is not available in checkpoint
W1113 03:05:16.629062 139998232708864 variables_helper.py:144] Variable [resnet_v1_50/fpn/smoothing_2/BatchNorm/moving_variance] is not available in checkpoint
W1113 03:05:16.629330 139998232708864 variables_helper.py:144] Variable [resnet_v1_50/fpn/smoothing_2/weights] is not available in checkpoint
2018-11-13 03:06:08.618834: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
...

Looking at the TPU logs, pretty much all I get is:

...
Start master session b9186abfa4e15b1d with config: isolate_session_state: true A 
Start master session 48b812f9ca0d3ebf with config: isolate_session_state: true A 
Start master session 33048226cb131f4c with config: isolate_session_state: true A 
Start master session cab95e277a429f9d with config: isolate_session_state: true A 
Start master session 56b5d3296c9bfe15 with config: isolate_session_state: true A 
Start master session 3fdac64b285c365d with config: isolate_session_state: true A 
Start master session ec1fa14806ad9351 with config: isolate_session_state: true A 
...

Any idea what I'm doing wrong?
