tensorflow - Training MNIST with a TPU produces errors

Question

I get the following error when I try to train:

python /usr/share/models/official/mnist/mnist_tpu.py \
  --tpu=$TPU_NAME \
  --DATA_DIR=${STORAGE_BUCKET}/data \
  --MODEL_DIR=${STORAGE_BUCKET}/output \
  --use_tpu=True \
  --iterations=500 \
  --train_steps=2000

=>

alexryan@alex-tpu:~/tpu$ ./train-mnist.sh 
W1025 20:21:39.351166 139745816463104 __init__.py:44] file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/__init__.py", line 41, in autodetect
    from . import file_cache
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/file_cache.py", line 41, in <module>
    'file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth')
ImportError: file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
Traceback (most recent call last):
  File "/usr/share/models/official/mnist/mnist_tpu.py", line 173, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/usr/share/models/official/mnist/mnist_tpu.py", line 152, in main
    tpu_config=tf.contrib.tpu.TPUConfig(FLAGS.iterations, FLAGS.num_shards),
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_config.py", line 207, in __init__
    self._master = cluster.master()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver.py", line 223, in master
    job_tasks = self.cluster_spec().job_tasks(self._job_name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver.py", line 269, in cluster_spec
    (compat.as_text(self._tpu), response['health']))
RuntimeError: TPU "alex-tpu" is unhealthy: "TIMEOUT"
alexryan@alex-tpu:~/tpu$

The only places where I deviated from the instructions are:

I ran ctpu on my Mac instead of in Cloud Shell.

>ctpu version
ctpu version: 1.7

The TPU is in a different zone from my configuration's default zone, so I passed the zone as an option, like this:

>cat ctpu-up.sh 
ctpu up --zone us-central1-b --preemptible

I was able to copy the MNIST files from the VM to the GCS bucket without any problem:

alexryan@alex-tpu:~$ gsutil cp -r ./data ${STORAGE_BUCKET}
Copying file://./data/validation.tfrecords [Content-Type=application/octet-stream]...
Copying file://./data/train-images-idx3-ubyte.gz [Content-Type=application/octet-stream]...

I tried the "(Optional) Set up TensorBoard > Run cloud_tpu_profiler" step, which says:

Go to Cloud Console > TPUs and click the TPU you created. Locate the service account name for the Cloud TPU and copy it, for example:

service-11111111118@cloud-tpu.iam.myserviceaccount.com

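In the list of buckets, select the bucket you want to use, select Show Info Panel, and then select Edit bucket permissions. Paste your service account name into the add members field for that bucket and select the following permissions: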

"Cloud Console > TPUs" does not exist as an option, so I used the service account associated with the VM ("Cloud Console > Compute Engine > alex-tpu").

Since the last error message was 'RuntimeError: TPU "alex-tpu" is unhealthy: "TIMEOUT"', I used ctpu to delete the VM, recreated it, and ran the script again. This time I got more errors:

This one appears to be just a warning...

ImportError: file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth

I am not sure about this one...

ERROR:tensorflow:Operation of type Placeholder (reshape_input) is not supported on the TPU. Execution will fail if this op is used in the graph.

This one seems to be what kills the training...

INFO:tensorflow:Error recorded from training_loop: File system scheme '[local]' not implemented (file: '/tmp/tmpaiggRW/model.ckpt-0_temp_9216e11a1368405795d9b5282775f562')      [[{{node save/SaveV2}} = SaveV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64],
_device="/job:worker/replica:0/task:0/device:CPU:0"](save/ShardedFilename, save/SaveV2/tensor_names, save/SaveV2/shape_and_slices, conv2d/bias/Read/ReadVariableOp, conv2d/kernel/Read/ReadVariableOp, conv2d_1/bias/Read/ReadVariableOp, conv2d_1/kernel/Read/ReadVariableOp, dense/bias/Read/ReadVariableOp, dense/kernel/Read/ReadVariableOp, dense_1/bias/Read/ReadVariableOp, dense_1/kernel/Read/ReadVariableOp, global_step/Read/ReadVariableOp)]]

Caused by op u'save/SaveV2', defined at:
  File "/usr/share/models/official/mnist/mnist_tpu.py", line 173, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/usr/share/models/official/mnist/mnist_tpu.py", line 163, in main
    estimator.train(input_fn=train_input_fn, max_steps=FLAGS.train_steps)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2394, in train
    saving_listeners=saving_listeners
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 356, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1181, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1215, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1406, in _train_with_estimator_spec
    log_step_count_steps=self._config.log_step_count_steps) as mon_sess:
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 504, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 921, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 643, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1107, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 800, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 557, in create_session
    self._scaffold.finalize()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 215, in finalize
    self._saver.build()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1106, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1143, in _build
    build_save=build_save, build_restore=build_restore)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 778, in _build_internal
    save_tensor = self._AddShardedSaveOps(filename_tensor, per_device)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 369, in _AddShardedSaveOps
    return self._AddShardedSaveOpsForV2(filename_tensor, per_device)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 343, in _AddShardedSaveOpsForV2
    sharded_saves.append(self._AddSaveOps(sharded_filename, saveables))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 284, in _AddSaveOps
    save = self.save_op(filename_tensor, saveables)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 202, in save_op
    tensors)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1690, in save_v2
    shape_and_slices=shape_and_slices, tensors=tensors, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3272, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1768, in __init__
    self._traceback = tf_stack.extract_stack()

UnimplementedError (see above for traceback): File system scheme '[local]' not implemented (file: '/tmp/tmpaiggRW/model.ckpt-0_temp_9216e11a1368405795d9b5282775f562')

Update

I get this error...

INFO:tensorflow:Error recorded from training_loop: File system scheme '[local]' not implemented

...even with --use_tpu=False

alexryan@alex-tpu:~/tpu$ cat train-mnist.sh 
python /usr/share/models/official/mnist/mnist_tpu.py \
  --tpu=$TPU_NAME \
  --DATA_DIR=${STORAGE_BUCKET}/data \
  --MODEL_DIR=${STORAGE_BUCKET}/output \
  --use_tpu=False \
  --iterations=500 \
  --train_steps=2000

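This Stack Overflow answer suggests that the TPU is trying to write to a file system that does not exist, rather than to the GCS bucket I specified. It is not clear to me why that would happen.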

score 1 · Accepted Answer

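In the first case, the TPU you created appears to have been in an unhealthy state. Deleting and recreating the TPU, or the whole VM, is the right way to resolve that.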

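I think the error in the second case (after you deleted the VM and recreated it) occurs because your ${STORAGE_BUCKET} is either undefined or not a valid GCS bucket. It must be a GCS bucket; a local path does not work and produces the error you are seeing. For more on creating a GCS bucket, see the "Create a Cloud Storage bucket" section at https://cloud.google.com/tpu/docs/tutorials/mnist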
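The check suggested here can be sketched in a few lines of Python. This is only an illustration and not part of the tutorial or of mnist_tpu.py: the helper name classify_storage_path is hypothetical, and the sketch assumes that a usable value is simply one that starts with the gs:// scheme followed by a bucket name.

```python
import os


def classify_storage_path(value):
    """Return 'unset', 'gcs', or 'local' for a STORAGE_BUCKET-style value.

    Hypothetical helper for illustration; it only inspects the string and
    does not contact Cloud Storage or verify that the bucket exists.
    """
    if not value:
        return "unset"
    prefix = "gs://"
    if value.startswith(prefix) and len(value) > len(prefix):
        return "gcs"
    return "local"


if __name__ == "__main__":
    # Read the same environment variable the training script relies on.
    bucket = os.environ.get("STORAGE_BUCKET", "")
    print("STORAGE_BUCKET=%r -> %s" % (bucket, classify_storage_path(bucket)))
```

Running this on the VM before starting training would show whether the variable is empty or points at a local path; anything other than "gcs" would be worth fixing first.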

Hope this answers your question.

score 0 · Accepted Answer

I ran into the same problem and found that the tutorial has a typo. If you check mnist_tpu.py, you will see that the flag names must be lowercase.

Once you change them, it works fine:

python /usr/share/models/official/mnist/mnist_tpu.py \
  --tpu=$TPU_NAME \
  --data_dir=${STORAGE_BUCKET}/data \
  --model_dir=${STORAGE_BUCKET}/output \
  --use_tpu=True \
  --iterations=500 \
  --train_steps=2000
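One plausible reading of why the uppercase flags fail so indirectly (an assumption, not something confirmed in this thread): if the script's flag parser silently skips names it does not recognize, then --MODEL_DIR never sets model_dir, the default is used, and checkpoints end up in a local temporary directory such as the /tmp/tmpaiggRW path in the error. The sketch below reproduces that effect with Python's standard argparse as a stand-in; it is not the flag library mnist_tpu.py uses, and the bucket name is made up.

```python
import argparse


def parse_flags(argv):
    """Parse a small set of lowercase flags, tolerating unknown ones.

    Illustrative stand-in only: parse_known_args() returns unrecognized
    arguments separately instead of raising an error.
    """
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--data_dir", default="")
    parser.add_argument("--model_dir", default=None)
    return parser.parse_known_args(argv)


if __name__ == "__main__":
    # Uppercase names are not recognized, so the defaults stay in place.
    flags, leftover = parse_flags(
        ["--DATA_DIR=gs://my-bucket/data", "--MODEL_DIR=gs://my-bucket/output"])
    print(flags.model_dir, leftover)

    # Lowercase names match the declared flags and are picked up.
    flags, leftover = parse_flags(
        ["--data_dir=gs://my-bucket/data", "--model_dir=gs://my-bucket/output"])
    print(flags.model_dir, leftover)
```

Under this reading, the same failure would appear with --use_tpu=False as well, since the output directory is wrong regardless of the device.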
