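I have written TensorFlow code using TPUEstimator, but I am having trouble running it with use_tpu=False. I want to run it on my local machine to make sure all operations are TPU-compatible. The code works fine with the normal Estimator. Here is my main script: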

import logging
import os  # needed for os.environ below
import subprocess

import tensorflow as tf  # made explicit instead of relying on the star imports
from absl import flags
from tensorflow.contrib.tpu.python.tpu import tpu_config, tpu_estimator, tpu_optimizer
from tensorflow.contrib.cluster_resolver import TPUClusterResolver
from capser_7_model_fn import *
from capser_7_input_fn import *

flags.DEFINE_bool(
    'use_tpu', False,
    'Use TPUs rather than plain CPUs')

tf.flags.DEFINE_string(
    "tpu", default='$TPU_NAME',
    help="The Cloud TPU to use for training. This should be either the name "
    "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 "
    "url.")

tf.flags.DEFINE_string("model_dir", LOGDIR, "Estimator model_dir")

flags.DEFINE_integer(
    'save_checkpoints_secs', 1000,
    'Interval (in seconds) at which the model data '
    'should be checkpointed. Set to 0 to disable.')

flags.DEFINE_integer(
    'save_summary_steps', 100,
    'Number of steps which must have run before showing summaries.')

tf.flags.DEFINE_integer("iterations", 1000,
                        "Number of iterations per TPU training loop.")

tf.flags.DEFINE_integer("num_shards", 8, "Number of shards (TPU chips).")

tf.flags.DEFINE_integer("batch_size", 1024,
                        "Mini-batch size for the training. Note that this "
                        "is the global batch size and not the per-shard batch.")

FLAGS = tf.flags.FLAGS

if FLAGS.use_tpu:
    my_project_name = subprocess.check_output(['gcloud', 'config', 'get-value', 'project'])
    my_zone = subprocess.check_output(['gcloud', 'config', 'get-value', 'compute/zone'])
    cluster_resolver = TPUClusterResolver(
        tpu=[FLAGS.tpu],
        zone=my_zone,
        project=my_project_name)
    master = TPUClusterResolver(tpu=[os.environ['TPU_NAME']]).get_master()
else:
    master = ''

my_tpu_run_config = tpu_config.RunConfig(
    master=master,
    model_dir=FLAGS.model_dir,
    save_checkpoints_secs=FLAGS.save_checkpoints_secs,
    save_summary_steps=FLAGS.save_summary_steps,
    session_config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=True),
    tpu_config=tpu_config.TPUConfig(iterations_per_loop=FLAGS.iterations, num_shards=FLAGS.num_shards),
)


# create estimator for model (the model is described in capser_7_model_fn)
capser = tpu_estimator.TPUEstimator(model_fn=model_fn_tpu,
                                    config=my_tpu_run_config,
                                    use_tpu=FLAGS.use_tpu,
                                    train_batch_size=batch_size,
                                    params={'model_batch_size': batch_size_per_shard})

# train model
logging.getLogger().setLevel(logging.INFO)  # to show info about training progress
capser.train(input_fn=train_input_fn_tpu, steps=n_steps)

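I define a capsule network in model_fn_tpu, which returns a TPUEstimator spec. The optimizer is a standard AdamOptimizer. I have made all the changes explained at https://www.tensorflow.org/guide/using_tpu#optimizer to make my code compatible with TPUEstimator. I get the following error: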

Traceback (most recent call last):
  File "C:/Users/doerig/PycharmProjects/capser/TPU_playground.py", line 85, in <module>
    capser.train(input_fn=train_input_fn_tpu, steps=n_steps)
  File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\estimator\estimator.py", line 363, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\estimator\estimator.py", line 843, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\estimator\estimator.py", line 856, in _train_model_default
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\estimator\estimator.py", line 831, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\contrib\tpu\python\tpu\tpu_estimator.py", line 2016, in _model_fn
    features, labels, is_export_mode=is_export_mode)
  File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\contrib\tpu\python\tpu\tpu_estimator.py", line 1121, in call_without_tpu
    return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
  File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\contrib\tpu\python\tpu\tpu_estimator.py", line 1317, in _call_model_fn
    estimator_spec = self._model_fn(features=features, **kwargs)
  File "C:\Users\doerig\PycharmProjects\capser\capser_7_model_fn.py", line 101, in model_fn_tpu
    **output_decoder_deconv_params)
  File "C:\Users\doerig\PycharmProjects\capser\capser_model.py", line 341, in capser_model
    loss_training_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step(), name="training_op")
  File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\training\optimizer.py", line 424, in minimize
    name=name)
  File "C:\Users\doerig\AppData\Local\Continuum\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\contrib\tpu\python\tpu\tpu_optimizer.py", line 113, in apply_gradients
    summed_grads_and_vars.append((tpu_ops.cross_replica_sum(grad), var))
AttributeError: module 'tensorflow.contrib.tpu.python.ops.tpu_ops' has no attribute 'cross_replica_sum'

Any ideas on how to fix this? Thanks in advance!

1 Answer

I suspect this is either a bug in the combination of your TensorFlow version and Windows, or a problem with how your TensorFlow was built.

For example, when I look up tensorflow\contrib\tpu\python\tpu\tpu_optimizer.py in the TF 1.4 branch, I see tpu_ops imported as:

from tensorflow.contrib.tpu.python.ops import tpu_ops

If you follow that import to the corresponding file, you will see:

if platform.system() != "Windows":
  # pylint: disable=wildcard-import,unused-import,g-import-not-at-top
  from tensorflow.contrib.tpu.ops.gen_tpu_ops import *

  from tensorflow.contrib.util import loader
  from tensorflow.python.platform import resource_loader
  # pylint: enable=wildcard-import,unused-import,g-import-not-at-top

  _tpu_ops = loader.load_op_library(
      resource_loader.get_path_to_datafile("_tpu_ops.so"))
else:
  # We have already built the appropriate libraries into the binary via CMake
  # if we have built contrib, so we don't need this
  pass
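
To confirm this on a given machine, a quick check along the following lines may help. It is only a minimal sketch: the helper name module_has_attr is hypothetical, and the module path and attribute name are taken from the traceback in the question. It reports False both when the module cannot be imported at all and when it imports but lacks the attribute.

```python
import importlib
import platform


def module_has_attr(module_name, attr_name):
    """Return True if the module imports cleanly and exposes attr_name."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Module (or one of its parent packages) is not importable here.
        return False
    return hasattr(module, attr_name)


if __name__ == "__main__":
    # Module path and attribute name come from the traceback in the question.
    print("platform:", platform.system())
    print("cross_replica_sum available:",
          module_has_attr("tensorflow.contrib.tpu.python.ops.tpu_ops",
                          "cross_replica_sum"))
```

If this prints False on Windows but True for the same TensorFlow version on Linux, that would be consistent with the platform check shown above.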

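On Windows the else branch is taken, so the wildcard import from gen_tpu_ops never runs, which would explain why tpu_ops has no cross_replica_sum attribute. Checking the other TF branches that existed at the time of this post, we see a similar comment in 1.5, 1.6, 1.7, 1.8, and 1.9.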

I strongly suspect this would not happen under Linux, but I may test that later and edit this answer.
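
In the meantime, one possible workaround for local testing is to apply the cross-shard wrapper only when use_tpu is true. This is a sketch under assumptions, not a confirmed fix: it assumes the call into tpu_optimizer.apply_gradients in the traceback comes from wrapping the optimizer in CrossShardOptimizer unconditionally, the helper name build_optimizer is hypothetical, and the import path assumes a TF 1.x install where tf.contrib is available.

```python
def build_optimizer(base_optimizer, use_tpu):
    """Wrap the optimizer for cross-shard gradient aggregation only on TPU.

    Hypothetical helper: with use_tpu=False the optimizer is returned
    unchanged, so no TPU-specific ops are touched on the local machine.
    """
    if not use_tpu:
        return base_optimizer
    # Imported lazily so the TPU code path is only loaded when requested.
    # Assumes TF 1.x, where tf.contrib.tpu.CrossShardOptimizer exists.
    from tensorflow.contrib.tpu import CrossShardOptimizer
    return CrossShardOptimizer(base_optimizer)
```

Inside the model function this could be called as build_optimizer(tf.train.AdamOptimizer(), FLAGS.use_tpu) before calling minimize. Note that it only avoids the missing op locally; it does not verify that the op is available on the TPU itself.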

answered 2018-09-28 at 20:55