tensorflow - Basic TPU cross-shard optimizer does not work

Question

There are some good examples of using the TF optimizers to solve general (non-deep-learning) problems. Given:

https://databricks.com/tensorflow/training-and-convergence https://colab.research.google.com/notebooks/tpu.ipynb#scrollTo=a_rjVo-RAoYd

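We would like to combine the two and use TPU-based optimization to solve high-dimensional problems.

To that end, I have a simple Colab notebook that merges the two examples above: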

import tensorflow as tf
import numpy as np
from tensorflow.contrib.tpu.python.tpu import tpu_function
import os
import pprint

if 'COLAB_TPU_ADDR' not in os.environ:
  print('ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!')
else:
  tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']
  print ('TPU address is', tpu_address)

  with tf.Session(tpu_address) as session:
    devices = session.list_devices()

  print('TPU devices:')
  pprint.pprint(devices)

# Add this somewhere at the top
tpu_function.get_tpu_context().set_number_of_shards(8)

# x and y are placeholders for our training data
x = tf.placeholder("float")
y = tf.placeholder("float")
# w is the variable storing our values. It is initialised with starting "guesses"
# w[0] is the "a" in our equation, w[1] is the "b"
w = tf.Variable([1.0, 2.0,3.0, 4.0], name="w")
# Our model of y = a*x + b
y_model = tf.multiply(x, w[0]) + w[1] + w[2] +3

# Our error is defined as the square of the differences
error = tf.square(y - y_model)
# The Adam optimizer does the heavy lifting
train_op = tf.train.AdamOptimizer(0.01)
optimizer = tf.contrib.tpu.CrossShardOptimizer(train_op).minimize(error) # TPU change 1





# Normal TensorFlow - initialize values, create a session and run the model
model = tf.global_variables_initializer()

with tf.Session(tpu_address) as session:
    session.run(tf.contrib.tpu.initialize_system())
    print('init')
    session.run(model)
    for i in range(10000):
        print(i)
        x_value = np.random.rand()
        y_value = x_value * 2 + 6 + 5 + 3
        session.run(optimizer, feed_dict={x: x_value, y: y_value})

    w_value = session.run(w)
    print("Predicted model: {a:.3f}x + {b:.3f}+{c:.3f}x + {d:.3f}".format(a=w_value[0], b=w_value[1], c=w_value[2], d=w_value[3]))
    session.run(tf.contrib.tpu.shutdown_system())

When I run it (in Colab), it only gets as far as the first loop iteration and prints:

init
0

Then it does nothing, and Colab just keeps spinning.

If I don't use

optimizer = tf.contrib.tpu.CrossShardOptimizer(train_op).minimize(error)

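and the other TPU functions, it estimates the w variable just fine.

The questions are:

Why does this not work, and how can we get the cross-shard optimizer to optimize this simple function?
How should I shape the variable w to take advantage of parallel batches/shards on the TPU?
How can we make this more efficient, either with the equivalent of a Dataset prefetch operation or with input queues?

The goal is to use the lower-level TPU API without TPUEstimator, for example using only tensors, queues, and shards, to harness the power of the TPU for custom problems.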

score 1 · Accepted Answer

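It doesn't work because you are overriding the number of shards without actually splitting the computation into shards. When I ran your code, I got the following error: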

InternalError: From /job:tpu_worker/replica:0/task:0:
RET_CHECK failure (platforms/xla/service/jellyfish/lowering/all_reduce_emitter.cc:832) replica_id < target.ReplicaCount() Unexpected replica id in all-reduce, replica_id is 1, target has 1 replicas.


Error encountered while compiling %all-reduce.7 = f32[4]{0:T(256)} all-reduce(f32[4]{0:T(256)} %arg0.1), replica_groups={{0,1,2,3,4,5,6,7}}, to_apply=%sum.3, metadata={op_type="CrossReplicaSum" op_name="CrossReplicaSum_21"}, backend_config="{barrier_type:3}".

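It is trying to perform the computation on eight shards and combine the results, but it only has one shard to work with. Take a look at tf.contrib.tpu.shard. It creates a sharding context with the given number of shards and distributes the computation across them. So, instead of setting the number of shards manually, you can define your variables as usual and then wrap any computation involving them in a function to be sharded: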

# REMOVE THIS
# tpu_function.get_tpu_context().set_number_of_shards(8)

# x and y are placeholders for our training data
x_placeholder = tf.placeholder("float")
y_placeholder = tf.placeholder("float")

# w is the variable storing our values. It is initialised with starting "guesses"
# w[0] is the "a" in our equation, w[1] is the "b"
w = tf.Variable([1.0, 2.0,3.0, 4.0], name="w")

# Wrap all of our tensorflow operations in a function we can shard
def calculations(x, y):
  # Our model of y = a*x + b
  y_model = tf.multiply(x, w[0]) + w[1] + w[2] +3

  # Our error is defined as the square of the differences
  # Average across the entire batch
  error = tf.reduce_mean(tf.square(y - y_model))
  # The Adam optimizer does the heavy lifting
  train_op = tf.train.AdamOptimizer(0.01)

  return tf.contrib.tpu.CrossShardOptimizer(train_op).minimize(error)

# Shard the function so that its calculation is distributed
optimizer = tf.contrib.tpu.shard(calculations, inputs=[x_placeholder, y_placeholder], num_shards=8)
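
To make the idea of "compute on eight shards and combine the results" concrete, here is a minimal NumPy sketch of the arithmetic involved. It is not TPU code and does not call any TensorFlow API. It assumes the per-shard gradients are averaged over equal-sized shards, which is my understanding of how CrossShardOptimizer combines them by default, so treat that as an assumption. The helper name grad_mse is made up for this illustration.

```python
import numpy as np

NUM_SHARDS = 8  # mirrors num_shards=8 in the sharded example


def grad_mse(w, x, y):
    """Gradient of mean((y - (w0*x + w1 + w2 + 3))**2) with respect to w.

    Hypothetical helper for illustration only; it hand-codes the gradient
    of the toy model used in this thread.
    """
    residual = (w[0] * x + w[1] + w[2] + 3.0) - y
    return np.array([
        np.mean(2.0 * residual * x),  # d(error)/d(w0)
        np.mean(2.0 * residual),      # d(error)/d(w1)
        np.mean(2.0 * residual),      # d(error)/d(w2)
        0.0,                          # w[3] is never used by the model
    ])


rng = np.random.default_rng(0)
w = np.array([1.0, 2.0, 3.0, 4.0])
x = rng.random(1024)          # one batch of 1024 samples
y = x * 2 + 6 + 5 + 3

# Each shard sees an equal slice of the batch along axis 0 ...
x_shards = np.split(x, NUM_SHARDS)
y_shards = np.split(y, NUM_SHARDS)
shard_grads = [grad_mse(w, xs, ys) for xs, ys in zip(x_shards, y_shards)]

# ... and the per-shard gradients are averaged before the weights are updated.
combined = np.mean(shard_grads, axis=0)

# With equal-sized shards this matches the gradient over the whole batch.
full = grad_mse(w, x, y)
print(np.allclose(combined, full))
```

The point of the sketch is that the combining step only makes sense when there really are eight slices of work to combine, which is what the sharded function provides and what overriding the shard count alone does not.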

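You don't need to reshape w to take advantage of sharding, because sharding happens along the batch dimension and there is a single set of weights for all inputs. You do need to add a batch dimension to your inputs so that each batch can be distributed across the cores. shard assumes that the first dimension is the batch dimension, but it includes a parameter to change that if your data is shaped differently. According to the TPU troubleshooting page, the ideal batch size is 1024, so that each TPU core gets 128 samples. If that is too large for your model, you can go smaller as long as it is a multiple of 128. Check out the link above and the performance guide for more tips on improving performance.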
```
for i in range(1000):
    print(i)
    x_value = np.random.rand(1024) # Generate a batch of 1024 values
    y_value = x_value * 2 + 6 + 5 + 3
    session.run(optimizer, feed_dict={x_placeholder: x_value, y_placeholder: y_value})
```
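Everything else should stay the same. I was able to train the model for all 10000 iterations. Keep in mind that for this simple model it will probably be slower than using a CPU/GPU, but you should expect performance improvements on more complex problems with larger datasets.
I'm not familiar enough with datasets or input queues to comment on those, but shard includes a parameter for an input queue, so it probably supports them. You may have to play around with it to see how it gets the data to the computation function.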
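
As a rough illustration of the batch-dimension requirement described above, the following NumPy sketch splits a batch along its first axis into one slice per core. The helper split_batch is hypothetical and only mimics the shape bookkeeping; it is not how tf.contrib.tpu.shard is implemented. It also shows why a scalar input, like the one fed in the original question, cannot be sharded.

```python
import numpy as np


def split_batch(batch, num_shards=8, axis=0):
    """Split `batch` into `num_shards` equal slices along `axis`.

    Hypothetical helper for illustration; raises ValueError when the
    input has no batch dimension or cannot be divided evenly.
    """
    batch = np.asarray(batch)
    if batch.ndim == 0:
        raise ValueError("input is a scalar and has no batch dimension to shard")
    if batch.shape[axis] % num_shards != 0:
        raise ValueError(
            "batch size %d is not divisible by %d shards"
            % (batch.shape[axis], num_shards)
        )
    return np.split(batch, num_shards, axis=axis)


x_value = np.random.rand(1024)   # a batch of 1024 values, as in the loop above
shards = split_batch(x_value)
print(len(shards), shards[0].shape)   # 8 slices of 128 samples each

try:
    split_batch(np.random.rand())    # a single scalar, as in the question
except ValueError as err:
    print("cannot shard:", err)
```

A check like this before calling session.run can surface a shape problem on the host as a readable Python error, rather than as a compilation failure on the TPU worker.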
