I am trying to implement a reinforcement learning algorithm in TensorFlow 2 by following a reference implementation written in TensorFlow 1. The algorithm requires distributed training. The TF1 implementation I am following uses tf.train.Server, while I would like to use tf.distribute.MirroredStrategy. However, I cannot find whether there is any connection between the two. As far as I can tell, both are used for distributed learning, but how similar are they, and how do they differ?
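For context, here is roughly how I expect tf.distribute.MirroredStrategy to be used in TF2. This is only a minimal sketch of single-machine training with Keras; the model and the toy data are placeholders I made up, not code from the repository:

import numpy as np
import tensorflow as tf

# MirroredStrategy replicates training across the GPUs of one machine
# (falling back to the CPU if no GPU is available).
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
  # Variables created inside the scope are mirrored on every local replica.
  model = tf.keras.Sequential([
      tf.keras.layers.Dense(64, activation='relu'),
      tf.keras.layers.Dense(1),
  ])
  model.compile(optimizer='adam', loss='mse')

# Hypothetical toy data, only to make the sketch runnable.
x = np.random.rand(256, 8).astype('float32')
y = np.random.rand(256, 1).astype('float32')
model.fit(x, y, epochs=1, batch_size=32)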
The code on GitHub, lines 468-483 (here), may help make the question clearer:
def train(action_set, level_names):
  """Train."""
  if is_single_machine():
    local_job_device = ''
    shared_job_device = ''
    is_actor_fn = lambda i: True
    is_learner = True
    global_variable_device = '/gpu'
    server = tf.train.Server.create_local_server()
    filters = []
  else:
    local_job_device = '/job:%s/task:%d' % (FLAGS.job_name, FLAGS.task)
    shared_job_device = '/job:learner/task:0'
    is_actor_fn = lambda i: FLAGS.job_name == 'actor' and i == FLAGS.task
    is_learner = FLAGS.job_name == 'learner'

    # Placing the variables on the CPU makes it cheaper to send them to all
    # the actors. Continually copying the variables from the GPU is slow.
    global_variable_device = shared_job_device + '/cpu'
    cluster = tf.train.ClusterSpec({
        'actor': ['localhost:%d' % (8001 + i) for i in range(FLAGS.num_actors)],
        'learner': ['localhost:8000']
    })
    server = tf.train.Server(cluster, job_name=FLAGS.job_name,
                             task_index=FLAGS.task)
    filters = [shared_job_device, local_job_device]
  ...
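For comparison, my current (possibly wrong) understanding is that in TF2 the cluster described by tf.train.ClusterSpec and tf.train.Server above would instead be passed through the TF_CONFIG environment variable to a multi-worker strategy. The sketch below is an assumption, not working IMPALA code: the 'actor'/'learner' job names from the snippet have no direct equivalent, so it uses a generic single 'worker' job, and a real cluster would list one address per machine with each process setting its own task index:

import json
import os
import tensorflow as tf

# A single-worker cluster so the snippet runs standalone; the strategy
# reads TF_CONFIG and starts the gRPC server itself, so there is no
# explicit tf.train.Server call as in the TF1 code above.
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {'worker': ['localhost:8000']},
    'task': {'type': 'worker', 'index': 0},
})
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
  # Variables created here are replicated across all workers, loosely
  # playing the role of the shared learner variables in the TF1 code.
  step = tf.Variable(0, dtype=tf.int64)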