
I am trying to implement a reinforcement learning algorithm in TensorFlow 2 by referring to an existing implementation written in TensorFlow 1.

The algorithm requires distributed training. The TF1 implementation I am following uses tf.train.Server, and I would like to use tf.distribute.MirroredStrategy instead. However, I cannot find anything describing how the two relate. I believe both are meant for distributed training, but how similar are they, and in what ways do they differ?
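
For reference, my understanding of MirroredStrategy is that it replicates variables across the GPUs of one machine and keeps the replicas in sync. A minimal sketch of how I would use it (based on the tf.distribute docs; the toy Keras model is just mine for illustration):

import tensorflow as tf

# Mirrors all variables across the local GPUs and all-reduces gradients;
# falls back to a single device if no GPU is available.
strategy = tf.distribute.MirroredStrategy()
print('Replicas in sync:', strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created inside the scope (here, a toy model) are mirrored.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer='sgd', loss='mse')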

The code on GitHub, lines 468-483 (here), may help clarify the question:

def train(action_set, level_names):
  """Train."""

  if is_single_machine():
    local_job_device = ''
    shared_job_device = ''
    is_actor_fn = lambda i: True
    is_learner = True
    global_variable_device = '/gpu'
    server = tf.train.Server.create_local_server()
    filters = []
  else:
    local_job_device = '/job:%s/task:%d' % (FLAGS.job_name, FLAGS.task)
    shared_job_device = '/job:learner/task:0'
    is_actor_fn = lambda i: FLAGS.job_name == 'actor' and i == FLAGS.task
    is_learner = FLAGS.job_name == 'learner'

    # Placing the variables on the CPU makes it cheaper to send them to all
    # the actors. Continually copying the variables from the GPU is slow.
    global_variable_device = shared_job_device + '/cpu'
    cluster = tf.train.ClusterSpec({
        'actor': ['localhost:%d' % (8001 + i) for i in range(FLAGS.num_actors)],
        'learner': ['localhost:8000']
    })
    server = tf.train.Server(cluster, job_name=FLAGS.job_name,
                             task_index=FLAGS.task)
    filters = [shared_job_device, local_job_device]
    ...
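
From what I can tell so far, the closest tf.distribute analogue to the explicit tf.train.ClusterSpec / tf.train.Server setup above would be a multi-worker strategy configured through the TF_CONFIG environment variable, along these lines (my own sketch with a hypothetical two-process cluster, and assuming homogeneous 'worker' roles, which already differs from the actor/learner split above):

import json
import os
import tensorflow as tf

# Hypothetical two-process cluster; tf.distribute reads the cluster layout
# from the TF_CONFIG env var instead of building a tf.train.Server per process.
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {'worker': ['localhost:8000', 'localhost:8001']},
    'task': {'type': 'worker', 'index': 0},  # this process's role in the cluster
})

# Must be created after TF_CONFIG is set; all workers run the same program
# and variables are mirrored across them.
strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])

But MultiWorkerMirroredStrategy seems to assume every process does the same work, while the code above assigns actors and the learner different jobs, and that is exactly the part I do not know how to map.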