I am trying to implement a reinforcement learning algorithm in TensorFlow 2 by following a reference implementation written in TensorFlow 1. The algorithm requires distributed training. The TF1 implementation I am following uses tf.train.Server, while I would like to use tf.distribute.MirroredStrategy. However, I cannot find whether there is any connection between the two. As far as I can tell, both are used for distributed learning, but how similar are they, and how do they differ?
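For context, here is roughly how I expect tf.distribute.MirroredStrategy to be used in TF2. This is only a minimal sketch of single-machine training with Keras; the model and the toy data are placeholders I made up, not code from the repository:

import numpy as np
import tensorflow as tf

# MirroredStrategy replicates training across the GPUs of one machine
# (falling back to the CPU if no GPU is available).
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
  # Variables created inside the scope are mirrored on every local replica.
  model = tf.keras.Sequential([
      tf.keras.layers.Dense(64, activation='relu'),
      tf.keras.layers.Dense(1),
  ])
  model.compile(optimizer='adam', loss='mse')

# Hypothetical toy data, only to make the sketch runnable.
x = np.random.rand(256, 8).astype('float32')
y = np.random.rand(256, 1).astype('float32')
model.fit(x, y, epochs=1, batch_size=32)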
The code on GitHub, lines 468-483 (here), may help make the question clearer:
def train(action_set, level_names):
  """Train."""
  if is_single_machine():
    local_job_device = ''
    shared_job_device = ''
    is_actor_fn = lambda i: True
    is_learner = True
    global_variable_device = '/gpu'
    server = tf.train.Server.create_local_server()
    filters = []
  else:
    local_job_device = '/job:%s/task:%d' % (FLAGS.job_name, FLAGS.task)
    shared_job_device = '/job:learner/task:0'
    is_actor_fn = lambda i: FLAGS.job_name == 'actor' and i == FLAGS.task
    is_learner = FLAGS.job_name == 'learner'

    # Placing the variables on the CPU makes it cheaper to send them to all
    # the actors. Continually copying the variables from the GPU is slow.
    global_variable_device = shared_job_device + '/cpu'
    cluster = tf.train.ClusterSpec({
        'actor': ['localhost:%d' % (8001 + i) for i in range(FLAGS.num_actors)],
        'learner': ['localhost:8000']
    })
    server = tf.train.Server(cluster, job_name=FLAGS.job_name,
                             task_index=FLAGS.task)
    filters = [shared_job_device, local_job_device]
  ...
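For comparison, my current (possibly wrong) understanding is that in TF2 the cluster described by tf.train.ClusterSpec and tf.train.Server above would instead be passed through the TF_CONFIG environment variable to a multi-worker strategy. The sketch below is an assumption, not working IMPALA code: the 'actor'/'learner' job names from the snippet have no direct equivalent, so it uses a generic single 'worker' job, and a real cluster would list one address per machine with each process setting its own task index:

import json
import os
import tensorflow as tf

# A single-worker cluster so the snippet runs standalone; the strategy
# reads TF_CONFIG and starts the gRPC server itself, so there is no
# explicit tf.train.Server call as in the TF1 code above.
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {'worker': ['localhost:8000']},
    'task': {'type': 'worker', 'index': 0},
})
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
  # Variables created here are replicated across all workers, loosely
  # playing the role of the shared learner variables in the TF1 code.
  step = tf.Variable(0, dtype=tf.int64)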