slurm - How do I distribute custom code through the SLURM manager?

Question

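I have access to a computer cluster managed by SLURM. I would like different nodes to execute different parts of my code. If I understand correctly, this can be done with SLURM and the srun command, provided the code is written accordingly. It should look similar to the MPI example here: https://support.ceci-hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html.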

However, I don't understand how to write such code in TF. There is more information available for TF version 1. If I try something like this

import tensorflow as tf

jobs = {'worker': 4}
cluster = tf.distribute.cluster_resolver.SlurmClusterResolver(jobs=jobs)
# Attempt to start all four worker servers from one process
server0 = tf.distribute.Server(cluster.cluster_spec(), job_name='worker', task_index=0)
server1 = tf.distribute.Server(cluster.cluster_spec(), job_name='worker', task_index=1)
server2 = tf.distribute.Server(cluster.cluster_spec(), job_name='worker', task_index=2)
server3 = tf.distribute.Server(cluster.cluster_spec(), job_name='worker', task_index=3)

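and run it with SLURM, I get an error: only the first server starts, and the second server then tries to use the same address, "localhost:8888". So essentially I don't know how to create servers on different nodes that can communicate with each other later. Should I run different scripts at the same time? Do I have to use the command line with flags or something similar?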

After that, my idea is to use

with tf.device("/job:worker/task:0"):
    pass  # some code
with tf.device("/job:worker/task:1"):
    pass  # some other code

to distribute the work. Any help? I don't think I can use any of the distribution strategies that TF provides.
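On the question of whether several scripts have to run at once: one common pattern, given here only as a sketch and not as something confirmed in this thread, is to launch the same script once per task (for example with srun) and let each process start only its own server. SLURM sets the SLURM_PROCID environment variable for every task it launches, so it can serve as the task index. The helper names `task_index_from_env` and `start_worker` are hypothetical, and the fallback to 0 outside SLURM is an assumption.

```python
import os


def task_index_from_env(env=None):
    """Return this process's task index from SLURM_PROCID (0 if unset)."""
    env = os.environ if env is None else env
    return int(env.get("SLURM_PROCID", "0"))


def start_worker(cluster_dict, env=None):
    """Start exactly one server in this process, for this process's own task.

    cluster_dict is assumed to look like {'worker': ['host:port', ...]}.
    """
    import tensorflow as tf  # deferred so the index logic works without TF

    index = task_index_from_env(env)
    return tf.distribute.Server(
        tf.train.ClusterSpec(cluster_dict), job_name="worker", task_index=index
    )
```

With this layout no single process tries to bind every worker address, which is what went wrong in the snippet above.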

score 0 · Accepted Answer

It seems I have found a solution, so I am posting it in case it helps someone. It looks like using

cluster = tf.compat.v1.train.ClusterSpec({'worker': ['n03:2222', 'n04:2223'] })

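instead of cluster_resolver solves the problem of identical addresses. After that I needed to open a session, and it had to be a session on the target of the server associated with task 1 (I don't know why; it may have something to do with the master node), like this: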

# Open the session against the server for task 1
with tf.compat.v1.Session(server1.target) as sess:
    x = tf.Variable(...)
    for k in range(n):
        # f1 runs on task 0, f2 on task 1
        y1 = f1(x)
        y2 = f2(x)
        y1 = y1.eval()
        y2 = y2.eval()

where f1(x) is a tf.function assigned to a worker, for example:

@tf.function
def f1(x):
    with tf.device("/job:worker/task:0"):
        y=...
        x.assign(x+1)
    return y

and f2(x) is similar, only for task 1. All of this lives in a single script that I call from a .sh file.
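The explicit worker list passed to ClusterSpec above does not have to be typed by hand. As a rough sketch, it could be generated from a list of host names, for instance the output of `scontrol show hostnames "$SLURM_JOB_NODELIST"`. The helper names `parse_hostnames` and `cluster_dict` and the base port 2222 are assumptions made for illustration; incrementing the port per worker simply mirrors the addresses used in this answer.

```python
def parse_hostnames(scontrol_output):
    """Split newline-separated host names, ignoring blank lines."""
    return [line.strip() for line in scontrol_output.splitlines() if line.strip()]


def cluster_dict(hostnames, base_port=2222):
    """Give every worker its own host:port entry.

    The port increases with the worker index, so two workers that share
    a host never ask for the same address.
    """
    workers = [f"{host}:{base_port + i}" for i, host in enumerate(hostnames)]
    return {"worker": workers}


hosts = parse_hostnames("n03\nn04\n")
print(cluster_dict(hosts))  # {'worker': ['n03:2222', 'n04:2223']}
```

The resulting dictionary has the same shape as the one given to tf.compat.v1.train.ClusterSpec earlier in this answer.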
