
I'm trying to learn the Slurm system, but I'm having some trouble understanding it. I'm trying to run a bunch of jobs in parallel using the --array option of sbatch. I want the jobs to be spread across multiple nodes, but judging by the timestamps they all seem to run on the same node.

The sbatch command I'm using:

sbatch  -N 10 -a 0-19 --cpus-per-task 10 test.sh

The test.sh file being run:

#!/usr/bin/env bash

#SBATCH -o test_%a.out
#SBATCH -p all.q
#SBATCH --time=1:00:00

srun --cpus-per-task 10 -k --exclusive --ntasks 1 -N 1 echo "`date ` array_index: $SLURM_ARRAY_TASK_ID node: $SLURM_NODEID requested nodes: $SLURM_NNODES  `sleep 3`" 

Output files:

Thu Feb 12 19:51:28 UTC 2015 array_index: 0 node: 0 requested nodes: 10
Thu Feb 12 19:51:45 UTC 2015 array_index: 10 node: 0 requested nodes: 10
Thu Feb 12 19:51:45 UTC 2015 array_index: 11 node: 0 requested nodes: 10
Thu Feb 12 19:51:49 UTC 2015 array_index: 12 node: 0 requested nodes: 10
Thu Feb 12 19:51:49 UTC 2015 array_index: 13 node: 0 requested nodes: 10
Thu Feb 12 19:51:52 UTC 2015 array_index: 14 node: 0 requested nodes: 10
Thu Feb 12 19:51:52 UTC 2015 array_index: 15 node: 0 requested nodes: 10
Thu Feb 12 19:51:56 UTC 2015 array_index: 16 node: 0 requested nodes: 10
Thu Feb 12 19:51:56 UTC 2015 array_index: 17 node: 0 requested nodes: 10
Thu Feb 12 19:51:59 UTC 2015 array_index: 18 node: 0 requested nodes: 10
Thu Feb 12 19:51:59 UTC 2015 array_index: 19 node: 0 requested nodes: 10
Thu Feb 12 19:51:28 UTC 2015 array_index: 1 node: 0 requested nodes: 10
Thu Feb 12 19:51:32 UTC 2015 array_index: 2 node: 0 requested nodes: 10
Thu Feb 12 19:51:32 UTC 2015 array_index: 3 node: 0 requested nodes: 10
Thu Feb 12 19:51:35 UTC 2015 array_index: 4 node: 0 requested nodes: 10
Thu Feb 12 19:51:35 UTC 2015 array_index: 5 node: 0 requested nodes: 10
Thu Feb 12 19:51:39 UTC 2015 array_index: 6 node: 0 requested nodes: 10
Thu Feb 12 19:51:39 UTC 2015 array_index: 7 node: 0 requested nodes: 10
Thu Feb 12 19:51:42 UTC 2015 array_index: 8 node: 0 requested nodes: 10
Thu Feb 12 19:51:42 UTC 2015 array_index: 9 node: 0 requested nodes: 10

2 Answers


SLURM's job array support runs one batch job to completion first, then the next. So here, the first batch script (call it $JOB_ID.0) finishes (it contains only a single srun command), then the second one starts, and so on. These can only run serially.

Instead, you can have a single batch job and run multiple srun commands within that job. That will span as many nodes as you want.
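A minimal sketch of that pattern (a hypothetical script, assuming the same 10-node, all.q setup as the question) might look like:

```shell
#!/usr/bin/env bash
#SBATCH -N 10                 # one allocation spanning 10 nodes
#SBATCH --ntasks=20
#SBATCH --cpus-per-task=10
#SBATCH -p all.q
#SBATCH --time=1:00:00

# Launch 20 job steps inside the single allocation. Each step runs
# one task on one node; --exclusive keeps steps from sharing CPUs,
# and the trailing & lets the steps run concurrently.
for i in $(seq 0 19); do
    srun -N 1 -n 1 --exclusive \
        bash -c 'echo "$(date) step '"$i"' node: $SLURMD_NODENAME"' &
done
wait   # keep the batch script alive until every step finishes
```

Submitted once with sbatch, this replaces the 20-element array with 20 concurrent job steps inside a single allocation, and Slurm distributes the steps across the 10 nodes.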

Answered 2015-02-13T10:47:26.623

Here is a simple script for spreading array jobs across nodes... the trick here is to assign multiple cores to a single array task (in the script below we use 16 cores... so a 64-core machine gets 4 tasks - you can change this as you see fit). I named the file "job_array.sbatch"; it can be invoked with "sbatch -a 1-20 job_array.sbatch" (or whatever array elements you want to use):

#!/bin/bash
#
# invoke using sbatch -a 1-20 job_array.sbatch
#
#SBATCH -n 16 # request 16 tasks (i.e. 16 cores) per array element, which effectively spreads the array across nodes
#SBATCH -N 1 # on one machine
#SBATCH -J job_array
#SBATCH -t 00:00:30
#
date
echo ""
echo "job_array.sbatch"
echo "  Run several instances of a job using a single script."
#
#  Each job has values of certain environment variables.
#
echo "  SLURM_ARRAY_JOB_ID = " $SLURM_ARRAY_JOB_ID
echo "  SLURM_ARRAY_TASK_ID = " $SLURM_ARRAY_TASK_ID
echo "  SLURM_JOB_ID = " $SLURM_JOB_ID
echo "  SLURM_NODELIST = " $SLURM_NODELIST
#
#  Terminate.
#
echo ""
echo "job_array.sbatch:"
echo "  Normal end of execution."
date
#
exit
Answered 2017-09-12T22:43:22.353