
I'm running batches of serial programs on a (very) inhomogeneous SLURM cluster (version 2.6.6-2), using GNU 'parallel' to do the distribution. The problem is that some nodes finish their tasks much faster than others, so I end up with situations where, for example, a job has 4 nodes allocated but is only using 1 of them for half of the run.

Is there any way, without administrator privileges, to release these unused nodes back to the queue? I can mitigate the problem by running 4 separate single-node jobs (sketched below), or by using node files that list only homogeneous nodes, but both are far from ideal.
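Roughly, the single-node workaround looks like this (the job_chunk_*.sh scripts are hypothetical quarters of command_list.sh, each submitted as its own job):

# submit one single-node job per chunk of the command list
for i in 0 1 2 3; do
    sbatch --nodes=1 --ntasks=4 job_chunk_$i.sh
done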

For reference, here are the script files that I'm using (adapted from here):

job.sh

#!/bin/sh

#SBATCH --job-name=test
#SBATCH --time=96:00:00
#SBATCH --ntasks=16
#SBATCH --mem-per-cpu=1024
#SBATCH --ntasks-per-node=4
#SBATCH --partition=normal

# --delay .2 prevents overloading the controlling node
# -j is the number of tasks parallel runs so we set it to $SLURM_NTASKS
# --joblog makes parallel create a log of tasks that it has already run
# --resume makes parallel use the joblog to resume from where it has left off
# the combination of --joblog and --resume allow jobs to be resubmitted if
# necessary and continue from where they left off
parallel="parallel --delay .2 -j $SLURM_NTASKS"
$parallel < command_list.sh

command_list.sh

srun --exclusive -N1 -n1 nice -19 ./a.out config0.dat
srun --exclusive -N1 -n1 nice -19 ./a.out config1.dat
srun --exclusive -N1 -n1 nice -19 ./a.out config2.dat

...

srun --exclusive -N1 -n1 nice -19 ./a.out config31.dat
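Thanks to the --joblog and --resume flags, a job that hits the walltime can simply be resubmitted, and parallel will skip the configs already recorded in the log:

sbatch job.sh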

1 Answer


You can use the scontrol command to shrink your job:

scontrol update JobId=# NumNodes=#

But I am not sure how Slurm chooses which nodes to release. You might have to select them manually and write

scontrol update JobId=# NodeList=<names>

See question 24 in the Slurm FAQ.
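As a rough sketch (untested on Slurm 2.6, and the node names below are placeholders), doing this from inside a running job might look like:

# see which nodes are still running job steps
squeue --steps --jobs=$SLURM_JOB_ID
# keep only the busy nodes, releasing the rest back to the queue
scontrol update JobId=$SLURM_JOB_ID NodeList=node[01-02]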
