Thanks for all the comments and answers. Here is what I did to reach the final solution:
Initially, I had a bash script that submitted the jobs to the cluster as a job array, using $PBS_ARRAYID to pass a different command-line argument to each job:
#PBS -N ondemand/sys/myjobs/default
#PBS -l walltime=24:10:00
#PBS -l file=1gb
#PBS -l nodes=1:ppn=1:gpus=1:default
#PBS -l mem=40MB
#PBS -j oe
#PBS -A PAS0108
#PBS -o Job_array.out
# Move to the directory where the job was submitted
# run the following cmd in shell to submit the job in an array
# qsub -t 1-6 myjob.sh
cd $PBS_O_WORKDIR
cd $TMPDIR
# $PBS_ARRAYID can be used as a variable
# copy data to local storage of the node
cp ~/code/2018_9_28_training_node_6_OSC/* $TMPDIR
cp -r ~/Datasets_1/Processed/PascalContext/ResNet_Output/ $TMPDIR
cp -r ~/Datasets_1/Processed/PascalContext/Truth/ $TMPDIR
cp -r ~/Datasets_1/Processed/PascalContext/Decision_Tree/ $TMPDIR
# currently in $TMPDIR, load modules
module load python/3.6 cuda
# using $PBS_ARRAYID as a variable to pass the corresponding node ID
python train_decision_tree_node.py $PBS_ARRAYID $TMPDIR > training_log_${PBS_ARRAYID}
# saving logs
cp training_log_${PBS_ARRAYID} ${HOME}/logs/${PBS_ARRAYID}_node_log/
cp my_results $PBS_O_WORKDIR
I submitted the script above from the command line:
qsub -t 1-6 myjob.sh
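For anyone unfamiliar with array jobs, the command above asks the scheduler to run the same script six times, once per task. A local dry run (a sketch of the substitution, not the scheduler itself) makes the per-task value of $PBS_ARRAYID visible:

```shell
#!/usr/bin/env bash
# Local dry run of `qsub -t 1-6 myjob.sh`: the scheduler starts six copies
# of the script, each seeing a different value of PBS_ARRAYID.
for PBS_ARRAYID in $(seq 1 6); do
    echo "task ${PBS_ARRAYID}: python train_decision_tree_node.py ${PBS_ARRAYID} \$TMPDIR"
done
```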
However, the cluster returned an error about $TMPDIR: when the script ran, the actual compute node did not recognize it as a local directory.
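One defensive workaround I did not end up needing (purely an assumption, using only standard shell): fall back to a per-process directory under /tmp when TMPDIR is missing, instead of letting `cd $TMPDIR` fail outright:

```shell
#!/usr/bin/env bash
# Assumed fallback: if the node does not export TMPDIR, stage data in a
# per-process directory under /tmp rather than failing on `cd $TMPDIR`.
SCRATCH="${TMPDIR:-/tmp/scratch_$$}"
mkdir -p "$SCRATCH"
echo "staging data in: $SCRATCH"
```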
In the end, what worked was a top-level tcsh script that submits each job in a while loop, passing a different command-line argument each time:
run_multiple_jobs.tcsh:
#!/usr/bin/env tcsh
set n = 1
while ( $n <= 5 )
    echo "Submitting job for concept node $n"
    qsub -v NODE=$n job.pbs
    @ n++
end
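The same loop can be written in bash if tcsh is not available on the login node; the `echo` before `qsub` makes this a dry run (drop it to actually submit):

```shell
#!/usr/bin/env bash
# bash equivalent of run_multiple_jobs.tcsh; `echo qsub` prints the
# submission commands instead of running them.
for n in $(seq 1 5); do
    echo "Submitting job for concept node $n"
    echo qsub -v NODE=$n job.pbs
done
```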
job.pbs:
#PBS -A PAS0108
#PBS -N node_${NODE}
#PBS -l walltime=160:00:00
#PBS -l nodes=1:ppn=1:gpus=1
#PBS -l mem=5GB
#PBS -m ae
#PBS -j oe
# copy data
cp -r ~/Datasets/Processed/PascalContext/Decision_Tree $TMPDIR
cp -r ~/Datasets/Processed/PascalContext/Truth $TMPDIR
cp -r ~/Datasets/Processed/PascalContext/ResNet_Output $TMPDIR
# move to working directory
cd $PBS_O_WORKDIR
# run program
module load python/3.6 cuda/8.0.44
python train_decision_tree_node.py ${NODE} $TMPDIR $HOME
# run with run_multiple_jobs.tcsh script
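Since ${NODE} only exists inside the job if it was passed with `qsub -v`, a small hypothetical guard at the top of job.pbs can fail fast on a bad submission (this check is my addition, not part of the original script):

```shell
#!/usr/bin/env bash
# Hypothetical guard (not in the original job.pbs): refuse to run if NODE
# was not supplied via `qsub -v NODE=...`.
check_node() {
    if [ -z "${1:-}" ]; then
        echo "error: NODE not set; submit with: qsub -v NODE=<id> job.pbs" >&2
        return 1
    fi
    echo "training decision-tree node $1"
}
check_node "3"                                   # a valid submission
check_node "" || echo "rejected: missing NODE"   # a missing variable
```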