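Background:
I wrote a Python script that converts files from one format to another. The script takes a text file (subject_list.txt) as input and iterates over the source directory names listed in it (several hundred directories, each containing thousands of files), converting their contents and storing them in a specified output directory.
Problem:
To save time, I would like to run this script on a high-performance computing (HPC) cluster and create jobs that convert the files in parallel, rather than iterating through each directory in the list sequentially.
I am new to both Python and HPC. Our lab used to write mostly in Bash and had no access to an HPC environment, but we recently gained HPC access and decided to switch to Python, so everything is fairly new.
Question:
Is there a Python module that would let me create jobs from within a Python script? I found the documentation for the multiprocessing and subprocess modules, but it is not clear to me how I would use them. Or should I take a different approach? I have also read a number of Stack Overflow posts about using Slurm and Python together, but I am stuck with too much information and not enough knowledge to tell which thread to pick up. Any help is greatly appreciated.
Environment: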
HPC:Red Hat Enterprise Linux Server release 7.4 (Maipo)
python3/3.6.1
slurm 17.11.2
Housekeeping part of the code:
import os
import subprocess

# Change this for your study
group = "labname"
study = "studyname"

# Set paths
archivedir = "/projects/" + group + "/archive"
sourcedir = "/projects/" + group + "/shared/DICOMS/" + study
# NOTE: archivedir is already an absolute path, so appending it here nests it
# under the study directory; adjust if that is not the intended layout.
niidir = "/projects/" + group + "/shared/" + study + archivedir + "/clean_niftis"
outputlog = niidir + "/outputlog_convert.txt"
errorlog = niidir + "/errorlog_convert.txt"
dcm2niix = "/projects/" + group + "/shared/dcm2niix/build/bin/dcm2niix"

# Source the subject list (needs to be in your current working directory)
subjectlist = "subject_list.txt"

# Check/create the log files
def touch(path):
    with open(path, 'a'):     # append mode creates the file if it is missing
        os.utime(path, None)  # set its access/modification times to now

if not os.path.isfile(outputlog):  # if the file does not exist...
    touch(outputlog)
if not os.path.isfile(errorlog):
    touch(errorlog)
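String concatenation makes it easy to drop a path separator, so one option is to join the components with os.path.join instead. What follows is only a minimal sketch: the helper name build_paths, the root parameter, and the placement of archive/clean_niftis under the study directory are assumptions made for illustration, not the layout the script necessarily uses.

```python
import os


def build_paths(group, study, root="/projects"):
    """Return the script's paths, joined component by component.

    The directory layout here is an assumption for illustration only.
    """
    shared = os.path.join(root, group, "shared")
    niidir = os.path.join(shared, study, "archive", "clean_niftis")
    return {
        "sourcedir": os.path.join(shared, "DICOMS", study),
        "niidir": niidir,
        "outputlog": os.path.join(niidir, "outputlog_convert.txt"),
        "errorlog": os.path.join(niidir, "errorlog_convert.txt"),
        "dcm2niix": os.path.join(shared, "dcm2niix", "build", "bin", "dcm2niix"),
    }


if __name__ == "__main__":
    # Print each path so the layout can be checked before anything is run.
    for name, path in sorted(build_paths("labname", "studyname").items()):
        print("{}: {}".format(name, path))
```

Because every separator is inserted by os.path.join, a forgotten "/" between two components cannot occur.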
The part I'm stuck on:
with open(subjectlist) as file:
    lines = file.readlines()

for line in lines:
    subject = line.strip()
    subjectpath = sourcedir + "/" + subject
    if os.path.isdir(subjectpath):
        with open(outputlog, 'a') as logfile:
            logfile.write(subject + os.linesep)
        # Submit a job to the HPC with sbatch. This next line was not in the
        # original script that works, and it isn't correct, but it captures
        # the gist of what I am trying to do (written in bash, so it is
        # commented out here).
        # sbatch --job-name dcm2nii_"${subject}" --partition=short --time 00:60:00 --mem-per-cpu=2G --cpus-per-task=1 -o "${niidir}"/"${subject}"_dcm2nii_output.txt -e "${niidir}"/"${subject}"_dcm2nii_error.txt
        # This is what I want the job to do for the files in each directory.
        # Each flag and each value is a separate list element, and -o is
        # followed directly by the output directory.
        subprocess.call([dcm2niix, "-b", "y", "-o", niidir, subjectpath])
    else:
        with open(errorlog, 'a') as logfile:
            logfile.write(subject + os.linesep)
Edit 1:
dcm2niix is the software used for the conversion, and it is available on the HPC. It takes the following flags and paths: -b y -o outputDirectory sourceDirectory.
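When the conversion is launched through subprocess, each flag and each value has to be its own list element; a single string such as "-b y" would reach dcm2niix as one argument. Here is a minimal sketch of that, where the helper names dcm2niix_args and convert are hypothetical and the paths in the usage lines are placeholders:

```python
import subprocess


def dcm2niix_args(dcm2niix, niidir, subjectpath):
    """Build the argument list for one dcm2niix run.

    -b y requests a BIDS sidecar, -o names the output directory, and the
    input directory comes last.
    """
    return [dcm2niix, "-b", "y", "-o", niidir, subjectpath]


def convert(dcm2niix, niidir, subjectpath):
    """Run the conversion and return dcm2niix's exit status."""
    return subprocess.call(dcm2niix_args(dcm2niix, niidir, subjectpath))


if __name__ == "__main__":
    # Placeholder paths; this only prints the arguments and runs nothing.
    print(dcm2niix_args("dcm2niix", "/path/to/niidir", "/path/to/subject"))
```

Keeping the list construction in its own function means the arguments can be inspected without actually invoking dcm2niix.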
Edit 2 (solution):
with open(subjectlist) as file:
    lines = file.readlines()  # read every line of the subject list

for line in lines:
    subject = line.strip()
    subjectpath = sourcedir + "/" + subject
    if os.path.isdir(subjectpath):
        with open(outputlog, 'a') as logfile:
            logfile.write(subject + os.linesep)
        # Create a job to submit to the HPC with sbatch
        batch_cmd = (
            'sbatch --job-name dcm2nii_{subject} --partition=short '
            '--time 00:60:00 --mem-per-cpu=2G --cpus-per-task=1 '
            '-o {niidir}/{subject}_dcm2nii_output.txt '
            '-e {niidir}/{subject}_dcm2nii_error.txt '
            '--wrap="/projects/{group}/shared/dcm2niix/build/bin/dcm2niix -o {niidir} {subjectpath}"'
        ).format(subject=subject, niidir=niidir, subjectpath=subjectpath, group=group)
        # Submit the job (a single command string, interpreted by the shell)
        subprocess.call(batch_cmd, shell=True)
    else:
        with open(errorlog, 'a') as logfile:
            logfile.write(subject + os.linesep)
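A possible variation on this solution, sketched here rather than taken from the working script, is to pass sbatch an argument list so that shell=True is not needed and subject names with unusual characters are not reinterpreted by the shell. The function names build_sbatch_cmd and submit and the dry_run switch are hypothetical; the sbatch options are the same ones used above.

```python
import shlex
import subprocess


def build_sbatch_cmd(subject, niidir, subjectpath, dcm2niix):
    """Return the sbatch call for one subject as an argument list."""
    # --wrap takes a single command string, so quote each piece of it.
    wrapped = " ".join(
        shlex.quote(arg) for arg in [dcm2niix, "-o", niidir, subjectpath]
    )
    return [
        "sbatch",
        "--job-name=dcm2nii_{}".format(subject),
        "--partition=short",
        "--time=00:60:00",
        "--mem-per-cpu=2G",
        "--cpus-per-task=1",
        "-o", "{}/{}_dcm2nii_output.txt".format(niidir, subject),
        "-e", "{}/{}_dcm2nii_error.txt".format(niidir, subject),
        "--wrap={}".format(wrapped),
    ]


def submit(cmd, dry_run=True):
    """Print the command when dry_run is set; otherwise hand it to sbatch."""
    if dry_run:
        print(" ".join(shlex.quote(part) for part in cmd))
        return 0
    return subprocess.call(cmd)


if __name__ == "__main__":
    # Placeholder values; dry_run=True means nothing is submitted.
    submit(build_sbatch_cmd("sub01", "/path/to/niidir",
                            "/path/to/DICOMS/sub01", "/path/to/dcm2niix"))
```

Running with dry_run=True first makes it possible to check the generated commands for a few subjects before submitting hundreds of jobs.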