python - "Embarrassingly parallel" programming using python and PBS on a cluster

Question

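I have a function (a neural network model) which produces figures. I wish to test several parameters, methods and different inputs (meaning hundreds of runs of the function) from python, using PBS on a standard cluster with Torque.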

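Note: I tried parallelpython, ipython and the like, and was never completely satisfied, since I want something simpler. The cluster is in a given configuration that I cannot change, and a solution integrating python + qsub like this would certainly benefit the community.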

To simplify things, I have a simple function such as:

import pylab
import myModule

def model(input, a=1., N=100):
    # Run the (long) computation, then save the resulting figure
    myModule.do_lots_number_crunching(input, a, N)
    pylab.savefig('figure_' + input.name + '_' + str(a) + '_' + str(N) + '.png')

where input is an object representing the input, input.name is a string, and do_lots_number_crunching may last for hours.

My question is: is there a correct way to transform something like a parameter scan such as

for a in pylab.linspace(0., 1., 100):
    model(input, a)

into "something" that would launch a PBS script like the following for every call to the model function?

#PBS -l ncpus=1
#PBS -l mem=i1000mb
#PBS -l cput=24:00:00
#PBS -V
cd /data/work/
python experiment_model.py

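I was thinking of a function that would include the PBS template and be called from the python script, but I have not been able to figure it out yet (a decorator?).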
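The idea can be sketched roughly as follows: a plain function fills in a PBS template for one value of a and hands the text to qsub. This is only a sketch under assumptions. The names PBS_TEMPLATE, render_job and submit_job are hypothetical, experiment_model.py is assumed to read a from its command line, and the local qsub is assumed to accept a job script on standard input and print the job id.

```python
import subprocess

# Hypothetical template; the resource lines are copied from the PBS script above.
PBS_TEMPLATE = """#PBS -l ncpus=1
#PBS -l mem=i1000mb
#PBS -l cput=24:00:00
#PBS -V
cd /data/work/
python experiment_model.py {a}
"""


def render_job(a):
    """Return the text of a PBS job script for one parameter value."""
    return PBS_TEMPLATE.format(a=a)


def submit_job(a, dry_run=True):
    """Submit one job via qsub; with dry_run=True only return the script text."""
    script = render_job(a)
    if dry_run:
        return script
    # Assumption: qsub reads the job script from stdin and prints the job id.
    result = subprocess.run(["qsub"], input=script, text=True,
                            capture_output=True, check=True)
    return result.stdout.strip()


if __name__ == "__main__":
    # Preview the script for a = 0.5 without touching the queue.
    print(submit_job(0.5))
```

With dry_run=True nothing is submitted, which makes it possible to inspect the generated scripts before looping over the whole parameter range.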

score 4 · Accepted Answer

pbs_python[1] could work for this. If experiment_model.py takes 'a' as an argument, you could do

import os

import pbs
import pylab

# Connect to the default PBS/Torque server
server_name = pbs.pbs_default()
c = pbs.pbs_connect(server_name)

# Job attributes, mirroring the #PBS lines in the question
attropl = pbs.new_attropl(4)
attropl[0].name = pbs.ATTR_l
attropl[0].resource = 'ncpus'
attropl[0].value = '1'

attropl[1].name = pbs.ATTR_l
attropl[1].resource = 'mem'
attropl[1].value = 'i1000mb'

attropl[2].name = pbs.ATTR_l
attropl[2].resource = 'cput'
attropl[2].value = '24:00:00'

attropl[3].name = pbs.ATTR_V

script = '''
cd /data/work/
python experiment_model.py %f
'''

jobs = []

for a in pylab.linspace(0., 1., 100):
    # Write a temporary job script for this value of a, submit it, then delete it
    script_name = 'experiment_model.job' + str(a)
    with open(script_name, 'w') as scriptf:
        scriptf.write(script % a)
    job_id = pbs.pbs_submit(c, attropl, script_name, 'NULL', 'NULL')
    jobs.append(job_id)
    os.remove(script_name)

print(jobs)

[1]: https://oss.trac.surfsara.nl/pbs_python/wiki/TorqueUsage pbs_python
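For completeness, here is a minimal sketch of what the command-line side of experiment_model.py might look like if it takes 'a' as an argument. The parse_args and main helpers and the optional --N flag are assumptions for illustration, not part of the question's code; the defaults mirror a=1. and N=100 from the model signature.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line of the hypothetical experiment_model.py."""
    parser = argparse.ArgumentParser(description="Run the model for one value of a.")
    # `a` is positional so that `python experiment_model.py 0.250000` works.
    parser.add_argument("a", type=float, nargs="?", default=1.0,
                        help="parameter value for this run (default: 1.0)")
    parser.add_argument("--N", type=int, default=100,
                        help="second model parameter (default: 100)")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Placeholder: the real script would build `input` here and call
    # model(input, args.a, args.N) from the question.
    print("would run model with a=%g, N=%d" % (args.a, args.N))
    return args


if __name__ == "__main__":
    main()
```

The job script above formats a with %f, so the value arrives as a string such as 0.250000, which type=float converts back to a number.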

score 3 · Accepted Answer

You can do this easily using jug (which I developed for a similar setup).

You would write in a file (for example, model.py):

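import numpy as np
from jug import TaskGenerator

@TaskGenerator
def model(param1, param2):
    # complex_computation and pyplot.coolgraph are placeholders for your own code
    res = complex_computation(param1, param2)
    pyplot.coolgraph(res)


for param1 in np.linspace(0, 1., 100):
    for param2 in range(2000):
        model(param1, param2)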

And that's it!

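Now you can launch "jug jobs" on your queue: jug execute model.py, and this will parallelize automatically. What happens is that each job will, in a loop, do something like: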

while not all_done():
    for t in tasks_that_i_can_run():
        if t.lock_for_me():
            t.run()

(It's actually more complicated than that, but you get the point.)

It uses the filesystem for locking (if you're on an NFS system) or a redis server if you prefer. It can also handle dependencies between tasks.
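To illustrate the general idea of file-based locking (this is not jug's actual implementation, just a sketch of the technique), each worker can try to create a lock file atomically and only run the task if the creation succeeds. The names try_lock and run_available are hypothetical, and whether exclusive file creation is reliable on a particular network file system is something to check for your own setup.

```python
import os
import tempfile


def try_lock(lock_dir, task_name):
    """Try to claim a task by atomically creating its lock file.

    Returns True if this call created the file, False if it already existed.
    """
    path = os.path.join(lock_dir, task_name + ".lock")
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so only one
        # caller can succeed for a given task name.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def run_available(lock_dir, tasks):
    """Run every task this worker manages to lock; return their names."""
    done = []
    for name, func in tasks:
        if try_lock(lock_dir, name):
            func()
            done.append(name)
    return done


if __name__ == "__main__":
    results = []
    tasks = [("a=%.1f" % a, lambda a=a: results.append(a)) for a in (0.0, 0.5, 1.0)]
    with tempfile.TemporaryDirectory() as lock_dir:
        first = run_available(lock_dir, tasks)   # claims all three tasks
        second = run_available(lock_dir, tasks)  # finds them already locked
    print(first, second, results)
```

Because every worker runs the same loop, you can start as many of them as the queue allows and each task is still executed only once.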

This is not exactly what you asked for, but I believe it's a cleaner architecture to keep this separate from the job queuing system.

score 2 · Accepted Answer

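Looks like I'm a little late to the party, but a few years ago I had the same question of how to map embarrassingly parallel problems onto a cluster in python, and wrote my own solution. I recently uploaded it to github here: https://github.com/plediii/pbs_util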

To write your program with pbs_util, I would first create a pbs_util.ini in the working directory containing

[PBSUTIL]
numnodes=1
numprocs=1
mem=i1000mb
walltime=24:00:00

Then a python script like this

import pbs_util.pbs_map as ppm

import pylab
import myModule

class ModelWorker(ppm.Worker):

    def __init__(self, input, N):
        self.input = input
        self.N = N

    def __call__(self, a):
        myModule.do_lots_number_crunching(self.input, a, self.N)
        pylab.savefig('figure_' + self.input.name + '_' + str(a) + '_' + str(self.N) + '.png')



# You need "main" protection like this since pbs_map will import this file on the compute nodes
if __name__ == "__main__":
    input, N = something, picklable
    # Use list to force the iterator
    list(ppm.pbs_map(ModelWorker, pylab.linspace(0., 1., 100),
                     startup_args=(input, N),
                     num_clients=100))

And that should do it.

score 0 · Accepted Answer

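I have just started working with clusters and EP (embarrassingly parallel) applications. My goal (I work in the library) is to learn enough to help other researchers on campus access HPC with EP applications, especially researchers outside of STEM. I am still very new, but thought it might help this question to point out the use of GNU Parallel in a PBS script to launch basic python scripts with varying arguments. In the .pbs file, there are two lines to point out: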

module load gnu-parallel # this is required on my environment

parallel -j 4 --env PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
--workdir $NODE_LOCAL_DIR --transfer --return 'output.{#}' --clean \
`pwd`/simple.py '{#}' '{}' ::: $INPUT_DIR/input.*

# `-j 4` is the number of processors to use per node, will be cluster-specific
# {#} will substitute the process number into the string
# `pwd`/simple.py `{#}` `{}`   this is the command that will be run multiple times
# ::: $INPUT_DIR/input.* all of the files in $INPUT_DIR/ that start with 'input.' 
#     will be substituted into the python call as the second(3rd) argument where the
#     `{}` resides.  These can be simple text files that you use in your 'simple.py'
#     script to pass the parameter sets, filenames, etc.

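As a newbie to EP supercomputing, even though I do not yet understand all the other options of "parallel", this command allowed me to launch python scripts in parallel with different parameters. This works well if you can generate a large number of parameter files ahead of time that parallelize your problem, for example when running simulations across a parameter space or processing many files with the same code.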
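As a rough sketch of the "generate parameter files ahead of time" step, the snippet below writes one small file per parameter set, named input.0, input.1, and so on, so that they match the $INPUT_DIR/input.* pattern above. The JSON layout and the helper names write_param_files and read_params are assumptions for illustration; simple.py could read its file in whatever format you prefer.

```python
import json
import os
import tempfile


def write_param_files(input_dir, a_values, N=100):
    """Write one small JSON file per parameter set: input.0, input.1, ..."""
    os.makedirs(input_dir, exist_ok=True)
    paths = []
    for i, a in enumerate(a_values):
        path = os.path.join(input_dir, "input.%d" % i)
        with open(path, "w") as f:
            json.dump({"a": a, "N": N}, f)
        paths.append(path)
    return paths


def read_params(path):
    """What a script like simple.py might do with the file it is handed."""
    with open(path) as f:
        return json.load(f)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as input_dir:
        # 100 evenly spaced values of a in [0, 1], as in the question's sweep.
        a_values = [i / 99 for i in range(100)]
        paths = write_param_files(input_dir, a_values)
        print(len(paths), read_params(paths[0]))
```

Each file then becomes one `{}` substitution in the parallel command, so the number of files determines the number of python runs.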
