cluster-computing - qsub returns error when submitting jobs from node

Question

I have a complex Fortran MPI application running under a Torque/Maui system. When I run my application, it produces a single huge output (~20 GB). To avoid that, I wrote a RunJob script that splits the run into 5 pieces, each producing a smaller output that is much easier to handle.

For the moment my RunJob script stops correctly at the end of the first piece and also produces the correct output. However, when it tries to restart I get the following error message:

qsub: Bad UID for job execution MSG=ruserok failed validating username/username from compute-0-0.local

I know that this problem comes from the fact that the Torque/Maui system by default does not allow jobs to be submitted from a compute node.

In fact, when I type this:

qmgr -c 'l s' | grep allow_node_submit

I get:

allow_node_submit = False

I do not have an administrator account, just a regular user one.

My questions are:

1. Is it possible to set allow_node_submit = True in qmgr as a regular user? If so, how? (I guess not.)
2. If not, is there another way to work around this? How?

All the best.

score 3 · Accepted Answer

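No, an unprivileged user cannot change the settings of the queuing system. There is usually a good reason for not allowing resubmission from compute nodes: it protects the cluster and all of its users from someone accidentally (or otherwise) submitting a script that fails quickly and resubmits itself once, or worse, more than once, rapidly flooding the scheduler and the queue. That is the batch-queue equivalent of a fork bomb. Even with such restrictions in place, we have had people accidentally submit tens of thousands of jobs at once because of a script error.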

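The usual workaround is to ssh to one of the queue submission nodes and submit the script from there, for example at the end of your submission script: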

ssh queue-head-node qsub /path/to/new/submission/script

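This is how we recommend our users handle it. It obviously only works if passwordless/passphraseless ssh is enabled within the cluster, which is a common (but not universal) practice.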
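Given the fork-bomb concern above, it seems prudent to guard the resubmission with a hard limit. Below is a minimal sketch of how the tail of a RunJob script might do that. The function name `resubmit_next`, the `PIECE` environment variable, and the host name `queue-head-node` are assumptions for illustration, not part of Torque or of the original script; `qsub -v` is the standard way to pass a variable to the next job.

```shell
#!/bin/bash
# Sketch only: resubmit the next piece via ssh, but never beyond a fixed count.
# resubmit_next PIECE MAX_PIECES HEAD_NODE SCRIPT
#   PIECE      index of the piece that just finished
#   MAX_PIECES total number of pieces (5 in the question)
#   HEAD_NODE  a submission host (hypothetical name, site-specific)
#   SCRIPT     absolute path to the submission script
resubmit_next() {
    piece=$1
    max_pieces=$2
    head_node=$3
    script=$4

    next=$((piece + 1))

    # Hard stop: refuse to resubmit once every piece has run.
    if [ "$next" -gt "$max_pieces" ]; then
        echo "all $max_pieces pieces done; not resubmitting"
        return 0
    fi

    # Submit from the head node, passing the next piece index to the job.
    ssh "$head_node" qsub -v PIECE="$next" "$script"
}

# Example call at the very end of the job script, only after the run succeeded:
# resubmit_next "${PIECE:-1}" 5 queue-head-node /path/to/new/submission/script
```

Calling the function only after the application has exited successfully keeps a crashing run from resubmitting itself.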

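Alternatively, if this is the common case of automatically submitting a series of jobs that each continue a run, you can look at how your site handles job dependencies and submit a convoy of jobs, each depending on the successful completion of the previous one, so that they run in order.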
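With Torque, such a chain is typically built with `qsub -W depend=afterok:<jobid>`, run once from a submission host. The following is a sketch under assumptions: the function name `submit_chain` and the script names `piece1.pbs` through `piece5.pbs` are hypothetical, and your site may document a different dependency mechanism.

```shell
#!/bin/bash
# Sketch only: submit a chain of jobs, each starting after the previous one
# finishes successfully. Relies on qsub printing the new job ID on stdout.
submit_chain() {
    prev_id=""
    for script in "$@"; do
        if [ -z "$prev_id" ]; then
            # The first piece has no dependency.
            job_id=$(qsub "$script") || return 1
        else
            # Later pieces are held until the previous job exits with status 0.
            job_id=$(qsub -W depend=afterok:"$prev_id" "$script") || return 1
        fi
        echo "submitted $script as $job_id"
        prev_id=$job_id
    done
}

# Example call, run on a submission host (script names are hypothetical):
# submit_chain piece1.pbs piece2.pbs piece3.pbs piece4.pbs piece5.pbs
```

Because all five jobs are submitted up front from a node that is allowed to submit, nothing has to be resubmitted from a compute node, and a failed piece leaves the remaining jobs unrun instead of looping.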
