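I found an article explaining how to tell bsub to wait for a specified set of jobs to finish before running, but that only works if the number of jobs is known in advance.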

I want to run an arbitrary number of jobs, and then run a "wrap-up" job once all of them have finished.

Here is my script:

#!/bin/bash
for file in dir/*; do # I don't know how many jobs will be created
    bsub "./do_it_once.sh $file"
done

bsub -w "done(1) && done(2) && done(3)" merge_results.sh

The script above works when 3 jobs are submitted, but what if there are n jobs? How do I specify that I want to wait for all of them to finish?

4 Answers

Edit: see kamula's answer for an approach that actually works :).

Original answer

I've never used bsub, but from a quick look at the man page, I think this might do it:

#!/bin/bash
jobnum=0
for file in src/*; do # I don't know how many jobs will be created
    bsub -J "myjobs[$jobnum]" "./do_it_once.sh $file"
    jobnum=$((jobnum + 1))
done

bsub -w "done(myjobs[*])" merge_results.sh

The jobs are named with sequential indices in a bsub array called myjobs[], using the bash variable jobnum. The final bsub then waits for all myjobs[] jobs to finish.

YMMV!

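Oh, also: you may need to use -J "\"myjobs[...]\"" (with the \"). The man page says to enclose the job name in double quotes, but I don't know whether that is a bsub requirement or whether they assume you will be using a shell that expands unquoted text.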
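For comparison, here is a rough sketch of how a real LSF job array might be used instead of one bsub call per file. It rests on assumptions that are not from this thread: filelist.txt and run_one.sh are hypothetical names, the array is submitted once as myjobs[1-n], each element picks its input through the LSB_JOBINDEX environment variable, and the array's job ID is read from the "Job <id> is submitted ..." line that bsub prints. Untested against a live cluster, so treat it as a starting point only.

```shell
#!/bin/bash
# Sketch: one LSF job array instead of n separately named jobs.
# Hypothetical names (not from the thread): filelist.txt, run_one.sh.

# Write one input path per line so array element i can look up line i.
make_filelist() {
    local dir="$1" list="$2" file
    : > "$list"
    for file in "$dir"/*; do
        [ -e "$file" ] && printf '%s\n' "$file" >> "$list"
    done
    return 0
}

# Print the path stored on line $2 of list $1 (array indices start at 1).
nth_file() {
    sed -n "${2}p" "$1"
}

# Submit only where bsub exists and there is something to process.
if command -v bsub > /dev/null 2>&1 && [ -d src ]; then
    make_filelist src filelist.txt
    n=$(wc -l < filelist.txt)
    n=$((n))   # strip the padding some wc implementations add
    if [ "$n" -gt 0 ]; then
        # run_one.sh is assumed to call:
        #   ./do_it_once.sh "$(sed -n "${LSB_JOBINDEX}p" filelist.txt)"
        out=$(bsub -J "myjobs[1-$n]" ./run_one.sh)
        array_id=$(sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p' <<< "$out")
        bsub -w "done($array_id)" ./merge_results.sh
    fi
fi
```

The point of the file list is that the array is sized after the files have been counted, so the number of jobs never has to be known when the script is written.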

answered 2016-08-30T14:38:35.667

Based on cxw's reply, I got something working. It doesn't use arrays. However, the -w option accepts wildcards, so I gave every job a similar name.

I'm still not sure this is the right way to call bsub, since you need one call per job, but it works.

Edit from cxw:

#!/bin/bash
jobnum=0
for file in src/*; do # I don't know how many jobs will be created
    bsub -J "myjobs${jobnum}" "./do_it_once.sh $file"
    jobnum=$((jobnum + 1))
done

bsub -w "done(myjobs*)" merge_results.sh
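If you would rather not depend on job names, the "done(1) && done(2) && done(3)" expression from the question can be generated instead of typed. The sketch below assumes bsub prints a line of the form "Job <12345> is submitted to queue <normal>." for each submission; job_id_from_output and build_dependency are hypothetical helper names. I have not checked whether LSF limits the length of the -w expression, so this may not suit very large n.

```shell
#!/bin/bash
# Extract the numeric job ID from a line such as
# "Job <12345> is submitted to queue <normal>."
job_id_from_output() {
    sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p' <<< "$1"
}

# Join job IDs into "done(1) && done(2) && ...".
build_dependency() {
    local expr="" id
    for id in "$@"; do
        expr+="${expr:+ && }done($id)"
    done
    printf '%s\n' "$expr"
}

# Submit only where bsub is actually available.
if command -v bsub > /dev/null 2>&1; then
    ids=()
    for file in src/*; do
        out=$(bsub "./do_it_once.sh $file")
        ids+=("$(job_id_from_output "$out")")
    done
    # Skip the merge job if nothing was submitted.
    if [ "${#ids[@]}" -gt 0 ]; then
        bsub -w "$(build_dependency "${ids[@]}")" merge_results.sh
    fi
fi
```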

answered 2016-08-30T15:25:18.007

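Here is my complete solution, which adds time control and reports the number of failed jobs. If needed, it also kills the child processes of failed jobs, and it handles zombie or uninterruptible processes: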

function Logger {
    echo "$1"
}

# Portable child (and grandchild) kill function tested under Linux, BSD and MacOS X
function KillChilds {
    local pid="${1}" # Parent pid to kill childs
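    local self="${2:-false}" # Should parent be killed too ?

    # Recurse into every child first so grandchildren are killed before their parents
    if children="$(pgrep -P "$pid")"; then
        for child in $children; do
            KillChilds "$child" true
        done
    fi
    # Try to kill nicely; if that fails, wait 15 seconds to let trap actions happen before killing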
    if ( [ "$self" == true ] && kill -0 $pid > /dev/null 2>&1); then
        kill -s TERM "$pid"
        if [ $? != 0 ]; then
            sleep 15
            Logger "Sending SIGTERM to process [$pid] failed."
            kill -9 "$pid"
            if [ $? != 0 ]; then
                Logger "Sending SIGKILL to process [$pid] failed."
                return 1
            fi
        else
            return 0
        fi
    else
        return 0
    fi
}

function WaitForTaskCompletion {
    local pids="${1}" # pids to wait for, separated by semi-colon
    local soft_max_time="${2}" # If program with pid $pid takes longer than $soft_max_time seconds, will log a warning, unless $soft_max_time equals 0.
    local hard_max_time="${3}" # If program with pid $pid takes longer than $hard_max_time seconds, will stop execution, unless $hard_max_time equals 0.
    local caller_name="${4}" # Who called this function
    local counting="${5:-true}" # Count time since function has been launched if true, since script has been launched if false
    local keep_logging="${6:-0}" # Log a standby message every X seconds. Set to zero to disable logging

    local soft_alert=false # Does a soft alert need to be triggered, if yes, send an alert once
    local log_ttime=0 # local time instance for comparison

    local seconds_begin=$SECONDS # Seconds since the beginning of the script
    local exec_time=0 # Seconds since the beginning of this function

    local retval=0 # return value of monitored pid process
    local errorcount=0 # Number of pids that finished with errors

    local pid   # Current pid working on
    local pidCount # number of given pids
    local pidState # State of the process

    local pidsArray # Array of currently running pids
    local newPidsArray # New array of currently running pids

    IFS=';' read -a pidsArray <<< "$pids"
    pidCount=${#pidsArray[@]}

    WAIT_FOR_TASK_COMPLETION=""

    while [ ${#pidsArray[@]} -gt 0 ]; do
        newPidsArray=()

        Spinner
        if [ $counting == true ]; then
            exec_time=$(($SECONDS - $seconds_begin))
        else
            exec_time=$SECONDS
        fi

        if [ $keep_logging -ne 0 ]; then
            if [ $((($exec_time + 1) % $keep_logging)) -eq 0 ]; then
                if [ $log_ttime -ne $exec_time ]; then # Fix when sleep time lower than 1s
                    log_ttime=$exec_time
                    Logger "Current tasks still running with pids [$(joinString , ${pidsArray[@]})]."
                fi
            fi
        fi

        if [ $exec_time -gt $soft_max_time ]; then
            if [ $soft_alert == false ] && [ $soft_max_time -ne 0 ]; then
                Logger "Max soft execution time exceeded for task [$caller_name] with pids [$(joinString , ${pidsArray[@]})]."
                soft_alert=true # Make sure the soft alert is only sent once
                SendAlert true
            fi
            if [ $exec_time -gt $hard_max_time ] && [ $hard_max_time -ne 0 ]; then
                Logger "Max hard execution time exceeded for task [$caller_name] with pids [$(joinString , ${pidsArray[@]})]. Stopping task execution."
                for pid in "${pidsArray[@]}"; do
                    KillChilds $pid true
                    if [ $? == 0 ]; then
                        Logger "Task with pid [$pid] stopped successfully." "NOTICE"
                    else
                        Logger "Could not stop task with pid [$pid]." "ERROR"
                    fi
                done
                SendAlert true
                errorcount=$((errorcount+1))
            fi
        fi

        for pid in "${pidsArray[@]}"; do
            if [ $(IsNumeric $pid) -eq 1 ]; then
                if kill -0 $pid > /dev/null 2>&1; then
                    # Handle uninterruptible sleep state or zombies by omitting them from the running process array (how do you kill what is already dead? :)
                    #TODO(high): have this tested on *BSD, Mac & Win
                    pidState=$(ps -p$pid -o state= 2> /dev/null)
                    if [ "$pidState" != "D" ] && [ "$pidState" != "Z" ]; then
                        newPidsArray+=($pid)
                    fi
                else
                    # pid is dead, get its exit code from the wait command
                    wait $pid
                    retval=$?
                    if [ $retval -ne 0 ]; then
                        errorcount=$((errorcount+1))
                        Logger "${FUNCNAME[0]} called by [$caller_name] finished monitoring [$pid] with exitcode [$retval]." "DEBUG"
                        # Accumulate failed pids as "pid:exitcode" pairs separated by semicolons
                        if [ "$WAIT_FOR_TASK_COMPLETION" == "" ]; then
                            WAIT_FOR_TASK_COMPLETION="$pid:$retval"
                        else
                            WAIT_FOR_TASK_COMPLETION="$WAIT_FOR_TASK_COMPLETION;$pid:$retval"
                        fi
                    fi
                fi

            fi
        done

        pidsArray=("${newPidsArray[@]}")
        # Trivial wait time for bash to not eat up all CPU
        sleep .05
    done

    # Return the number of pids that finished with errors (0 means every task succeeded)
    return $errorcount
}
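
The function above calls four helpers that are not shown here: Spinner, joinString, SendAlert and IsNumeric. The stand-ins below are only assumptions so that the script can run on its own. Their behavior is inferred from how they are called, and the author's real versions may well differ.

```shell
#!/bin/bash
# Hypothetical stand-ins for helpers that WaitForTaskCompletion calls.

# No-op progress indicator.
function Spinner {
    :
}

# Join all arguments after the first, using the first one as separator.
function joinString {
    local IFS="$1"
    shift
    echo "$*"
}

# Placeholder alert hook: just write a line to stderr.
function SendAlert {
    echo "ALERT requested" >&2
}

# Echo 1 if the argument is a non-negative integer, else 0.
function IsNumeric {
    if [[ "$1" =~ ^[0-9]+$ ]]; then
        echo 1
    else
        echo 0
    fi
}
```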

Usage:

Let's start 3 sleep jobs, collect their pids and pass them to WaitForTaskCompletion:

sleep 10 &
pids="$!"
sleep 15 &
pids="$pids;$!"
sleep 20 &
pids="$pids;$!"

WaitForTaskCompletion "$pids" 1800 3600 "${FUNCNAME[0]}" false 1800

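The previous example warns you once execution exceeds 30 minutes (1800 seconds), stops execution after 1 hour (3600 seconds), and logs an "alive" message every half hour.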
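
After the call, $? is the number of pids that failed, and WAIT_FOR_TASK_COMPLETION is meant to hold "pid:exitcode" pairs separated by semicolons. A small sketch for printing them, where report_failures is a hypothetical helper name and not part of the script above:

```shell
#!/bin/bash
# Print one line per "pid:exitcode" pair found in the given string.
report_failures() {
    local entries entry
    IFS=';' read -r -a entries <<< "$1"
    for entry in "${entries[@]}"; do
        [ -n "$entry" ] || continue
        # Text before the first colon is the pid, text after it is the exit code.
        echo "pid ${entry%%:*} exited with code ${entry#*:}"
    done
}

# Example call with the variable set by WaitForTaskCompletion.
report_failures "${WAIT_FOR_TASK_COMPLETION:-}"
```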

answered 2016-08-30T16:05:15.240

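Since bjobs prints 1 line (No unfinished job found) when no jobs are pending or running, and at least 2 lines (a header plus one line per job) when at least 1 job is pending or running: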

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
25156   awesome RUN   best_queue superhost   30*host     cool_name  Jun 16 05:38

you can loop on bjobs | wc -l:

for job in $some_jobs; do
    bsub < "$job"

    # Wait for the job to complete before submitting the next one
    while [[ $(bjobs | wc -l) -ge 2 ]]; do
        sleep 15
    done
done

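One advantage of this technique is that you can launch several jobs at once, however many you need to run. Just loop over them before waiting. It's obviously not the cleanest approach, but it works for now.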

for some_jobs in $job_groups; do
    for job in $some_jobs; do
        bsub < "$job"
    done

    # Wait for the whole group to complete
    while [[ $(bjobs | wc -l) -ge 2 ]]; do
        sleep 15
    done
done
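
A variation on the same polling idea, as a sketch: wrap the loop in a function that gives up after a timeout and takes the listing command as arguments, so it can be tried without a cluster. wait_until_idle is a hypothetical name, and the "fewer than 2 lines means idle" rule is the same assumption as above.

```shell
#!/bin/bash
# Poll a listing command until it prints fewer than 2 lines.
# Usage: wait_until_idle INTERVAL TIMEOUT COMMAND [ARGS...]
# Returns 0 once idle, 1 if TIMEOUT seconds pass first (0 disables the timeout).
wait_until_idle() {
    local interval="$1" timeout="$2" start=$SECONDS lines
    shift 2
    while true; do
        lines=$("$@" 2> /dev/null | wc -l)
        if [ "$((lines))" -lt 2 ]; then
            return 0
        fi
        if [ "$timeout" -gt 0 ] && [ $((SECONDS - start)) -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
    done
}
```

For example, wait_until_idle 15 3600 bjobs polls every 15 seconds for at most an hour. If I read the bjobs manual correctly, passing bjobs -J 'myjobs*' instead should restrict the count to jobs with matching names, so unrelated jobs of yours don't hold up the loop.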

answered 2020-06-16T09:48:34.590