I am trying to throttle a parallelized script. The script's purpose is to read a list inside each of 10 sample folders and run a samtools command for every record of the list, which is the most demanding part.

Here is the simplified version:

for (10 items)
do
  while read (list 5000 items)
  do
    command 1
    command 2
    command 3
    ...
    samtools view -L input1 input2 |many_pipes_including_'awk' > output_file &
    ### TODO (WARNING): currently all processes are forked at the same time. this needs to be resolved. limit to a certain number of processes.
  done
done
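The TODO above (limiting how many jobs are forked at once) can also be handled with plain bash job control, without GNU parallel. A minimal, hedged sketch; `sleep` stands in for the samtools pipeline and is not part of the original script (requires bash >= 4.3 for `wait -n`):

```shell
MAX_JOBS=4
for i in $(seq 1 10); do
    # Block while MAX_JOBS background jobs are still running.
    while [ "$(jobs -rp | wc -l)" -ge "$MAX_JOBS" ]; do
        wait -n    # returns as soon as any one background job finishes
    done
    sleep 0.2 &    # placeholder for: samtools view ... | pipes > output &
done
wait               # wait for the remaining jobs before continuing
```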

To make use of our local server, the script forks these jobs into the background, which works. But it keeps forking until all of the server's resources are used up and nobody else can work on it.

Therefore I would like to implement something like `parallel -j 50` with GNU parallel. I tried putting it in front of the samtools command that gets forked, like

parallel -j 50 -k samtools view -L input1 input2 |many_pipes_including_'awk' > output_file &

This did not work (I also tried it with backticks), and I got

[main_samview] region "item_from_list" specifies an unknown reference name. Continue anyway.

or somehow even vim was invoked. But I am also not sure whether this is the right position for `parallel` within the script. Do you have any idea how to solve this, so that the number of forked processes is limited?

I also thought about using the FIFO-based semaphore implementation mentioned in https://unix.stackexchange.com/questions/103920/parallelize-a-bash-for-loop/103921, but I was hoping GNU parallel could do what I am looking for. I checked further pages, e.g. https://zvfak.blogspot.de/2012/02/samtools-in-parallel.html and https://davetang.org/muse/2013/11/18/using-gnu-parallel/, but they usually do not address this combination of problems.

Here is a more detailed version of the script, in case any of the commands in it might be relevant (I have heard that awk, backticks and newlines can often be a problem?):

cd path_to_data
for SAMPLE_FOLDER in *
do
    cd ${SAMPLE_FOLDER}/another_folder
    echo "$SAMPLE_FOLDER was found"

    cat list_with_products.txt | while read PRODUCT_NAME_NO_SPACES
    do
        PRODUCT_NAME=`echo ${PRODUCT_NAME_NO_SPACES} | tr "@" " "`
        echo "$PRODUCT_NAME with white spaces"
        BED_FILENAME=${BED_DIR}/intersect_${PRODUCT_NAME_NO_SPACES}_${SAMPLE_FOLDER}.bed
        grep "$PRODUCT_NAME" file_to_search_through > ${TMP_DIR}/tmp.gff

        cat ${TMP_DIR}/tmp.gff | some 'awk' command > ${BED_FILENAME}

        samtools view -L ${BED_FILENAME} another_input_file.bam | many | pipes | with | 'awk' | and | perl | etc > resultfolder/resultfile &
        ### TODO (WARNING): currently all processes are forked at the same time. this needs to be resolved. limit to a certain number of processes.
        rm ${TMP_DIR}/tmp.gff
    done
    cd (back_to_start)
done
rmdir -p ${OUTPUT_DIR}/tmp
1 Answer

First make a function that takes a single sample + a single product as input:

cd path_to_data
# Process a single sample and a single product name
doit() {
    SAMPLE_FOLDER="$1"
    PRODUCT_NAME_NO_SPACES="$2"
    SEQ="$3"
    # Make sure temporary files are named uniquely
    # so any parallel job will not overwrite these.
    GFF=${TMP_DIR}/tmp.gff-$SEQ
    cd ${SAMPLE_FOLDER}/another_folder
    echo "$SAMPLE_FOLDER was found"

    PRODUCT_NAME=`echo ${PRODUCT_NAME_NO_SPACES} | tr "@" " "`
    echo "$PRODUCT_NAME with white spaces"
    BED_FILENAME=${BED_DIR}/intersect_${PRODUCT_NAME_NO_SPACES}_${SAMPLE_FOLDER}.bed
    grep "$PRODUCT_NAME" file_to_search_through > $GFF

    cat $GFF | some 'awk' command > ${BED_FILENAME}

    samtools view -L ${BED_FILENAME} another_input_file.bam | many | pipes | with | 'awk' | and | perl | etc
    rm $GFF
    rmdir -p ${OUTPUT_DIR}/tmp
}
export -f doit
# These variables are defined outside the function and must be exported to be visible
export BED_DIR
export TMP_DIR
export OUTPUT_DIR
# If there are many of these variables, use env_parallel instead of
# parallel. Then you do not need to export the variables.

If list_with_products.txt is the same for every sample:

parallel --results outputdir/ doit {1} {2} {#} ::: * :::: path/to/list_with_products.txt

If list_with_products.txt differs from sample to sample:

# Generate a list of:
# sample \t product
parallel --tag cd {}\;cat list_with_products.txt ::: * |
  # call doit on each sample,product. Put output in outputdir
  parallel --results outputdir/ --colsep '\t' doit {1} {2} {#}
Answered 2018-04-05T20:11:56.417