snakemake - 在snakemake运行期间动态减少输入文件集

Question

这更多是关于蛇形能力的技术问题。我想知道是否可以在snakemake 运行期间动态更改输入样本集。

我想这样做的原因如下：让我们假设一组示例相关联的 bam 文件。第一条规则确定每个样本的质量（基于 bam 文件），即涉及所有输入文件。然而，给定特定的标准，只有一部分样本被认为是有效的，应该进一步处理。所以下一步（例如基因计数或其他）应该只对批准的 bam 文件进行，如下面的最小示例所示：

configfile: "config.yaml"

rule all:
   input: "results/gene_count.tsv"

rule a:
   input: expand( "data/{sample}.bam", sample=config['samples'])
   output: "results/list_of_qual_approved_samples.out"
   shell: '''command'''

rule b:
   input: expand( "data/{sample}.bam", sample=config['valid_samples'])
   output: "results/gene_count.tsv"
   shell: '''command'''

在此示例中，规则 a 将使用有效示例名称列表扩展配置文件，即使我相信知道这是不可能的。

当然，直接的解决方案是有两个不同的输入：1.) 所有 bam 文件和 2.) 列出所有有效文件的文件。这将归结为在规则代码中进行样本选择。

rule alternative_b:
   input: 
      expand( "data/{sample}.bam", sample=config['samples']),
      "results/list_of_qual_approved_samples.out"
   output: "results/gene_count.tsv"
   shell: '''command'''

但是，您是否看到了一种设置规则以实现第一个示例的行为的方法？

非常感谢，拉尔夫

score 1 · Accepted Answer

我想我有一个有趣的答案。

起初我认为这是不可能的。因为Snakemake needs the final files最后。所以你不能一开始就分离一组文件而不知道分离。

但后来我尝试了函数dynamic。使用动态功能，您不必知道规则将创建的文件数量。

所以我编码了这个：

rule all:
   input: "results/gene_count.tsv"

rule a:
   input: expand( "data/{sample}.bam", sample=config['samples'])
   output: dynamic("data2/{foo}.bam")
   shell: 
     './bloup.sh "{input}"'

rule b:
   input: dynamic("data2/{foo}.bam")
   output: touch("results/gene_count.tsv")
   shell: '''command'''

就像在您的第一个示例中一样，snakefile 想要生成一个名为results/gene_count.ts.

从配置文件中获取rule a所有样本。此规则执行选择要创建的文件的脚本。我有4 initial文件（geneA、geneB、geneC、geneD）和only touches two第二个库中的输出（geneA 和geneD 文件）。动态功能没有问题。

rule b获取所有由rule a. 所以你只需要产生results/gene_count.tsv。我只是在示例中触及它。

以下是log of Snakemake更多信息：

Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       a
        1       all
        1       b
        3

rule a:
    input: data/geneA.bam, data/geneB.bam, data/geneC.bam, data/geneD.bam
    output: data2/{*}.bam (dynamic)

Subsequent jobs will be added dynamically depending on the output of this rule
./bloup.sh "data/geneA.bam data/geneB.bam data/geneC.bam data/geneD.bam"

Dynamically updating jobs
Updating job b.
1 of 3 steps (33%) done
rule b:
    input: data2/geneD.bam, data2/geneA.bam
    output: results/gene_count.tsv

command
Touching output file results/gene_count.tsv.

2 of 3 steps (67%) done

localrule all:
    input: results/gene_count.tsv

3 of 3 steps (100%) done

score 1 · Accepted Answer

另一种方法，一种不使用“动态”的方法。

并不是您不知道要使用多少个文件，而是您只使用了您将开始使用的文件的子集。由于您能够生成所有潜在文件的“samples.txt”列表，我将假设您有一个坚定的起点。

我做了类似的事情，我有一些我想要处理有效性的初始文件，（在我的例子中，我正在提高质量~排序、索引等）。然后，我想忽略除结果文件之外的所有内容。

我建议，为了避免创建第二个示例文件列表，创建第二个数据目录 (reBamDIR)，data2 (BamDIR)。在 data2 中，您对所有有效文件进行符号链接。这样，Snake 就可以处理 data2 目录中的所有内容。使管道向下移动更容易，管道可以停止依赖样本列表，它可以使用通配符处理所有内容（更容易编码）。这是可能的，因为当我进行符号链接时，我会标准化名称。我在输出规则中列出了符号链接文件，以便 Snakemake 了解它们，然后它可以创建 DAG。

`-- output
    |-- bam
    |   |-- Pfeiffer2.bam ->     /home/tboyarski/share/projects/tboyarski/gitRepo-LCR-BCCRC/Snakemake/buildArea/output/reBam/Pfeiffer2_realigned_sorted.bam
    |   `-- Pfeiffer2.bam.bai ->     /home/tboyarski/share/projects/tboyarski/gitRepo-LCR-    BCCRC/Snakemake/buildArea/output/reBam/Pfeiffer2_realigned_sorted.bam.bai
    |-- fastq
    |-- mPile
    |-- reBam
    |   |-- Pfeiffer2_realigned_sorted.bam
    |   `-- Pfeiffer2_realigned_sorted.bam.bai

在这种情况下，您所需要的只是“验证器”中的返回值，以及响应它的条件运算符。

我认为你已经在某个地方有了这个，因为你必须在验证步骤中使用条件。不要使用它将文件名写入 txt 文件，只需将文件符号链接到最终位置并继续。

我的原始数据在 reBamDIR 中。我存储在 BamDIR 中的最终数据。我只将管道中这个阶段的文件符号链接到 bamDIR。reBamDIR 中还有其他文件，但我不希望管道的其余部分看到它们，所以我将它们过滤掉。

我不完全确定如何实现“验证器”和你的条件，因为我不知道你的情况，我也在学习。只是试图提供替代的观点//方法。

from time import gmtime, strftime

rule indexBAM:
    input:
        expand("{outputDIR}/{reBamDIR}/{{samples}}{fileTAG}.bam", outputDIR=config["outputDIR"], reBamDIR=config["reBamDIR"], fileTAG=config["fileTAG"])
    output:
        expand("{outputDIR}/{reBamDIR}/{{samples}}{fileTAG}.bam.bai", outputDIR=config["outputDIR"], reBamDIR=config["reBamDIR"], fileTAG=config["fileTAG"]),
        expand("{outputDIR}/{bamDIR}/{{samples}}.bam.bai", outputDIR=config["outputDIR"], bamDIR=config["bamDIR"]),
        expand("{outputDIR}/{bamDIR}/{{samples}}.bam", outputDIR=config["outputDIR"], bamDIR=config["bamDIR"])
    params:
        bamDIR=config["bamDIR"],
        outputDIR=config["outputDIR"],
        logNAME="indexBAM." + strftime("%Y-%m-%d.%H-%M-%S", gmtime())
    log:
        "log/" + config["reBamDIR"]
    shell:
        "samtools index {input} {output[0]} " \
        " 2> {log}/{params.logNAME}.stderr " \
        "&& ln -fs $(pwd)/{output[0]} $(pwd)/{params.outputDIR}/{params.bamDIR}/{wildcards.samples}.bam.bai " \
        "&& ln -fs $(pwd)/{input} $(pwd)/{params.outputDIR}/{params.bamDIR}/{wildcards.samples}.bam"

score 0 · Accepted Answer

**这并不完全是您问题的答案，而是实现您目标的建议。**

我认为在管道运行期间修改 yaml 文件是不可能的——或者至少不是微不足道的。

就个人而言，当我运行snakemake 工作流程时，我使用我称之为“元数据”的外部文件。它们包括一个配置文件，还有一个包含样本列表的选项卡文件（可能还有关于所述样本的其他信息）。配置文件包含一个参数，该参数是该文件的路径。

在这样的设置中，我建议让您的“规则 a”输出另一个包含所选样本的选项卡文件，并且该文件的路径可以包含在配置文件中（即使在您启动工作流程时它不存在）。规则 b 会将该文件作为输入。

在您的情况下，您可以：

config:
 samples: "/path/to/samples.tab"
 valid_samples: "/path/to/valid_samples.tab"

我不知道这是否有意义，因为它基于我自己的组织。我认为它很有用，因为它允许存储比样本名称更多的信息，并且如果您有 100 个样本，则管理起来会容易得多！

snakemake - 在snakemake运行期间动态减少输入文件集

3 回答 3

Related

Reference