snakemake - Understanding and overcoming AmbiguousRuleException in snakemake

Question

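I have a complex workflow that I am extending gradually. The latest extension caused an AmbiguousRuleException. I have tried to reproduce the key structure of the workflow in the following example: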

NUMBERS = ["1", "2"]
LETTERS = ["a", "b", "c"]
WORDS = ["foo", "bar", "baz"]
CHOICES = ["yes", "no"]


rule all:
    input:
        # (1)
        expand("results/allthings/{word}_{choice}.md5sum", word=WORDS, choice=CHOICES)
        #expand("results/allthings/{word}_{choice}.md5sum", word=WORDS + ["all"], choice=CHOICES)

rule make_things:
    output:
        "results/{letter}_{number}/{word}_{choice}.txt"
    shell:
        """
        echo "{wildcards.letter}_{wildcards.number}_{wildcards.word}_{wildcards.choice}" > {output}
        """

rule gather_things:
    input:
        expand("results/{letter}_{number}/{{word}}_{{choice}}.txt", letter=LETTERS, number=NUMBERS)
    output:
        "results/allthings/{word}_{choice}.txt"
    shell:
        """
        cat {input} > {output}
        """

# (2)
#rule join_all_words:
#    input:
#        expand("results/allthings/{word}_{{choice}}.txt", word=WORDS)
#    output:
#        "results/allthings/all_{choice}.txt"
#    shell:
#        """
#        cat {input} > {output}
#        """
# (3)
#def source_data(wildcards):
#    if wildcards.word == "all":
#        return rules.join_all_words.output
#    else:
#        return rules.gather_things.output

rule compute_md5:
    input:
        # (4)
        rules.gather_things.output,
        #source_data
    output:
        "results/allthings/{word}_{choice}.md5sum"
    shell:
        """
        md5sum {input} > {output}
        """

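The state shown above works. Switching (1) and (4) and uncommenting (2) and (3) corresponds to the extension I am trying to make, and it results in the following failure: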

AmbiguousRuleException:
Rules gather_things and join_all_words are ambiguous for the file results/allthings/all_yes.txt.
Expected input files:
    gather_things: results/a_1/all_yes.txt results/a_2/all_yes.txt results/b_1/all_yes.txt results/b_2/all_yes.txt results/c_1/all_yes.txt results/c_2/all_yes.txt
    join_all_words: results/allthings/foo_yes.txt results/allthings/bar_yes.txt results/allthings/baz_yes.txt

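It seems that snakemake thinks that results/allthings/all_yes.txt can be generated by gather_things.

Why?

How can I avoid this?

Note: the goal of modifications (3) and (4) is to have compute_md5 handle both the direct outputs of gather_things (for foo, bar and baz) and the joined output of the three (all), while defining the input in terms of the outputs of other rules as much as possible (which makes changes easier than using explicit file names).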

score 2 · Accepted Answer

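2017-07-28 Edited post for brevity

At first I thought this was only a matter of ambiguity. The first 3 points deal with resolving ambiguity. After that, I explain how to generalize "compute_md5" to get the desired behavior.

Controlling ambiguity

1) Controlling the flow of ambiguity:

ruleorder http://snakemake.readthedocs.io/en/latest/snakefiles/rules.html?highlight=ruleorder#handling-ambiguous-rules

I would advise avoiding this in the following situation. If you are aiming for modularity, using "ruleorder" essentially couples two rules together. The "ruleorder" feature can only be used when both rules exist within the scope of the Snakefile, which can be a problem for modularity if the rules are not always provided together. If the rules are always provided together, I consider them already coupled, so doing this does not make things worse and in fact increases cohesion. Use "ruleorder" when "constraints" are not enough, because sometimes there will be unavoidable ambiguity.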

https://en.wikipedia.org/wiki/GRASP_(object-oriented_design)
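In a Snakefile, the ordering for this question would be the single line `ruleorder: join_all_words > gather_things`. To illustrate what such an ordering decides, here is a small conceptual model in plain Python. It is not Snakemake's implementation: the function `pick_rule` and the `producers` predicates are hypothetical stand-ins for "which rules can produce this file".

```python
# Conceptual model only: this is NOT how Snakemake is implemented.
# Each rule name maps to a predicate saying whether it could produce a file.
def pick_rule(target, producers, ruleorder=()):
    """Return the single rule chosen for `target`, or raise on ambiguity."""
    candidates = [name for name, can_make in producers.items() if can_make(target)]
    if not candidates:
        raise LookupError(f"no rule produces {target}")
    if len(candidates) == 1:
        return candidates[0]
    # With an explicit order, the earliest-listed candidate wins.
    ranked = [name for name in ruleorder if name in candidates]
    if ranked:
        return ranked[0]
    raise RuntimeError(f"rules {candidates} are ambiguous for {target}")


# Hypothetical predicates approximating the two output patterns in the question.
producers = {
    "gather_things": lambda f: f.startswith("results/allthings/") and f.endswith(".txt"),
    "join_all_words": lambda f: f.startswith("results/allthings/all_") and f.endswith(".txt"),
}

target = "results/allthings/all_yes.txt"
chosen = pick_rule(target, producers, ruleorder=("join_all_words", "gather_things"))
print(chosen)
```

Without the `ruleorder` argument the same call raises, which mirrors the situation in the question: both rules can produce the file and nothing says which one should.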

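Conditional "include" https://github.com/tboyarski/BCCRC-Snakemake/tree/master/modules/bamGen

The ruleorder is in "_INCLUDE". The outputs of sam2BAM and bamALIGN_bwa are very similar, mainly because sam2BAM is so generic.

Because bamALIGN_bwa and bamALIGN_star are technically interchangeable, and I do not want users to swap the rule order just to switch between them, I keep a boolean in my YAML file that acts as a hard filter and prevents Snakemake from even seeing the rule. This works very well when only one or the other can be chosen. In this case the two aligners each have their own reference genome, and I force the user to set the reference genome at the beginning of my pipeline, so the user can never actually run both. I have not yet implemented a way to detect which reference genome is in use and select the corresponding aligner. That would need some extra Python code; it is a good idea, but it is not implemented at the moment.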
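The boolean filter described above can be sketched in plain Python as follows. The config key `use_bwa` and the `.smk` file names are hypothetical and are not taken from the linked repository; only the rule names come from the text.

```python
# Hypothetical config key and module file names, for illustration only.
def modules_to_include(config):
    """Return the rule files to include, so that only one aligner is visible."""
    modules = ["modules/bamGen/sam2BAM.smk"]
    if config.get("use_bwa", True):
        modules.append("modules/bamGen/bamALIGN_bwa.smk")
    else:
        modules.append("modules/bamGen/bamALIGN_star.smk")
    return modules


# In a Snakefile, the same if/else would wrap `include:` directives directly,
# since a Snakefile is Python code with extra directives.
print(modules_to_include({"use_bwa": False}))
```

Because the excluded file is never included, its rules do not exist in the workflow at all, so no ordering between the two aligners is needed.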

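2) Asking Snakemake to ignore the ambiguity.

An override. It exists, but I think "--allow-ambiguity" should be avoided whenever possible.

http://snakemake.readthedocs.io/en/latest/snakefiles/rules.html?highlight=--allow-ambiguity#handling-ambiguous-rules

3) Gracefully preventing ambiguity.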

http://snakemake.readthedocs.io/en/latest/snakefiles/rules.html?highlight=wildcard_constraints#wildcards

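rule gather_things:
    input:
        expand("results/{letter}_{number}/{{word}}_{{choice}}.txt", letter=LETTERS, number=NUMBERS)
    output:
        "results/allthings/{word}_{choice}.txt"
    wildcard_constraints:
        # Caveat: [^(all)] is a character class, so this rejects any word whose
        # first character is "(", "a", "l" or ")", not only the string "all".
        # That is sufficient for foo, bar and baz.
        word='[^(all)][0-9a-zA-Z]*'
...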

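This rule needs a wildcard_constraint to stop it from competing with the "join_all_words" rule. That is easily done by preventing the wildcard "word" here from being the string "all". This makes "gather_things" and "join_all_words" distinguishable.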
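To see why the ambiguity arises and how a constraint removes it, the following sketch mimics wildcard matching with Python's `re` module. It assumes that each `{wildcard}` becomes a named group and that an unconstrained wildcard matches `.+`; the helper `output_regex` is hypothetical and only approximates what Snakemake does. The lookahead constraint `(?!all_)[0-9a-zA-Z]+` is an alternative I am suggesting, not part of the rule above.

```python
import re


# Assumption: each {wildcard} becomes a named group; unconstrained ones match ".+".
def output_regex(pattern, constraints=None):
    """Compile an output pattern such as 'a/{x}_{y}.txt' into a regex."""
    constraints = constraints or {}
    # re.split with a capturing group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    pieces = []
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pieces.append(re.escape(part))
        else:
            pieces.append(f"(?P<{part}>{constraints.get(part, '.+')})")
    return re.compile("".join(pieces) + "$")


gather_output = "results/allthings/{word}_{choice}.txt"
all_target = "results/allthings/all_yes.txt"

# Unconstrained: gather_things also matches the file meant for join_all_words.
unconstrained = output_regex(gather_output).match(all_target)
print(unconstrained.groupdict())

# The constraint from the rule above, and a stricter lookahead alternative.
char_class = output_regex(gather_output, {"word": "[^(all)][0-9a-zA-Z]*"})
lookahead = output_regex(gather_output, {"word": "(?!all_)[0-9a-zA-Z]+"})

for name in ("all_yes", "foo_yes", "apple_yes"):
    target = f"results/allthings/{name}.txt"
    print(name, bool(char_class.match(target)), bool(lookahead.match(target)))
```

Under these assumptions, the unconstrained pattern matches with word "all" and choice "yes", which is exactly the ambiguity in the error message. Both constraints reject all_yes and accept foo_yes, but the character class also rejects apple_yes because it excludes any word starting with "a" or "l", while the lookahead version only excludes the exact word "all".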

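Generalizing compute_md5

As for making "compute_md5" accept input from both "gather_things" and "join_all_words", that requires making it more generic and has nothing to do with ambiguity. The next thing to do is adjust the "compute_md5" rule so that it does not depend on the output of any particular rule.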
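A generic pattern such as `{pathCMD5}/{sample}.md5sum` serves both cases because the input path is derived purely from the requested output. The sketch below illustrates this with a regular expression, assuming both wildcards are unconstrained and behave like `.+`; the function `md5_input_for` is a hypothetical helper, not a Snakemake API.

```python
import re

# Assumption: "{pathCMD5}/{sample}.md5sum" behaves like this regex, with both
# wildcards unconstrained (".+").
MD5_OUTPUT = re.compile(r"(?P<pathCMD5>.+)/(?P<sample>.+)\.md5sum$")


def md5_input_for(target):
    """Return the .txt file a generic md5 rule would request for `target`."""
    match = MD5_OUTPUT.match(target)
    if match is None:
        raise ValueError(f"{target} is not an .md5sum target")
    return "{pathCMD5}/{sample}.txt".format(**match.groupdict())


for target in ("results/allthings/all_yes.md5sum", "results/allthings/foo_no.md5sum"):
    print(target, "<-", md5_input_for(target))
```

The rule never needs to know whether the requested .txt file comes from "gather_things" or "join_all_words"; whichever rule can produce that file is resolved separately.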

https://github.com/tboyarski/BCCRC-Snakemake/blob/master/help/download.svg

I would also like to thank you for providing a TOP-NOTCH example. Outstanding!

NUMBERS = ["1", "2"]
LETTERS = ["a", "b", "c"]
WORDS = ["foo", "bar", "baz"]
CHOICES = ["yes", "no"]


rule all:
    input:
        expand("results/allthings/all_{choice}.md5sum", choice=CHOICES),
        expand("results/allthings/{word}_{choice}.md5sum", word=WORDS, choice=CHOICES)

rule make_things:
    output:
        "results/{letter}_{number}/{word}_{choice}.txt"
    shell:
        """
        echo "{wildcards.letter}_{wildcards.number}_{wildcards.word}_{wildcards.choice}" > {output}
        """

rule gather_things:
    input:
        expand("results/{letter}_{number}/{{word}}_{{choice}}.txt", letter=LETTERS, number=NUMBERS)
    output:
        "results/allthings/{word}_{choice}.txt"
    wildcard_constraints:
        # Keeps this rule from matching "all" (see the caveat on this pattern above).
        word='[^(all)][0-9a-zA-Z]*'
    shell:
        """
        cat {input} > {output}
        """

rule join_all_words:
    input:
        expand("results/allthings/{word}_{{choice}}.txt", word=WORDS)
    output:
        "results/allthings/all_{choice}.txt"
    shell:
        """
        cat {input} > {output}
        """

rule compute_md5:
    input:
        # Generic: the input is derived from the requested output, not from a specific rule.
        "{pathCMD5}/{sample}.txt"
    output:
        "{pathCMD5}/{sample}.md5sum"
        #"results/allthings/{word}_{choice}.md5sum"
    shell:
        """
        md5sum {input} > {output}
        """