python - 在 python 脚本中使用 pysam.TabixFile 来注释读取的振荡处理速度

Question

最初的问题

我正在用python（3.5）编写一个生物信息学脚本，它解析一个表示在基因组上对齐的测序读数的大型（排序和索引） bam文件，将基因组信息（“注释”）与这些读数相关联，并计算遇到的注释类型. 我正在测量我的脚本处理对齐读取的速度（超过 1000 次读取），我获得了以下速度变化：

什么可以解释这种模式？

我的直觉会让我押注一些数据结构，随着它变得越来越密集，它会逐渐变慢，但它会不时扩展。

不过，内存使用似乎并不显着（运行近 2 小时后，我的脚本仍然只使用了我计算机内存的 0.1%，根据htop）。

我的代码是如何工作的（见最后的实际代码）

我正在使用该pysam模块进行 bam 文件解析。AlignmentFile.fetch方法为我提供了一个迭代器，以AlignedSegment对象的形式提供有关连续对齐读取的信息。

我根据它们的对齐坐标和gtf格式的注释文件（使用 bgzip 压缩并使用 tabix 索引）将注释与读取相关联。我使用TabixFile.fetch方法（仍然 from pysam）来获取这些注释，我过滤它们并以frozenset字符串的形式生成它们的摘要（process_annotations，下面未显示，返回这样的 a frozenset），在内部循环的生成器函数中在 AlignedSegment 迭代器上。

我将生成的frozensets 提供给一个Counter对象。计数器能否对观察到的速度行为负责？

我怎样才能知道如何避免这些定期减速？

附加测试

根据评论中的建议，我分析了我的整个分析cProfile，发现在访问注释数据时花费了最多的运行时间（pysam/ctabix.pyx:579(__cnext__)请参阅稍后的调用图），如果我理解正确的话，这是一些与 samtools C 库交互的 Cython 代码. 观察到的减速的原因似乎很难理解。

为了加快我的脚本速度，我尝试了另一种基于pybedtoolspython 接口和 bedtools 的解决方案，它还可以从 gtf 文件 ( https://daler.github.io/pybedtools/index.html ) 中检索注释。

速度

速度提升非常重要。以下是实际的命令和计时结果（两者实际上是并行运行的）：

$ time python3 -m cProfile -o tests/total_pybedtools.prof ~/src/bioinfo_utils/small_RNA_seq_annotate.py -b results/bowtie2/mapped_C_elegans/WT_1_21-26_on_C_elegans_sorted.bam -g annotations/all_annotations.gtf -a "pybedtools" -l total_pybedtools.log > total_pybedtools.out

real    5m48.474s
user    5m48.204s
sys     0m1.336s

$ time python3 -m cProfile -o tests/total_tabix.prof ~/src/bioinfo_utils/small_RNA_seq_annotate.py -b results/bowtie2/mapped_C_elegans/WT_1_21-26_on_C_elegans_sorted.bam -g annotations/all_annotations.gtf.gz -a "tabix" -l total_tabix.log > total_tabix.out

real    195m40.990s
user    194m54.356s
sys     0m47.696s

（需要注意：两种方法的注释结果略有不同。也许我应该检查一下我如何处理坐标。）

速度曲线没有先前观察到的长期下降：

我的速度问题已经解决，但我仍然对为什么基于 tabix 的方法有这些速度下降感兴趣。为此，我添加了“生物信息学”和“samtools”标签。

调用图

作为记录，我在分析结果上使用 gprof2dot 生成了调用图：

$ gprof2dot -f pstats tests/total_pybedtools.prof \
    | dot -Tpng -o tests/total_pybedtools_callgraph.png
$ gprof2dot -f pstats tests/total_tabix.prof \
    | dot -Tpng -o tests/total_tabix_callgraph.png

以下是基于 tabix 方法的调用图：

对于基于 pybedtools 的方法：

编码

这是我当前代码的主要部分：

@contextmanager
def annotation_context(annot_file, getter_type):
    """Yields a function to get annotations for an AlignedSegment."""
    if getter_type == "tabix":
        gtf_parser = pysam.ctabix.asGTF()
        gtf_file = pysam.TabixFile(annot_file, mode="r")
        fetch_annotations = gtf_file.fetch
        def get_annotations(ali):
            """Generates an annotation getter for *ali*."""
            return fetch_annotations(*ALI2POS_INFO(ali), parser=gtf_parser)
    elif getter_type == "pybedtools":
        gtf_file = open(annot_file, "r")
        # Does not work because somehow gets "consumed" after first usage
        #fetch_annotations = BedTool(gtf_file).all_hits
        # Much too slow
        #fetch_annotations = BedTool(gtf_file.readlines()).all_hits
        # https://daler.github.io/pybedtools/topical-low-level-ops.html
        fetch_annotations = BedTool(gtf_file).as_intervalfile().all_hits
        def get_annotations(ali):
            """Generates an annotation list for *ali*."""
            return fetch_annotations(Interval(*ALI2POS_INFO(ali)))
    else:
        raise NotImplementedError("%s not available" % getter_type)
    yield get_annotations
    gtf_file.close()

def main():
    """Main function of the program."""
    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument(
        "-b", "--bamfile",
        required=True,
        help="Sorted and indexed bam file containing the mapped reads."
        "A given read is expected to be aligned at only one location.")
    parser.add_argument(
        "-g", "--gtf",
        required=True,
        help="A sorted, bgzip-compressed gtf file."
        "A corresponding .tbi tabix index should exist.")
    parser.add_argument(
        "-a", "--annotation_getter",
        choices=["tabix", "pybedtools"],
        default="tabix",
        help="Method to use to access annotations from the gtf file.")
    parser.add_argument(
        "-l", "--logfile",
        help="File in which to write logs.")
    args = parser.parse_args()
    if not args.logfile:
        logfilename = "%s.log" % args.annotation_getter
    else:
        logfilename = args.logfile
    logging.basicConfig(
        filename=logfilename,
        level=logging.DEBUG)
    INFO = logging.info
    DEBUG = logging.debug
    WARNING = logging.warning
    process_annotations = make_annotation_processor(args.annotation_getter)
    with annotation_context(args.gtf, args.annotation_getter) as get_annotations:
        def generate_annotations(bamfile):
            """Generates annotations for the alignments in *bamfile*."""
            last_t = perf_counter()
            for i, ali in enumerate(bamfile.fetch(), start=1):
                if not i % 1000:
                    now = perf_counter()
                    INFO("%d alignments processed (%.0f alignments / s)" % (
                        i,
                        1000.0 / (now - last_t)))
                    #if not i % 50000:
                    #    gc.collect()
                    last_t = perf_counter()
                yield process_annotations(get_annotations(ali), ali)
        with pysam.AlignmentFile(args.bamfile, "rb") as bamfile:
            annot_stats = Counter(generate_annotations(bamfile))
            print(*reversed(annot_stats.most_common()), sep="\n")
    return 0

（我使用了上下文管理器和其他高阶函数（make_annotation_processor以及这个函数调用的函数），以便更容易在同一个脚本中使用各种注释检索方法。）

python - 在 python 脚本中使用 pysam.TabixFile 来注释读取的振荡处理速度

最初的问题

附加测试

速度

调用图

编码

0 回答 0

Related

Reference