python - 如何缓存读取？

Question

我正在使用 python/pysam 来分析测序数据。在其命令伙伴的教程（pysam - An interface for reading and writing SAM files）中，它说：

'这种方法对于高通量处理来说太慢了。如果需要与其伙伴一起处理读取，请从读取名称排序的文件中工作，或者更好的是缓存读取。

您将如何“缓存读取”？

score 9 · Accepted Answer

缓存是加速长时间运行操作的典型方法。为了计算速度，它牺牲了内存。

假设您有一个函数，它给定一组参数总是返回相同的结果。不幸的是，这个函数非常慢，你需要调用它相当多的时间来减慢你的程序。

您可以做的是存储有限数量的 {parameters: result} 组合，并在使用相同参数调用函数时跳过其逻辑。

这是一个肮脏的技巧，但非常有效，特别是如果参数组合与函数速度相比较低。

在 Python 3 中有一个用于此目的的装饰器。
在 Python 2 中，一个库可以提供帮助，但您需要做更多的工作。

score 3 · Accepted Answer

AlignmentFile 作为第一个参数：

文件路径或对象

因此，您可以提供一个支持类文件接口的对象，即方法seek、read、，而不是提供文件名tell。在为此实现一个类时，您还可以在读取上实现缓存，这当然必须取决于当前光标位置。

如果文件大小足够小以适合内存，您可以读取完整的文件并对io.BytesIO对象进行操作，无需创建自己的类：

data = io.BytesIO(open('datafile','rb').read())
your_object = AlignmentFile(data, <other args>)

我不确定这是否会加快速度，因为我假设现代操作系统（我知道 linux 会这样做）进行缓存文件访问。所以也许依靠它就足够了。

score 0 · Accepted Answer

我发现其他答案并未解决如何在实践中实际缓存读取。

这是一个简单的方法来做到这一点：

from collections import defaultdict

from pysam import AlignmentFile

def get_mate(read_pairs, read):
    if read.qname not in read_pairs or not (read.is_read1 ^ read.is_read2):
      return None
    pos = 1 if read.is_read1 else 0
    return read_pairs[read.qname][pos]

# maps QNAME to a read pair
read_pairs = defaultdict(lambda : [None, None])

fin = AlignmentFile("your_filepath")

for read in fin.fetch(your_chrom,your_start,your_stop):
    if read.is_paired and (read.is_read1 ^ read.is_read2):
        pos = 0 if read.is_read1 else 1
        read_pairs[read.qname][pos] = read

## Now compare execution time of these two commands
your_read_mate = fin.mate(your_read) # pysam, non-cached
your_read_mate = get_mate(read_pairs, your_read) # cached

其中读取对的操作定义是（参见SAM 格式）：

两个读取具有相同的 QNAME
每次读取都设置了标志 0x1 ( read.is_paired)
每次读取仅设置标志 0x40 ( read.is_read1) 或 0x80 ( read.is_read2) 之一（XORread.is_read1 ^ read.is_read2对此进行检查）

在我的机器上，使用 ipython 的%timeit命令，我得到18.9 ms ± 510 µs了非缓存调用和854 ns ± 28.7 ns给定读取的缓存调用（我知道这对在read_pairs）:-)

python - 如何缓存读取？

3 回答 3

Related

Reference