6

I have a file format (fastq format) that encodes a string of integers as a string where each integer is represented by an ascii code with an offset. Unfortunately, there are two encodings in common use, one with an offset of 33 and the other with an offset of 64. I typically have several 100 million strings of length 80-150 to convert from one offset to the other. The simplest code that I could come up with for doing this type of thing is:

def phred64ToStdqual(qualin):
    return(''.join([chr(ord(x)-31) for x in qualin]))

This works just fine, but it is not particularly fast. For 1 million strings, it takes about 4 seconds on my machine. If I change to using a couple of dicts to do the translation, I can get this down to about 2 seconds.

ctoi = {}
itoc = {}
for i in xrange(127):
    itoc[i]=chr(i)
    ctoi[chr(i)]=i

def phred64ToStdqual2(qualin):
    return(''.join([itoc[ctoi[x]-31] for x in qualin]))

If I blindly run under cython, I get it down to just under 1 second.
It seems like at the C-level, this is simply a cast to int, subtract, and then cast to char. I haven't written this up, but I'm guessing it is quite a bit faster. Any hints including how to better code a this in python or even a cython version to do this would be quite helpful.

Thanks,

Sean

4

1 回答 1

4

如果您查看 urllib.quote 的代码,就会发现与您正在做的事情类似。看起来像:

_map = {}
def phred64ToStdqual2(qualin):
    if not _map:
        for i in range(31, 127):
            _map[chr(i)] = chr(i - 31)
    return ''.join(map(_map.__getitem__, qualin))

请注意,如果映射长度不同(在 urllib.quote 中,您必须采用 '%' -> '%25'.

但实际上,由于每个翻译的长度相同,python 有一个函数可以非常快速地完成此操作:maketranstranslate。你可能不会比:

import string
_trans = None
def phred64ToStdqual4(qualin):
    global _trans
    if not _trans:
        _trans = string.maketrans(''.join(chr(i) for i in range(31, 127)), ''.join(chr(i) for i in range(127 - 31)))
    return qualin.translate(_trans)
于 2010-09-27T16:48:40.020 回答