python - 在 Python (x, y) 中使用产量将 GenBank 解析为 FASTA

Question

现在我试图定义和记录我自己的函数来做这件事，但我在测试代码时遇到了问题，我实际上不知道它是否正确。我用 BioPython、re 或其他方法找到了一些解决方案，但我真的很想用产量来完成这项工作。

#generator for GenBank to FASTA
def parse_GB_to_FASTA (lines):
    #set Default label
    curr_label = None
    #set Default sequence
    curr_seq = ""
    for line in lines:
        #if the line starts with ACCESSION this should be saved as the beginning of the label
        if line.startswith('ACCESSION'):
            #if the label has already been changed
            if curr_label is not None:
                #output the label and sequence
                yield curr_label, curr_seq
                ''' if the label starts with ACCESSION, immediately replace the current label with
                the next ACCESSION number and continue with the next check'''
            #strip the first column and leave the number
            curr_label = '>' + line.strip()[12:]
        #check for the organism column
        elif line.startswith ('  ORGANISM'):
            #add the organism name to the label line
            curr_label = curr_label + " " + line.strip()[12:]
        #check if the region of the sequence starts
        elif line.startswith ('ORIGIN'):
            #until the end of the sequence is reached
            while line.startswith ('//') is False:
                #get a line without spaces and numbers
                curr_seq += line.upper().strip()[12:].translate(None, '1234567890 ')
    #if no more lines, then give the last label and sequence            
    yield curr_label, curr_seq

score 0 · Accepted Answer

我经常处理非常大的 GenBank 文件，发现（几年前）BioPython 解析器太脆弱，无法通过（当时）成千上万条记录，而不会在不寻常的记录上崩溃。

我编写了一个纯 python(2) 函数来从打开的文件中返回下一条完整记录，读取 1k 块，并让文件指针准备好获取下一条记录。我将它与一个使用此函数的简单迭代器和一个 GenBank Record 类绑定在一起，该类具有一个 fasta(self) 方法来获取 fasta 版本。

YMMV，但是获取下一条记录的函数在这里应该可以插入到您要使用的任何迭代器方案中。就转换为 fasta 而言，您可以使用类似于上面抓取的 ACCESSION 和 ORIGIN 的逻辑，或者您可以使用以下方法获取部分的文本（如 ORIGIN）：

sectionTitle='ORIGIN'    
searchRslt=re.search(r'^(%s.+?)^\S'%sectionTitle,
                     gbrText,re.MULTILINE | re.DOTALL) 
sectionText=searchRslt.groups()[0]

像 ORGANISM 这样的小节需要 5 个空格的左侧垫。

这是我对主要问题的解决方案：

def getNextRecordFromOpenFile(fHandle):
    """Look in file for the next GenBank record
    return text of the record
    """
    cSize =1024
    recFound = False
    recChunks = []
    try:
        fHandle.seek(-1,1)
    except IOError:
        pass
    sPos = fHandle.tell()

    gbr=None
    while True:
        cPos=fHandle.tell()
        c=fHandle.read(cSize)
        if c=='':
            return None
        if not recFound:

            locusPos=c.find('\nLOCUS')
            if sPos==0 and c.startswith('LOCUS'):
                locusPos=0
            elif locusPos == -1:
                continue
            if locusPos>0:
                locusPos+=1
            c=c[locusPos:]
            recFound=True
        else:
            locusPos=0

        if (len(recChunks)>0 and
            ((c.startswith('//\n') and recChunks[-1].endswith('\n'))
             or (c.startswith('\n') and recChunks[-1].endswith('\n//'))
             or (c.startswith('/\n') and recChunks[-1].endswith('\n/'))
             )):
            eorPos=0
        else:
            eorPos=c.find('\n//\n',locusPos)

        if eorPos == -1:
            recChunks.append(c)
        else:
            recChunks.append(c[:(eorPos+4)])
            gbrText=''.join(recChunks)
            fHandle.seek(cPos-locusPos+eorPos)
            return gbrText

python - 在 Python (x, y) 中使用产量将 GenBank 解析为 FASTA

1 回答 1

Related

Reference