python - 使用 python/biopython 对整个 genbank 文件的不完整解析

Question

我的脚本的主要目标是将 genbank 文件转换为 gtf 文件。我的问题与从所有CDS 条目中提取 CDS 信息（基因、位置（例如，CDS 2598105..2598404）、codon_start、protein_id、db_xref）有关。我的脚本应该打开/解析一个 genbank 文件，从每个 CDS 条目中提取信息，并将信息写入另一个文件。该脚本不会产生任何错误，但只会在终止之前从 genbank 文件的前 1/2 写入信息。这是我的代码...

import Bio
from Bio import GenBank
from Bio import SeqIO

fileList = ['data_files/e_coli_ref_BA000007.2.gb']
qualies = ['gene', 'protein_id', 'db_xref']

#######################################################DEFINITIONS################################################################
def strip_it(string_name):
    stripers = ['[', ']', '\'', '"']
    for s in stripers:
        string_name = string_name.replace(s, '')
        string_name = string_name.lstrip()
    return string_name

def strip_it_attributes(string_name):
    stripers = ['[', ']', '\'', '"', '{', '}',',']
    for s in stripers:
        string_name = string_name.replace(s, '') 
        string_name = string_name.lstrip() 
        string_name = string_name.replace(': ', '=')
        string_name = string_name.replace(' ', ';')
    return string_name
#---------------------------------------------------------------------------------------------------------------------------------

#######################################################################################################################
for f in fileList:
    nameOut = f.replace('gb', 'gtf')

    with open(f, 'r') as inputFile:
        with open(nameOut, 'w') as outputFile:
            record = next(SeqIO.parse(f, 'genbank'))
            seqid = record.id
            typeName = 'Gene'
            source = 'convert_gbToGFT.py'
            start_codon = 'NA'
            attribute = 'NA'    

            featureCount = 0
            for f in record.features:
                print(f.type)
                string = ''
                if f.type == 'CDS':
                    dic = {}
                    CDS = record.features[featureCount]

                    position = strip_it(str(CDS.location))
                    start = position.split(':')[0]
                    stop = position.split(':')[1].split('(')[0]
                    strand = position.split(':')[1].split('(')[1].replace(')', '')
                    score = '.'

                    for q in qualies:
                        if q in CDS.qualifiers:
                            if q not in dic:
                                dic[q] = ''
                            dic[q] = strip_it(str(CDS.qualifiers[q]))

                    attribute = strip_it_attributes(str(dic))

                    if 'codon_start' in CDS.qualifiers:
                        start_codon = str(int(str(CDS.qualifiers['codon_start'][0]))-1) #need string when finished so it can be added to variable 'string'

                    string = '\t'.join([seqid, source, typeName, start, stop, score, strand, start_codon, attribute])
                    if attribute.count(';') == 2:
                        outputFile.write(string + '\n')

                    featureCount+=1

#---------------------------------------------------------------------------------------------------------------------------------

输出文件的最后一行是：

BA000007.2  convert_gbToGFT.py   Gene  2598104  2598404  .  +  0  protein_i     d=BAB36052.1;db_xref=GI:13362097;gene=ECs2629

基因ECs2629的位置出现在genbank文件的第36094行，但是这个文件的总行数是73498。我已经重新下载了很多次文件，看看是否有下载问题，我已经目测了文件（我觉得它没有错）。我还在另一个同样大的 genbank 文件上尝试了这个脚本，并遇到了相同的问题。

任何人都可以就为什么不解析整个 genbank 文件提供一些建议，我如何修改我的代码以消除这个问题，或者指出另一个可能的解决方案？

（您可以从此处查看 genbank 文件的格式：http ://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html ），但是，我正在使用大肠杆菌genbank 文件（大肠杆菌O157 ：H7 str. Sakai DNA，完整基因组），可在此处找到： http ://www.ncbi.nlm.nih.gov/nuccore/BA000007.2

我正在使用以下内容：Centos 6.7、Python 3.4.3 :: Anaconda 2.3.0（64 位）、Biopython 1.66

[编辑] @Gerrat 建议适用于有问题的文件，但不适用于其他文件。使用http://www.ncbi.nlm.nih.gov/nuccore/NC_000913.3和建议的编辑产生约 28 行输出，而我的原始代码输出 2084 行（但是，应该有 4332 行输出）。

score 1 · Accepted Answer

更改此行：

CDS = record.features[featureCount]

至：

CDS = f

您通过“featureCount”索引访问记录来跳过记录（因为特征计数可能是记录的 1/2）。

编辑：详细说明您的评论：

您的原始脚本是错误的（以您使用的方式featureCount）。我的修正是必要的。如果您还有其他问题，则有其他问题。在这种情况下，似乎有 28 条 CDS 记录，属性计数为 2。（我对基因测序一无所知，我只是按脚本中的变量名称进行操作）。当您切换回 usingfeatureCount时，您现在正在查看“类型”不是“CDS”的记录。它是“基因”或“repeat_region”。您正在检查记录的类型，f看看它是否是CDS，然后使用完全不同的记录，record.features[featureCount]。这些不引用相同的记录（检查该记录的 CDS.type - 在大多数情况下它不再是“CDS”）。

score 0 · Accepted Answer

出于好奇，如果您通过更改来遍历每一行会发生什么：

with open(f, 'r') as inputFile:

至

with open("file") as infile:
    for line in infile:
        do_something_with(line)

variable += 1在循环遍历文件中的行并每次执行以查看行号是否符合您的预期之前将一些变量设置为零也会很有趣

score 0 · Accepted Answer

谢谢@Gerrat 的评论。我重新编写了脚本，它运行良好。

import Bio 
from Bio import GenBank
from Bio import SeqIO

fileList = ['F1.gb', 'F2.gb']

for f in fileList:
      with open(f, 'rU') as handle:
         for record in SeqIO.parse(handle, 'genbank'):
            for feature in record.features:
               if feature.type=='CDS':
                  #[extract feature values here]
                  count+=1

   print('You parsed', count, 'CDS features')

python - 使用 python/biopython 对整个 genbank 文件的不完整解析

3 回答 3

Related

Reference