python - 使用 BCBio 的 GFF 解析器解析不正确

Question

我正在尝试使用 BCBio 的 GFF 解析器，希望可以将它用于我的工具。我从 NCBI 的 RefSeq 数据库中获取了一个测试 .gbk 文件，并用它来解析为一个 .gff 文件。

我使用的代码（来自http://biopython.org/wiki/GFF_Parsing）：

#!/usr/bin/python
from BCBio import GFF
from Bio import SeqIO

def convert_to_GFF3():
    in_file = "/var/www/localhost/NC_009925.gbk"
    out_file = "/var/www/localhost/output/your_file.gff"
    in_handle = open(in_file)
    out_handle = open(out_file, "w")

    GFF.write(SeqIO.parse(in_handle, "genbank"), out_handle)

    in_handle.close()
    out_handle.close()

convert_to_GFF3()

以下是部分结果：

##gff-version 3
##sequence-region NC_009925.1 1 6503724
NC_009925.1 annotation  remark  1   6503724 .   .   .   accessions=NC_009925;comment=PROVISIONAL REFSEQ: This record has not yet been subject to final%0ANCBI review. The reference sequence was derived from CP000828.%0ASource bacteria from Marine Biotechnology Institute Culture%0ACollection%2C Marine Biotechnology Institute%2C 3-75-1 Heita%2C Kamaishi%2C%0AIwate 026-0001%2C Japan.%0ACOMPLETENESS: full length.;data_file_division=CON;date=10-JUN-2013;gi=158333233;keywords=;organism=Acaryochloris marina MBIC11017;references=location: %5B0:6503724%5D%0Aauthors: Swingley%2CW.D.%2C Chen%2CM.%2C Cheung%2CP.C.%2C Conrad%2CA.L.%2C Dejesa%2CL.C.%2C Hao%2CJ.%2C Honchak%2CB.M.%2C Karbach%2CL.E.%2C Kurdoglu%2CA.%2C Lahiri%2CS.%2C Mastrian%2CS.D.%2C Miyashita%2CH.%2C Page%2CL.%2C Ramakrishna%2CP.%2C Satoh%2CS.%2C Sattley%2CW.M.%2C Shimada%2CY.%2C Taylor%2CH.L.%2C Tomo%2CT.%2C Tsuchiya%2CT.%2C Wang%2CZ.T.%2C Raymond%2CJ.%2C Mimuro%2CM.%2C Blankenship%2CR.E. and Touchman%2CJ.W.%0Atitle: Niche adaptation and genome expansion in the chlorophyll d-producing cyanobacterium Acaryochloris marina%0Ajournal: Proc. Natl. Acad. Sci. U.S.A. 105 %286%29%2C 2005-2010 %282008%29%0Amedline id: %0Apubmed id: 18252824%0Acomment:,location: %5B0:6503724%5D%0Aauthors: %0Aconsrtm: NCBI Genome Project%0Atitle: Direct Submission%0Ajournal: Submitted %2817-OCT-2007%29 National Center for Biotechnology Information%2C NIH%2C Bethesda%2C MD 20894%2C USA%0Amedline id: %0Apubmed id: %0Acomment:,location: %5B0:6503724%5D%0Aauthors: Touchman%2CJ.W.%0Atitle: Direct Submission%0Ajournal: Submitted %2827-AUG-2007%29 Pharmaceutical Genomics Division%2C Translational Genomics Research Institute%2C 13208 E Shea Blvd%2C Scottsdale%2C AZ 85004%2C USA%0Amedline id: %0Apubmed id: %0Acomment:;sequence_version=1;source=Acaryochloris marina MBIC11017;taxonomy=Bacteria,Cyanobacteria,Oscillatoriophycideae,Chroococcales,Acaryochloris
NC_009925.1    feature  source  1   6503724 .   +   .   db_xref=taxon:329726;mol_type=genomic DNA;note=type strain of Acaryochloris marina;organism=Acaryochloris marina MBIC11017;strain=MBIC11017
NC_009925.1    feature  gene    931 1581    .   -   .   db_xref=GeneID:5685235;locus_tag=AM1_0001;note=conserved hypothetical protein;pseudo=
NC_009925.1    feature  gene    1627    2319    .   -   .   db_xref=GeneID:5678840;locus_tag=AM1_0003

问题在于第三和第四行：它从 .gbk 中获取完整的标头信息并将其作为一行放入，而应该跳过它。最后两行是正确的（输出文件的其余部分也是如此）。我尝试使用几个不同的 .gbk 文件，都产生相同的结果。

作为参考，这里是 .gbk 文件的开头：

LOCUS       NC_009925            6503724 bp    DNA     circular CON 10-JUN-2013
DEFINITION  Acaryochloris marina MBIC11017 chromosome, complete genome.
ACCESSION   NC_009925
VERSION     NC_009925.1  GI:158333233
DBLINK      Project: 58167
            BioProject: PRJNA58167
KEYWORDS    .
SOURCE      Acaryochloris marina MBIC11017
  ORGANISM  Acaryochloris marina MBIC11017
            Bacteria; Cyanobacteria; Oscillatoriophycideae; Chroococcales;
            Acaryochloris.
REFERENCE   1  (bases 1 to 6503724)
  AUTHORS   Swingley,W.D., Chen,M., Cheung,P.C., Conrad,A.L., Dejesa,L.C.,
            Hao,J., Honchak,B.M., Karbach,L.E., Kurdoglu,A., Lahiri,S.,
            Mastrian,S.D., Miyashita,H., Page,L., Ramakrishna,P., Satoh,S.,
            Sattley,W.M., Shimada,Y., Taylor,H.L., Tomo,T., Tsuchiya,T.,
            Wang,Z.T., Raymond,J., Mimuro,M., Blankenship,R.E. and
            Touchman,J.W.
  TITLE     Niche adaptation and genome expansion in the chlorophyll
            d-producing cyanobacterium Acaryochloris marina
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 105 (6), 2005-2010 (2008)
   PUBMED   18252824
REFERENCE   2  (bases 1 to 6503724)
  CONSRTM   NCBI Genome Project
  TITLE     Direct Submission
  JOURNAL   Submitted (17-OCT-2007) National Center for Biotechnology
            Information, NIH, Bethesda, MD 20894, USA
REFERENCE   3  (bases 1 to 6503724)
  AUTHORS   Touchman,J.W.
  TITLE     Direct Submission
  JOURNAL   Submitted (27-AUG-2007) Pharmaceutical Genomics Division,
            Translational Genomics Research Institute, 13208 E Shea Blvd,
            Scottsdale, AZ 85004, USA
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence was derived from CP000828.
            Source bacteria from Marine Biotechnology Institute Culture
            Collection, Marine Biotechnology Institute, 3-75-1 Heita, Kamaishi,
            Iwate 026-0001, Japan.
            COMPLETENESS: full length.
FEATURES             Location/Qualifiers
     source          1..6503724
                     /organism="Acaryochloris marina MBIC11017"
                     /mol_type="genomic DNA"
                     /strain="MBIC11017"
                     /db_xref="taxon:329726"
                     /note="type strain of Acaryochloris marina"
     gene            complement(931..1581)
                     /locus_tag="AM1_0001"
                     /note="conserved hypothetical protein"
                     /pseudo
                     /db_xref="GeneID:5685235"
     gene            complement(1627..2319)
                     /locus_tag="AM1_0003"
                     /db_xref="GeneID:5678840"
     CDS             complement(1627..2319)
                     /locus_tag="AM1_0003"
                     /codon_start=1
                     /transl_table=11
                     /product="NUDIX hydrolase"
                         /protein_id="YP_001514406.1"
                     /db_xref="GI:158333234"
                     /db_xref="GeneID:5678840"
                     /translation="MPYTYDYPRPGLTVDCVVFGLDEQIDLKVLLIQRQIPPFQHQWA
                 LPGGFVQMDESLEDAARRELREETGVQGIFLEQLYTFGDLGRDPRDRIISVAYYALIN
                 LIEYPLQASTDAEDAAWYSIENLPSLAFDHAQILKQAIRRLQGKVRYEPIGFELLPQK
                 FTLTQIQQLYETVLGHPLDKRNFRKKLLKMDLLIPLDEQQTGVAHRAARLYQFDQSKY
                 ELLKQQGFNFEV"

有谁知道我该如何解决这个问题？

我使用以下行来过滤掉前两行错误的行：

if "\tannotation\t" in line or "feature\tsource" in line:

这似乎适用于几个测试.gbk。但我仍然很好奇它为什么首先解析那些？

score 1 · Accepted Answer

答案在您链接的 wiki 页面中（http://biopython.org/wiki/GFF_Parsing#Writing_GFF3）。“GFF3Writer 采用 SeqRecord 对象的迭代器，并将每个 SeqFeature 写入 GFF3 行”。SeqRecord从文件中解析的对象.gbk包含此注释，因此它是由作者编写的。在实现中（https://github.com/chapmanb/bcbb/blob/master/gff/BCBio/GFF/GFFOutput.py）你可以看到它在哪里完成：

self._write_annotations(rec.annotations, rec.id, len(rec.seq), out_handle)

在那里您还可以看到为什么该source功能被通过。它与其他特征（基因、CDS）一样只是一个特征，没有单独处理。

我不知道为什么没有选项或参数（至少我没有找到）告诉作者跳过注释。SeqRecords在阅读with时，我不知道有任何参数可以跳过注释SeqIO.parse()。

为了解决您的问题，我SeqRecords单独访问解析，删除注释和源功能。这种方法的一个缺点是需要额外的内存（以及性能损失），因为我正在从初始生成器创建一个列表。最后，我只是将列表解析为 GFF。我不知道这种方法是否比您的方法好得多。

#!/usr/bin/env python
from BCBio import GFF
from Bio import SeqIO

def convert_to_GFF3():
    in_file = "input.gbk"
    out_file = "output.gff"
    in_handle = open(in_file)
    out_handle = open(out_file, "w")

    records = []
    for record in SeqIO.parse(in_handle, "genbank"):
        # delete annotations
        record.annotations = {}
        # loop through features to find the source
        for i in range(0,len(record.features)):
            # if found, delete it and stop (only expect one source)
            if(record.features[i].type == "source"):
                record.features.pop(i)
                break
        records.append(record)

    GFF.write(records, out_handle)

    in_handle.close()
    out_handle.close()

convert_to_GFF3()

python - 使用 BCBio 的 GFF 解析器解析不正确

1 回答 1

Related

Reference