xml - How do I fetch a specific protein sequence with Entrez.efetch?

Question

I am trying to use Biopython's Entrez.efetch() function to fetch a protein sequence from NCBI by its GI (GenInfo Identifier) number:

proteina = Entrez.efetch(db="protein", id=gi, rettype="gb", retmode="xml")

I then read the data with:

proteinaXML = Entrez.read(proteina)

I can print the result, but I don't know how to get the protein sequence on its own.

After printing the result, I can pick out the protein sequence by hand, or I can index into the XML tree with an expression like this:

proteinaXML[0]["GBSeq_feature-table"][2]["GBFeature_quals"][6]['GBQualifier_value']

However, the XML tree can differ depending on the GI of the submitted protein, which makes this process hard to automate robustly.

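My question: is it possible to retrieve only the protein sequence rather than the whole XML tree? Alternatively, given that the structure of the XML can vary from protein to protein, how can I extract the protein sequence from it?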

Thanks

score 2 · Accepted Answer

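Good point: the database entries in XML do vary between proteins submitted by different authors.

I worked out an algorithm that "hunts" for the protein sequence in the XML tree: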

from Bio import Entrez

gi = '1293613'                  # example GI number
Entrez.email = "you@email.com"  # always tell NCBI who you are
protina = Entrez.efetch(db="protein", id=gi, retmode="xml")  # fetch the XML
protinaXML = Entrez.read(protina)[0]

seqs = []  # candidate protein sequences

def seqScan(xml):
    """Recursively walk the parsed XML and collect candidate protein sequences."""
    if isinstance(xml, list):      # Bio.Entrez list elements subclass list
        for ele in xml:
            seqScan(ele)
    elif isinstance(xml, dict):    # dictionary elements subclass dict
        for value in xml.values():
            seqScan(value)
    elif isinstance(xml, str):     # string elements subclass str
        # Keyword filter (a heuristic, not a guarantee):
        # 1) most full-length proteins start with M (methionine)
        # 2) the length check drops short strings such as initials starting with M
        if xml.startswith('M') and len(xml) > 10:
            seqs.append(xml)

seqScan(protinaXML)  # run the recursive sequence collection
print(seqs)          # print the candidates

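Note: in rare cases (depending on the "keyword filter"), it may, amusingly, pick up unwanted strings, such as an author name that starts with "M" and whose abbreviated form is longer than 10 characters.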
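One way to cut down on such false positives is to also require that the string contains only amino-acid letters. The following is a minimal sketch, not part of the original answer: the helper name `looks_like_protein`, the `min_len` parameter, and the accepted alphabet (the 20 standard residues plus X) are assumptions made for illustration. It deliberately does not require a leading "M", and it ignores case.

```python
import re

# Assumption: accept only the 20 standard amino-acid letters plus X (unknown).
# Extended codes such as B, J, O, U and Z are rejected by this sketch.
_RESIDUES = re.compile(r"[ACDEFGHIKLMNPQRSTVWXY]+", re.IGNORECASE)

def looks_like_protein(text, min_len=10):
    """Heuristic: True if text is longer than min_len and has only residue letters."""
    s = str(text).strip()
    return len(s) > min_len and _RESIDUES.fullmatch(s) is not None

# Hypothetical strings, for illustration only
print(looks_like_protein("MKTAYIAKQRQISFVKSHFSRQ"))   # residue letters only
print(looks_like_protein("Mitchell,J.B. and Smith"))  # punctuation and spaces
```

In seqScan, this check could stand in for the startswith('M') test. It is still a heuristic: a long string made up only of these letters would pass even if it is not a sequence.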


Hope this helps!
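As an alternative to scanning the whole tree: when the record is fetched as GenBank XML (rettype="gb", retmode="xml", as in the question), the parsed record normally exposes the sequence under the top-level "GBSeq_sequence" key. The sketch below assumes that format. The function names `sequence_from_gbseq` and `fetch_protein_sequence` are hypothetical, and upper-casing the result is a choice made here, not something the record requires.

```python
def sequence_from_gbseq(record):
    """Return the sequence of one parsed GBSeq record, or None if the key is absent."""
    seq = record.get("GBSeq_sequence")
    return str(seq).upper() if seq is not None else None

def fetch_protein_sequence(gi, email):
    """Fetch one protein record as GenBank XML and return its sequence (needs network)."""
    from Bio import Entrez  # imported here so the helper above has no dependency
    Entrez.email = email    # always tell NCBI who you are
    handle = Entrez.efetch(db="protein", id=gi, rettype="gb", retmode="xml")
    try:
        records = Entrez.read(handle)
    finally:
        handle.close()
    return sequence_from_gbseq(records[0])

# Example call (not run here because it contacts NCBI):
# print(fetch_protein_sequence("1293613", "you@email.com"))
```

Another commonly used route is to request rettype="fasta" with retmode="text" and parse the handle with SeqIO.read(handle, "fasta"), which avoids the XML tree entirely.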
