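This is my first time using Biopython. I have sequence data from unknown organisms, and I am trying to use BLAST to work out which organism each sequence most likely came from. I wrote the following function to do that: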
from Bio import Entrez, SeqIO
from Bio.Blast import NCBIWWW, NCBIXML

# NCBI asks for a contact address with Entrez requests; this is a placeholder.
Entrez.email = "your.name@example.com"


def find_organism(file):
    """
    Receives a fasta file with a single seq, and uses BLAST to find
    from which organism it was taken.
    """
    # get seq from fasta file
    seqRecord = SeqIO.read(file, "fasta")
    # run BLAST
    blastResult = NCBIWWW.qblast("blastn", "nt", seqRecord.seq)
    # get first hit
    blastRecord = NCBIXML.read(blastResult)
    firstHit = blastRecord.alignments[0]
    # get hit's gi number
    title = firstHit.title
    gi = title.split("|")[1]
    # search NCBI for the gi number
    ncbiResult = Entrez.efetch(db="nucleotide", id=gi, rettype="gb", retmode="text")
    ncbiResultSeqRec = SeqIO.read(ncbiResult, "gb")
    # get organism
    annotatDict = ncbiResultSeqRec.annotations
    return annotatDict["organism"]
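It works fine, but it takes about 2 minutes to retrieve the organism for each sequence, which seems very slow to me. I am wondering whether I could do this better. I know I could set up a local copy of the NCBI database to improve performance, and I may do that. However, I suspect that querying BLAST first, then taking an id from the result and using it to query Entrez, is not the right way to go. Do you have any other suggestions for improvement?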
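For reference, the title-parsing step in the function above can be isolated as a small helper. This is only a sketch: `split_hit_title` is a hypothetical name, the sample title is made up for illustration, and it assumes a single-entry title of the form `gi|<number>|<db>|<accession>| <description>`, which may not hold for every BLAST result.

```python
def split_hit_title(title):
    """Split a BLAST hit title into (identifier, description).

    Assumes a pipe-delimited, single-entry title such as
    'gi|<number>|<db>|<accession>| <description>'. That layout is an
    assumption for this sketch and may differ between BLAST results.
    """
    fields = title.split("|")
    if len(fields) < 3:
        # No pipe-delimited ID block: treat the first word as the identifier.
        identifier, _, description = title.partition(" ")
        return identifier, description.strip()
    # Same field the original function uses: title.split("|")[1]
    identifier = fields[1]
    # Whatever follows the last pipe is taken as the description text.
    description = fields[-1].strip()
    return identifier, description


sample = "gi|12345|gb|AB123456.1| Homo sapiens mRNA"  # made-up title
identifier, description = split_hit_title(sample)
print(identifier, description)
```

If the description text turns out to be enough to identify the organism, the second request to Entrez might not be needed at all, though I have not verified that this is reliable.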
Thanks!