python - Retrieving details of an unknown sequence via BLAST using Biopython

Question

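This is my first time using Biopython. I have sequence data from unknown organisms, and I am trying to use BLAST to determine which organism each sequence most likely came from. I wrote the following function to do that: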

from Bio import Entrez, SeqIO
from Bio.Blast import NCBIWWW, NCBIXML

def find_organism(file):
    """
    Receives a fasta file with a single seq, and uses BLAST to find
    from which organism it was taken.
    """
    # get seq from fasta file
    seqRecord = SeqIO.read(file, "fasta")
    # run BLAST
    blastResult = NCBIWWW.qblast("blastn", "nt", seqRecord.seq)
    # get first hit
    blastRecord = NCBIXML.read(blastResult)
    firstHit = blastRecord.alignments[0]
    # get hit's gi number
    title = firstHit.title
    gi = title.split("|")[1]
    # search NCBI for the gi number
    ncbiResult = Entrez.efetch(db="nucleotide", id=gi, rettype="gb", retmode="text")
    ncbiResultSeqRec = SeqIO.read(ncbiResult, "gb")
    # get organism
    annotatDict = ncbiResultSeqRec.annotations
    return annotatDict['organism']

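It works fine, but it takes about 2 minutes to retrieve the organism for each sequence, which seems slow to me. I'm wondering whether I can do better. I know I could create a local copy of the NCBI database to improve performance, and I may well do that. However, I suspect that querying BLAST first, then taking the id and using it to query Entrez, is not the way to go. Do you have any other suggestions for improvement?
Thanks!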

score 1 · Accepted Answer

You can get the organism this way:

[...]
blastResult = NCBIWWW.qblast("blastn", "nt", seqRecord.seq)
blastRecord = NCBIXML.read(blastResult)

first_organism = blastRecord.descriptions[0]
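
Note that `descriptions[0]` is a description object rather than a bare organism name, and indexing it fails when BLAST returns no hits. Below is a minimal sketch of pulling out the description text. It assumes the hit title may start with a pipe-delimited identifier such as `gi|...|`. The helper name `best_hit_description` and the stand-in record are hypothetical, used only so the example runs without calling NCBI.

```python
from types import SimpleNamespace


def best_hit_description(blast_record):
    """Return the description text of the top hit, or None if there are no hits."""
    if not blast_record.descriptions:
        return None
    title = blast_record.descriptions[0].title
    # Assumption: a title may begin with an identifier token like
    # "gi|123|gb|AB000001.1|" followed by a space and the definition line.
    head, _, rest = title.partition(" ")
    if "|" in head and rest:
        return rest.strip()
    return title.strip()


# Hypothetical stand-in for a parsed BLAST record (not real NCBI output).
fake_record = SimpleNamespace(
    descriptions=[SimpleNamespace(title="gi|123|gb|AB000001.1| Homo sapiens mRNA")]
)
print(best_hit_description(fake_record))
```

The returned text is the hit's definition line, which usually begins with the organism name but also carries the rest of the description, so it may need further trimming depending on what you want to report.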

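Reading the organism from the BLAST result at least saves the efetch query. In any case, "blastn" may still take too long, and if you plan to query NCBI on a large scale, you will get banned.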
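
To reduce the risk of being blocked when submitting many queries, one option is to space the requests out. The sketch below is a small, generic throttle rather than anything provided by Biopython. The class name `Throttle` is hypothetical, and the 10-second interval in the usage note is an assumed placeholder, so check NCBI's current usage guidelines for the actual limit.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable for testing
        self._sleep = sleep      # injectable for testing
        self._last = None        # time of the previous call, if any

    def wait(self):
        """Block until at least min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A typical use would be to create `throttle = Throttle(10.0)` once and call `throttle.wait()` immediately before each `NCBIWWW.qblast(...)` call in a loop over sequences. For genuinely large batches, a local BLAST database, as mentioned in the question, is likely the better route.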
