python - Fetching protein FASTA sequences by keyword with Python

Question

I want to use Python 2.7 to collect protein FASTA sequences from Entrez. I am looking for any protein whose name contains the keywords "terminase" and "large". So far I have this code:

from Bio import Entrez
Entrez.email = "example@example.org"


searchResultHandle = Entrez.esearch(db="protein", term="terminase large", retmax=1000)
searchResult = Entrez.read(searchResultHandle)
ids = searchResult["IdList"]

handle = Entrez.efetch(db="protein", id=ids, rettype="fasta", retmode="text")
record = handle.read()

# Write the records inside a with-block so the file is flushed and closed
with open('myfasta.fasta', 'w') as out_handle:
    out_handle.write(record.rstrip('\n'))

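However, this returns terminases from all kinds of organisms, whereas I only need terminases from bacteriophages (specifically viruses [taxid 10239] whose host is bacteria). I have managed to get the nuccore accession IDs from NCBI for the viruses I am interested in, but I don't know how to combine the two pieces of information. The ID file looks like this: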

NC_001341
NC_001447
NC_028834
NC_023556
...

Do I need to go through the gb file for every ID and search it for the protein I want?

score 1 · Accepted Answer

Found what I was looking for. In this call:

searchResultHandle = Entrez.esearch(db="protein", term="terminase large", retmax=1000)

I changed the search term (and raised retmax) as follows:

searchterm = "(terminase large subunit AND viruses[Organism]) AND Caudovirales AND refseq[Filter]"
searchResultHandle = Entrez.esearch(db="protein", term=searchterm, retmax=6000)

This narrows the search down to the viruses I need. Admittedly, it filters by taxonomic group rather than by host, but that is good enough for my work.
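For reference, here is a minimal sketch of how the refined query might slot into the complete script. The helper names (`build_query`, `batches`, `fetch_fasta`) and the batch size of 200 are illustrative assumptions, not part of the original answer; only `Entrez.esearch`, `Entrez.read` and `Entrez.efetch` come from the code above. Fetching in batches is just one way to keep each request small.

```python
def build_query(keywords, organism="viruses", taxon="Caudovirales", refseq_only=True):
    """Assemble an Entrez search term shaped like the one in the answer."""
    query = "({0} AND {1}[Organism]) AND {2}".format(keywords, organism, taxon)
    if refseq_only:
        query += " AND refseq[Filter]"
    return query


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_fasta(query, email, out_path, retmax=6000, batch_size=200):
    """Search the protein database and write the matching FASTA records to out_path."""
    # Biopython is imported here so the two helpers above work without it.
    from Bio import Entrez

    Entrez.email = email
    search_handle = Entrez.esearch(db="protein", term=query, retmax=retmax)
    ids = Entrez.read(search_handle)["IdList"]
    search_handle.close()

    with open(out_path, "w") as out_handle:
        # batch_size=200 is an assumed value, not something the answer specifies
        for chunk in batches(ids, batch_size):
            fetch_handle = Entrez.efetch(db="protein", id=",".join(chunk),
                                         rettype="fasta", retmode="text")
            out_handle.write(fetch_handle.read())
            fetch_handle.close()
    return len(ids)
```

A call such as `fetch_fasta(build_query("terminase large subunit"), "example@example.org", "myfasta.fasta")` would then write the results to the same file name used in the question.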

Thanks to @Llopis for the additional help.
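On the question's remaining point about the list of nuccore accessions: one possible alternative, sketched below, is to request the protein translations of each genome's CDS features and keep only the records whose header mentions the keywords. This is not the approach the answer took. It assumes that `rettype="fasta_cds_aa"` is available for the nuccore database and that the protein name appears in the FASTA header; `filter_fasta` and `fetch_terminases` are hypothetical helper names.

```python
def filter_fasta(fasta_text, keywords):
    """Keep FASTA records whose header line contains every keyword (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    kept = []
    keep = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line.lower()
            keep = all(k in header for k in wanted)
        if keep and line.strip():
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


def fetch_terminases(accessions, email, keywords=("terminase", "large")):
    """Fetch CDS protein translations for nuccore accessions and filter them by keyword."""
    # Biopython is imported here so filter_fasta can be used on its own.
    from Bio import Entrez

    Entrez.email = email
    # Assumption: nuccore supports rettype="fasta_cds_aa" (CDS protein FASTA)
    handle = Entrez.efetch(db="nuccore", id=",".join(accessions),
                           rettype="fasta_cds_aa", retmode="text")
    text = handle.read()
    handle.close()
    return filter_fasta(text, keywords)
```

Because the accessions are chosen up front, this would restrict the results to the phages of interest rather than to a whole taxonomic group, at the cost of depending on how each genome's CDS features are annotated.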
