python - How do I extract sequences from genomes based on homologs of a sequence?

Question

I have a sequence that has homologs in certain species, along with the scores of those homologs.

Here is a sample record from the GFF file:

4592637 Beutenbergia_cavernae_DSM_12333 TILL    70731   70780   .   0   .   clst_id=429;SubjectOrganism=Thermofilum_pendens_Hrk_5;SubjectScore=0.343373493975904;SubjectOrganism=Ignicoccus_hospitalis_KIN4_I;SubjectScore=0.323293172690763;SubjectOrganism=Burkholderia_pseudomallei_MSHR346;SubjectScore=0.343373493975904;SubjectOrganism=Burkholderia_mallei_SAVP1;SubjectScore=0.343373493975904;SubjectOrganism=Enterobacter_638;SubjectScore=0.343373493975904;SubjectOrganism=Rickettsia_felis_URRWXCal2;SubjectScore=0.343373493975904;SubjectOrganism=Gemmatimonas_aurantiaca_T_27;SubjectScore=0.343373493975904;SubjectOrganism=Streptomyces_coelicolor;SubjectScore=0.363453815261044;SubjectOrganism=Beutenbergia_cavernae_DSM_12333;SubjectScore=1;SubjectOrganism=Kocuria_rhizophila_DC2201;SubjectScore=0.343373493975904;SubjectOrganism=Rhodococcus_jostii_RHA1;SubjectScore=0.383534136546185;SubjectOrganism=Symbiobacterium_thermophilum_IAM14863;SubjectScore=0.363453815261044;

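==>4592637 => NAPP (Nucleic Acid Phylogenetic Profiling database) sequence ID (not a GenBank ID)

==>Beutenbergia_cavernae_DSM_12333 => name of the species the sequence comes from

==>TILL => sequence type

==>70731 .. 70780 => start and end of the sequence

==>clst_id=429 => cluster ID of this sequence

==>SubjectOrganism => name of a species in which the sequence has a homolog

==>SubjectScore => score of the homolog in that species (Blastn score)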
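Because SubjectOrganism and SubjectScore repeat within one record, the attribute column cannot be loaded into a plain dict without losing entries. A minimal sketch of a parser that keeps the pairs in order, assuming every SubjectScore directly follows its SubjectOrganism as in the record above (the function name and the returned keys are made up for illustration):

```python
def parse_napp_record(line):
    """Parse one GFF-like NAPP record into a dict with paired homolog scores."""
    # Nine whitespace-separated columns; the last one holds the attributes.
    (seq_id, organism, seq_type, start, end,
     _score, _strand, _phase, attrs) = line.split(None, 8)

    cluster_id = None
    homologs = []          # list of (organism, score) tuples, in file order
    pending = None         # organism waiting for its score
    for item in attrs.strip().split(";"):
        if not item:       # skip the empty piece after the trailing ';'
            continue
        key, _, value = item.partition("=")
        if key == "clst_id":
            cluster_id = value
        elif key == "SubjectOrganism":
            pending = value
        elif key == "SubjectScore" and pending is not None:
            homologs.append((pending, float(value)))
            pending = None

    return {
        "seq_id": seq_id,
        "organism": organism,
        "type": seq_type,
        "start": int(start),
        "end": int(end),
        "cluster_id": cluster_id,
        "homologs": homologs,
    }


if __name__ == "__main__":
    # Shortened version of the sample record, for demonstration only.
    example = ("4592637 Beutenbergia_cavernae_DSM_12333 TILL 70731 70780 . 0 . "
               "clst_id=429;SubjectOrganism=Thermofilum_pendens_Hrk_5;"
               "SubjectScore=0.343373493975904;")
    for name, score in parse_napp_record(example)["homologs"]:
        print(name, score)
```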

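For each SubjectOrganism, I want to extract the sequence that is similar to my sequence (4592637).

How can I use Python to extract sequences from the genomes in which the sequence has homologs?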

score 0 · Accepted Answer

You can simply treat the sequence as a string and slice it as needed. For example:

>>> s="abcdefghij"
>>> len(s)
10
>>> s[5:10]
'fghij'
>>>

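Treat s as your complete genome string and replace 5:10 with your coordinates. GFF positions are 1-based and include both ends, whereas Python slices are 0-based and exclude the end, so 70731..70780 becomes s[70730:70780]. Hope that helps!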
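If you slice many records, it may be safer to wrap the coordinate conversion in a small helper. This is only a sketch: extract_region is a hypothetical name, and it assumes the whole genome is already in memory as a single string and that the feature is on the forward strand.

```python
def extract_region(genome, start, end):
    """Return genome[start..end] using 1-based, inclusive GFF coordinates."""
    if start < 1 or end < start or end > len(genome):
        raise ValueError(
            "invalid region %d..%d for a sequence of length %d"
            % (start, end, len(genome))
        )
    # Shift only the start: the exclusive slice end already equals the
    # inclusive 1-based end position.
    return genome[start - 1:end]


if __name__ == "__main__":
    print(extract_region("abcdefghij", 6, 10))
```

The region 70731..70780 would then be extract_region(genome, 70731, 70780), which returns 50 characters.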

score 0 · Accepted Answer

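Judging from your other question, I think you have already figured this out. If so, StackOverflow encourages you to answer your own questions: post the answer and accept it! Anyway:

First, fetch the query sequence, replacing id with the ID of your organism. I found this one by searching NCBI for "Beutenbergia cavernae DSM 12333":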

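from Bio import Entrez

# NCBI asks every Entrez client to identify itself; use your own address.
Entrez.email = "you@example.org"

# Fetch only the region 70731..70780 of the genome record as FASTA text.
seq = Entrez.efetch(db="nuccore",
                    id="229564415",
                    rettype="fasta",
                    retmode="text",
                    seq_start=70731,
                    seq_stop=70780).readlines()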

Now seq contains something like

['>gb|CP001618.1|:70731-70780 Beutenbergia cavernae DSM 12333,'
 'complete genome\n',
 'GCCCGAGTTCCCCGAACCGTGCCGAGGTAGTACTCCACGGGCGAGGGAGT\n',
 '\n']
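If you only need the bases rather than the FASTA text, the header and blank lines can be dropped. A small sketch, assuming the list holds a single FASTA record as in the output above (fasta_lines_to_sequence is a made-up helper name, not part of Biopython):

```python
def fasta_lines_to_sequence(lines):
    """Join the sequence lines of a single FASTA record, skipping the header."""
    return "".join(
        line.strip()
        for line in lines
        if line.strip() and not line.startswith(">")
    )


if __name__ == "__main__":
    # Same content as the example output above.
    seq = [
        ">gb|CP001618.1|:70731-70780 Beutenbergia cavernae DSM 12333,"
        "complete genome\n",
        "GCCCGAGTTCCCCGAACCGTGCCGAGGTAGTACTCCACGGGCGAGGGAGT\n",
        "\n",
    ]
    bases = fasta_lines_to_sequence(seq)
    print(bases, len(bases))
```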

Launch a qblast with this sequence, as shown in the other question, but replace the hard-coded entrez_query with the organism string from the GFF file:

from Bio.Blast.NCBIWWW import qblast

# Submit the FASTA text to NCBI BLAST, restricted by an Entrez query.
results = qblast("blastn",
                 "nr",
                 "".join(seq),
                 entrez_query='Thermofilum_pendens_Hrk_5')

Be careful: if you send thousands of queries, NCBI will almost certainly ban you from the queue.
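One way to stay polite is to run the searches one at a time with a pause in between. The sketch below is built on assumptions: the 10-second delay is a guess rather than an NCBI rule, and turning underscores into spaces and appending the [Organism] field tag is my assumption about how the NAPP names map to NCBI organism names, so check a few by hand. The search argument is any function you supply that takes an Entrez query string, for example a wrapper around qblast.

```python
import time


def organism_to_entrez_query(name):
    """Turn a GFF organism name into an Entrez organism filter (assumed mapping)."""
    return name.replace("_", " ") + "[Organism]"


def run_throttled(organisms, search, delay_seconds=10.0, sleep=time.sleep):
    """Call search(query) once per organism, pausing between calls."""
    results = {}
    for index, name in enumerate(organisms):
        if index > 0:
            sleep(delay_seconds)   # wait before every call except the first
        results[name] = search(organism_to_entrez_query(name))
    return results


if __name__ == "__main__":
    # Stand-in search function so the example runs without network access.
    def fake_search(query):
        return "would search with: " + query

    names = ["Thermofilum_pendens_Hrk_5", "Streptomyces_coelicolor"]
    for name, result in run_throttled(names, fake_search, delay_seconds=0).items():
        print(name, "->", result)
```

With Biopython, search could be something like lambda q: qblast("blastn", "nr", "".join(seq), entrez_query=q).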
