这是生物信息学中一个相当普遍的问题,因此您应该使用像BioPython这样内置此功能的生物信息学库。
首先,您使用您的序列创建一个多 FASTA 文件:
import os
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_dna
sequences = ['ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag',
'ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag',
'ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag',
'ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg',
'ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag']
my_records = [SeqRecord(Seq(sequence, generic_dna),
id='Species_{}'.format(letter), description='Species_{}'.format(letter))
for sequence, letter in zip(sequences, 'ABCDE')]
root_dir = r"C:\Users\BioGeek\Documents\temp"
filename = 'my_sequences'
fasta_path = os.path.join(root_dir, '{}.fasta'.format(filename))
SeqIO.write(my_records, fasta_path, "fasta")
这将创建如下C:\Users\BioGeek\Documents\temp\my_sequences.fasta
所示的文件:
>Species_A
ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag
>Species_B
ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag
>Species_C
ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag
>Species_D
ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg
>Species_E
ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag
接下来,使用命令行工具ClustalW
进行多序列比对:
from Bio.Align.Applications import ClustalwCommandline
clustalw_exe = r"C:\path\to\clustalw-2.1\clustalw2.exe"
assert os.path.isfile(clustalw_exe), "Clustal W executable missing"
clustalw_cline = ClustalwCommandline(clustalw_exe, infile=fasta_path)
stdout, stderr = clustalw_cline()
print stdout
这打印:
CLUSTAL 2.1 Multiple Sequence Alignments
Sequence format is Pearson
Sequence 1: Species_A 50 bp
Sequence 2: Species_B 50 bp
Sequence 3: Species_C 50 bp
Sequence 4: Species_D 50 bp
Sequence 5: Species_E 50 bp
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 90
Sequences (1:3) Aligned. Score: 94
Sequences (1:4) Aligned. Score: 88
Sequences (1:5) Aligned. Score: 84
Sequences (2:3) Aligned. Score: 90
Sequences (2:4) Aligned. Score: 84
Sequences (2:5) Aligned. Score: 78
Sequences (3:4) Aligned. Score: 94
Sequences (3:5) Aligned. Score: 82
Sequences (4:5) Aligned. Score: 82
Guide tree file created: [C:\Users\BioGeek\Documents\temp\my_sequences.dnd]
There are 4 groups
Start of Multiple Alignment
Aligning...
Group 1: Sequences: 2 Score:912
Group 2: Sequences: 2 Score:921
Group 3: Sequences: 4 Score:865
Group 4: Sequences: 5 Score:855
Alignment Score 2975
CLUSTAL-Alignment file created [C:\Users\BioGeek\Documents\temp\my_sequences.aln]
该my_sequences.dnd
文件ClustalW
创建,是一个标准的Newick 树文件,Bio.Phylo
可以解析这些:
from Bio import Phylo
newick_path = os.path.join(root_dir, '{}.dnd'.format(filename))
tree = Phylo.read(newick_path, "newick")
Phylo.draw_ascii(tree)
哪个打印:
____________ Species_A
____|
| |_____________________________________ Species_B
|
_| ____ Species_C
|_________|
| |_________________________ Species_D
|
|__________________________________________________________________ Species_E
或者,如果您已安装matplotlib
或pylab
安装,您可以使用以下功能创建图形draw
:
tree.rooted = True
Phylo.draw(tree, branch_labels=lambda c: c.branch_length)
产生:
这张树状图清楚地说明了您观察到的情况:物种 A 和 B 是同一生物体的变体,并且都从根的共同进化枝分化而来。物种 C 和 D 也是如此,它们都从根的另一个共同进化枝中分化出来。最后,物种 E 与主根不同,因为它与物种 A 到 D 无关。