我刚开始使用 perl,我有一个问题。我有 PHYLIP 文件,我需要将其转换为 FASTA。我开始写脚本。首先,我删除了行中的空格,现在我需要对齐每行中应该是 60 个氨基酸的所有行,并且序列标识符应该打印在新行中。也许有人可以给我一些建议?
问问题
2841 次
2 回答
6
BioPerl Bio::AlignIO模块可能会有所帮助。它支持PHYLIP序列格式:
phylip2fasta.pl
use strict;
use warnings;
use Bio::AlignIO;
# http://doc.bioperl.org/bioperl-live/Bio/AlignIO.html
# http://doc.bioperl.org/bioperl-live/Bio/AlignIO/phylip.html
# http://www.bioperl.org/wiki/PHYLIP_multiple_alignment_format
my ($inputfilename) = @ARGV;
die "must provide phylip file as 1st parameter...\n" unless $inputfilename;
my $in = Bio::AlignIO->new(-file => $inputfilename ,
-format => 'phylip',
-interleaved => 1);
my $out = Bio::AlignIO->new(-fh => \*STDOUT ,
-format => 'fasta');
while ( my $aln = $in->next_aln() ) {
$out->write_aln($aln);
}
$ perl phylip2fasta.pl test.phylip
>Turkey/1-42
AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT
>Salmo_gair/1-42
AAGCCTTGGCAGTGCAGGGTGAGCCGTGGCCGGGCACGGTAT
>H._Sapiens/1-42
ACCGGTTGGCCGTTCAGGGTACAGGTTGGCCGTTCAGGGTAA
>Chimp/1-42
AAACCCTTGCCGTTACGCTTAAACCGAGGCCGGGACACTCAT
>Gorilla/1-42
AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA
test.phylip http://evolution.genetics.washington.edu/phylip/doc/sequence.html
5 42
Turkey AAGCTNGGGC ATTTCAGGGT
Salmo gairAAGCCTTGGC AGTGCAGGGT
H. SapiensACCGGTTGGC CGTTCAGGGT
Chimp AAACCCTTGC CGTTACGCTT
Gorilla AAACCCTTGC CGGTACGCTT
GAGCCCGGGC AATACAGGGT AT
GAGCCGTGGC CGGGCACGGT AT
ACAGGTTGGC CGTTCAGGGT AA
AAACCGAGGC CGGGACACTC AT
AAACCATTGC CGGTACGCTT AA
于 2013-03-14T02:22:20.893 回答
1
如果您可以访问 BioPerl,我建议使用它(请参阅其他答案)。如果没有,这是我几年前在旧硬件作业中使用的快速脚本。它可能对你有用。
需要注意的一件事:它将整个 fasta 序列打印在一行上,因此您可以在最后编辑 print 语句以每行打印 70 AA。
#!/usr/bin/perl
use warnings;
use strict;
<DATA> =~ /(\d+)/; # first number is number of species
my $num_species = $1;
my $i = 0;
my @species;
my @acids;
# first $num_species rows have the species name
for ($i = 0; $i < $num_species; $i++) {
my @line = split /\s+/, <DATA>;
chomp @line;
push @species, shift (@line);
push @acids, join ("", @line);
}
# Get the rest of the AAs
$i = 0;
while (<DATA>) {
chomp;
$_ =~ s/\r//g; #remove \r
next if !$_;
$_ =~ s/\s+//g; #remove spaces
$acids[$i] .= $_;
$i = ++$i % $num_species;
}
# Print them
for ($i = 0; $i < $num_species; $i++) {
print "> ", $species[$i], "\n";
# uncomment next line if you want to remove the gaps ("-")
$acids[$i] =~ s/-//g;
print $acids[$i], "\n\n";
}
# Simple PHYLIP Amino Acid file
__DATA__
10 234
Cow MAYPMQLGFQ DATSPIMEEL LHFHDHTLMI VFLISSLVLY IISLMLTTKL
Carp MAHPTQLGFK DAAMPVMEEL LHFHDHALMI VLLISTLVLY IITAMVSTKL
Chicken MANHSQLGFQ DASSPIMEEL VEFHDHALMV ALAICSLVLY LLTLMLMEKL
Human MAHAAQVGLQ DATSPIMEEL ITFHDHALMI IFLICFLVLY ALFLTLTTKL
Loach MAHPTQLGFQ DAASPVMEEL LHFHDHALMI VFLISALVLY VIITTVSTKL
Mouse MAYPFQLGLQ DATSPIMEEL MNFHDHTLMI VFLISSLVLY IISLMLTTKL
Rat MAYPFQLGLQ DATSPIMEEL TNFHDHTLMI VFLISSLVLY IISLMLTTKL
Seal MAYPLQMGLQ DATSPIMEEL LHFHDHTLMI VFLISSLVLY IISLMLTTKL
Whale MAYPFQLGFQ DAASPIMEEL LHFHDHTLMI VFLISSLVLY IITLMLTTKL
Frog MAHPSQLGFQ DAASPIMEEL LHFHDHTLMA VFLISTLVLY IITIMMTTKL
THTSTMDAQE VETIWTILPA IILILIALPS LRILYMMDEI NNPSLTVKTM
TNKYILDSQE IEIVWTILPA VILVLIALPS LRILYLMDEI NDPHLTIKAM
S-SNTVDAQE VELIWTILPA IVLVLLALPS LQILYMMDEI DEPDLTLKAI
TNTNISDAQE METVWTILPA IILVLIALPS LRILYMTDEV NDPSLTIKSI
TNMYILDSQE IEIVWTVLPA LILILIALPS LRILYLMDEI NDPHLTIKAM
THTSTMDAQE VETIWTILPA VILIMIALPS LRILYMMDEI NNPVLTVKTM
THTSTMDAQE VETIWTILPA VILILIALPS LRILYMMDEI NNPVLTVKTM
THTSTMDAQE VETVWTILPA IILILIALPS LRILYMMDEI NNPSLTVKTM
THTSTMDAQE VETVWTILPA IILILIALPS LRILYMMDEV NNPSLTVKTM
TNTNLMDAQE IEMVWTIMPA ISLIMIALPS LRILYLMDEV NDPHLTIKAI
GHQWYWSYEY TDYEDLSFDS YMIPTSELKP GELRLLEVDN RVVLPMEMTI
GHQWYWSYEY TDYENLGFDS YMVPTQDLAP GQFRLLETDH RMVVPMESPV
GHQWYWTYEY TDFKDLSFDS YMTPTTDLPL GHFRLLEVDH RIVIPMESPI
GHQWYWTYEY TDYGGLIFNS YMLPPLFLEP GDLRLLDVDN RVVLPIEAPI
GHQWYWSYEY TDYENLSFDS YMIPTQDLTP GQFRLLETDH RMVVPMESPI
GHQWYWSYEY TDYEDLCFDS YMIPTNDLKP GELRLLEVDN RVVLPMELPI
GHQWYWSYEY TDYEDLCFDS YMIPTNDLKP GELRLLEVDN RVVLPMELPI
GHQWYWSYEY TDYEDLNFDS YMIPTQELKP GELRLLEVDN RVVLPMEMTI
GHQWYWSYEY TDYEDLSFDS YMIPTSDLKP GELRLLEVDN RVVLPMEMTI
GHQWYWSYEY TNYEDLSFDS YMIPTNDLTP GQFRLLEVDN RMVVPMESPT
RMLVSSEDVL HSWAVPSLGL KTDAIPGRLN QTTLMSSRPG LYYGQCSEIC
RVLVSAEDVL HSWAVPSLGV KMDAVPGRLN QAAFIASRPG VFYGQCSEIC
RVIITADDVL HSWAVPALGV KTDAIPGRLN QTSFITTRPG VFYGQCSEIC
RMMITSQDVL HSWAVPTLGL KTDAIPGRLN QTTFTATRPG VYYGQCSEIC
RILVSAEDVL HSWALPAMGV KMDAVPGRLN QTAFIASRPG VFYGQCSEIC
RMLISSEDVL HSWAVPSLGL KTDAIPGRLN QATVTSNRPG LFYGQCSEIC
RMLISSEDVL HSWAIPSLGL KTDAIPGRLN QATVTSNRPG LFYGQCSEIC
RMLISSEDVL HSWAVPSLGL KTDAIPGRLN QTTLMTMRPG LYYGQCSEIC
RMLVSSEDVL HSWAVPSLGL KTDAIPGRLN QTTLMSTRPG LFYGQCSEIC
RLLVTAEDVL HSWAVPSLGV KTDAIPGRLH QTSFIATRPG VFYGQCSEIC
GSNHSFMPIV LELVPLKYFE KWSASML--- ----
GANHSFMPIV VEAVPLEHFE NWSSLMLEDA SLGS
GANHSYMPIV VESTPLKHFE AWSSL----- -LSS
GANHSFMPIV LELIPLKIFE M-------GP VFTL
GANHSFMPIV VEAVPLSHFE NWSTLMLKDA SLGS
GSNHSFMPIV LEMVPLKYFE NWSASMI--- ----
GSNHSFMPIV LEMVPLKYFE NWSASMI--- ----
GSNHSFMPIV LELVPLSHFE KWSTSML--- ----
GSNHSFMPIV LELVPLEVFE KWSVSML--- ----
GANHSFMPIV VEAVPLTDFE NWSSSML-EA SL--
输出:
> Cow
MAYPMQLGFQDATSPIMEELLHFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQEVETIWTILPAIILILIALPSLRILYMMDEINNPSLTVKTMGHQWYWSYEYTDYEDLSFDSYMIPTSELKPGELRLLEVDNRVVLPMEMTIRMLVSSEDVLHSWAVPSLGLKTDAIPGRLNQTTLMSSRPGLYYGQCSEICGSNHSFMPIVLELVPLKYFEKWSASML
> Carp
MAHPTQLGFKDAAMPVMEELLHFHDHALMIVLLISTLVLYIITAMVSTKLTNKYILDSQEIEIVWTILPAVILVLIALPSLRILYLMDEINDPHLTIKAMGHQWYWSYEYTDYENLGFDSYMVPTQDLAPGQFRLLETDHRMVVPMESPVRVLVSAEDVLHSWAVPSLGVKMDAVPGRLNQAAFIASRPGVFYGQCSEICGANHSFMPIVVEAVPLEHFENWSSLMLEDASLGS
> Chicken
MANHSQLGFQDASSPIMEELVEFHDHALMVALAICSLVLYLLTLMLMEKLSSNTVDAQEVELIWTILPAIVLVLLALPSLQILYMMDEIDEPDLTLKAIGHQWYWTYEYTDFKDLSFDSYMTPTTDLPLGHFRLLEVDHRIVIPMESPIRVIITADDVLHSWAVPALGVKTDAIPGRLNQTSFITTRPGVFYGQCSEICGANHSYMPIVVESTPLKHFEAWSSLLSS
> Human
MAHAAQVGLQDATSPIMEELITFHDHALMIIFLICFLVLYALFLTLTTKLTNTNISDAQEMETVWTILPAIILVLIALPSLRILYMTDEVNDPSLTIKSIGHQWYWTYEYTDYGGLIFNSYMLPPLFLEPGDLRLLDVDNRVVLPIEAPIRMMITSQDVLHSWAVPTLGLKTDAIPGRLNQTTFTATRPGVYYGQCSEICGANHSFMPIVLELIPLKIFEMGPVFTL
> Loach
MAHPTQLGFQDAASPVMEELLHFHDHALMIVFLISALVLYVIITTVSTKLTNMYILDSQEIEIVWTVLPALILILIALPSLRILYLMDEINDPHLTIKAMGHQWYWSYEYTDYENLSFDSYMIPTQDLTPGQFRLLETDHRMVVPMESPIRILVSAEDVLHSWALPAMGVKMDAVPGRLNQTAFIASRPGVFYGQCSEICGANHSFMPIVVEAVPLSHFENWSTLMLKDASLGS
> Mouse
MAYPFQLGLQDATSPIMEELMNFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQEVETIWTILPAVILIMIALPSLRILYMMDEINNPVLTVKTMGHQWYWSYEYTDYEDLCFDSYMIPTNDLKPGELRLLEVDNRVVLPMELPIRMLISSEDVLHSWAVPSLGLKTDAIPGRLNQATVTSNRPGLFYGQCSEICGSNHSFMPIVLEMVPLKYFENWSASMI
> Rat
MAYPFQLGLQDATSPIMEELTNFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQEVETIWTILPAVILILIALPSLRILYMMDEINNPVLTVKTMGHQWYWSYEYTDYEDLCFDSYMIPTNDLKPGELRLLEVDNRVVLPMELPIRMLISSEDVLHSWAIPSLGLKTDAIPGRLNQATVTSNRPGLFYGQCSEICGSNHSFMPIVLEMVPLKYFENWSASMI
> Seal
MAYPLQMGLQDATSPIMEELLHFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQEVETVWTILPAIILILIALPSLRILYMMDEINNPSLTVKTMGHQWYWSYEYTDYEDLNFDSYMIPTQELKPGELRLLEVDNRVVLPMEMTIRMLISSEDVLHSWAVPSLGLKTDAIPGRLNQTTLMTMRPGLYYGQCSEICGSNHSFMPIVLELVPLSHFEKWSTSML
> Whale
MAYPFQLGFQDAASPIMEELLHFHDHTLMIVFLISSLVLYIITLMLTTKLTHTSTMDAQEVETVWTILPAIILILIALPSLRILYMMDEVNNPSLTVKTMGHQWYWSYEYTDYEDLSFDSYMIPTSDLKPGELRLLEVDNRVVLPMEMTIRMLVSSEDVLHSWAVPSLGLKTDAIPGRLNQTTLMSTRPGLFYGQCSEICGSNHSFMPIVLELVPLEVFEKWSVSML
> Frog
MAHPSQLGFQDAASPIMEELLHFHDHTLMAVFLISTLVLYIITIMMTTKLTNTNLMDAQEIEMVWTIMPAISLIMIALPSLRILYLMDEVNDPHLTIKAIGHQWYWSYEYTNYEDLSFDSYMIPTNDLTPGQFRLLEVDNRMVVPMESPTRLLVTAEDVLHSWAVPSLGVKTDAIPGRLHQTSFIATRPGVFYGQCSEICGANHSFMPIVVEAVPLTDFENWSSSMLEASL
于 2013-03-14T15:29:30.877 回答