我对 Perl 和 Bioperl 还很陌生,我正在尝试编写一个脚本来识别相同序列的实例。为了实现这一点,我设想了一个包含 2 个 infile 的脚本,第一个是 fasta 格式的多重对齐,第二个是附件文件,它将 fasta id 链接到其他相关信息。我的方法是通读 Bio::SeqIO 的多重比对,并将文件内容放在哈希中,其中序列是键,id 是值,或者在序列共享的情况下,id 数组是值.
我认为它应该看起来像这样:
"AATTTGTTGTTGTACC" => ('Seq1', 'Seq13'),
"TTTCTCTTTCCCAAAG" => 'Seq2',
目前,我相信我被卡住了,因为在序列共享的情况下尝试将第二个 id 推送到数组上时出错(即上面示例中的“Seq13”)。
这是我正在使用的测试多重对齐:
>Seq1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>Seq2
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>Seq13
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
在我到目前为止编写的代码下方:
#!/usr/bin/perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;
use Data::Dumper;
my $seqs = shift @ARGV or die "please provide a multiple alignment file and an accesory information file: $!\n";
my $info = shift @ARGV or die "please provide a multiple alignment file and an accesory information file: $!\n";
#open(INFO, '<', $info);
my $inseq = Bio::SeqIO->new(
-file => $seqs,
-format => "fasta",
);
my %hts;
while (my $seq = $inseq->next_seq) {
# print $seq->seq(), "\t", $seq->id, "\n";
if (defined $hts{$seq->seq()}) {
print "Sequence already in hash:\t$seq->id\n";
push @{$hts{$seq->seq()}}, ${$seq->id};
}
else {
$hts{$seq->seq()} = $seq->id;
}
print Dumper \%hts
}
因此,我希望得到一些帮助
1) 我收到一个我不太理解的错误,但我认为这与 push 语句有关 --> 在 ht_sharing 中使用“strict refs”时不能使用字符串 ("Seq1") 作为 ARRAY ref。 pl 第 24 行,第 3 行。
2)当 if 循环外的 print 语句处于活动状态时,它会打印我认为应该的 id(即 Seq1),但在 if 循环内的 print 语句中,相同的调用 $seq->id 会产生一个引用(即 Bio ::Seq=HASH(0x19e7210)->id)。为什么是这样?我不明白为什么打印 $seq->id 在同一个while循环中有不同的输出。
如果有人能提供澄清,我将非常感激,当然,因为有人对最佳实践或解决问题的更好方法的评论仍然很陌生,也很棒。
干杯,安娜