我有这两个长度相等的字符串,我需要比较它们。我想找到重叠基数(。)和内部间隙(*)。下面是示例:
------ACTAAAAATACAAAAA--TTAGCCAGGCGTGGTGGCAC
-----TACTAAAAATACAAAAAAATTAGCCAGGTGTGGTGG---
................**.................
重叠数 = 33。内部间隙数 = 2。
我没有问题找到重叠的数量。但我很难找到内部差距。以下是我拥有的当前代码。它非常慢。原则上,我需要计算数百万个这样的对。
#!/usr/bin/perl -w
my $s1 = "------ACTAAAAATACAAAAA--TTAGCCAGGCGTGGTGGCAC";
my $s2 = "-----TACTAAAAATACAAAAAAATTAGCCAGGTGTGGTGG---";
print "$s1\n";
print "$s2\n";
my %base = ("A" => 1, "T" => 1, "C" => 1, "G" => 1);
my $ovlp_basecount = 0;
my $internal_gap = 0;
foreach my $si ( 0 .. length($s1) ) {
my $base1 = substr($s1,$si,1);
my $base2 = substr($s2,$si,1);
# Overlap
if ( $base{$base1} && $base{$base2} ) {
$ovlp_basecount++;
}
# Not sure how to compute internal gap
}
print "TOTAL OVERLAP BASE = $ovlp_basecount\n";
print "TOTAL Internal Gap \?\n";
请建议我如何有效地找到内部差距和重叠。