1

我有这两个长度相等的字符串,我需要比较它们。我想找到重叠基数(。)和内部间隙(*)。下面是示例:

------ACTAAAAATACAAAAA--TTAGCCAGGCGTGGTGGCAC
-----TACTAAAAATACAAAAAAATTAGCCAGGTGTGGTGG---
      ................**.................

重叠数 = 33。内部间隙数 = 2。

我没有问题找到重叠的数量。但我很难找到内部差距。以下是我拥有的当前代码。它非常慢。原则上,我需要计算数百万个这样的对。

#!/usr/bin/perl -w
my $s1 = "------ACTAAAAATACAAAAA--TTAGCCAGGCGTGGTGGCAC";
my $s2 = "-----TACTAAAAATACAAAAAAATTAGCCAGGTGTGGTGG---";

print "$s1\n";
print "$s2\n";


my %base = ("A" => 1, "T" => 1, "C" => 1, "G" => 1);

my $ovlp_basecount = 0;
my $internal_gap = 0;

foreach my $si ( 0 .. length($s1)  ) {


    my $base1 = substr($s1,$si,1);
    my $base2 = substr($s2,$si,1);


    # Overlap
    if ( $base{$base1} && $base{$base2} ) {
        $ovlp_basecount++;
    }

    # Not sure how to compute internal gap

}


print "TOTAL OVERLAP BASE = $ovlp_basecount\n";
print "TOTAL Internal Gap \?\n";

请建议我如何有效地找到内部差距和重叠。

4

2 回答 2

3

您可以对字符串使用按位或来查找一个字符串中与另一个字符串中的空白区域重叠的区域。此过程还具有通过将非重叠字符转换为小写来显示重叠的效果,因此也可以非常简单地找到重叠:

#!/usr/bin/perl

use strict;
use warnings;

my $s1 = "------ACTAAAAATACAAAAA--TTAGCCAGGCGTGGTGGCAC";
my $s2 = "-----TACTAAAAATACAAAAAAATTAGCCAGGTGTGGTGG---";

$s1 =~ tr/-/\x20/;
$s2 =~ tr/-/\x20/;
my $or = $s1 | $s2;
(my $gap) = $or =~ m/^.*[ACTG]([actg]+)[ACTG].*$/;
(my $overlap = $or) =~ s/[^A-Z]//g;

print "s1:      '$s1'\n";
print "s2:      '$s2'\n";
print "OR:      '$or'\n";
printf "Gap:     '%s' (%d)\n", $gap,     length $gap;
printf "Overlap  '%s' (%d)\n", $overlap, length $overlap;

印刷:

s1:      '      ACTAAAAATACAAAAA  TTAGCCAGGCGTGGTGGCAC'
s2:      '     TACTAAAAATACAAAAAAATTAGCCAGGTGTGGTGG   '
OR:      '     tACTAAAAATACAAAAAaaTTAGCCAGGWGTGGTGGcac'
Gap:     'aa' (2)
Overlap  'ACTAAAAATACAAAAATTAGCCAGGWGTGGTGG' (33)

有关字符串按位运算的更多信息:

http://teaching.idallen.com/cst8214/08w/notes/bit_operations.txt

于 2010-12-04T14:43:49.327 回答
1

假设间隙从不重叠,您可以使用正则表达式解决此问题。这是您的答案s1

echo '------ACTAAAAATACAAAAA--TTAGCCAGGCGTGGTGGCAC' | perl -ne '$s = 0; foreach(/[GTAC](-+)[GTAC]/) { $s += length($1); } print "$s\n";'
2
于 2010-12-04T14:53:36.923 回答