perl - How to match two sequences using arrays in Perl

Question

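When looping over two arrays, I'm confused about how to advance the position in one array while holding it fixed in the other. For example:

Array 1: A T C G T C G A G C G
Array 2: A C G T C C T G T C G

The A in the first array matches the A in the second array, so we move on to the next element. But since T does not match the C at the second index, I want the program to compare that T with the next element of array 2 (G), and so on, until it finds a matching T.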

my ($array1ref, $array2ref) = @_;

my @array1 = @$array1ref;
my @array2= @$array2ref;
my $count = 0; 
foreach my $element (@array1) {
 foreach my $element2 (@array2) {
 if ($element eq $element2) {
 $count++;
  }else { ???????????


}

score 3 · Accepted Answer

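You can use a while loop to search for matches. If you find a match, advance in both arrays. If you don't, advance only the second array. At the end, you can print the characters left unmatched in the first array: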

# [1, 2, 3] is a reference to an anonymous array (1, 2, 3)
# qw(1 2 3) is quoted-word shorthand for ('1', '2', '3')
my $arr1 = [qw(A T C G T C G A G C G)];
my $arr2 = [qw(A C G T C C T G T C G)];

my $idx1 = 0;
my $idx2 = 0;

# Find matched characters
# @$arr_ref in numeric context is the size of the array referenced by $arr_ref
while ($idx1 < @$arr1 && $idx2 < @$arr2) {
    my $char1 = $arr1->[$idx1];
    my $char2 = $arr2->[$idx2];
    if ($char1 eq $char2) {
        # Matched character, advance arr1 and arr2
        printf("%s %s  -- arr1[%d] matches arr2[%d]\n", $char1, $char2, $idx1, $idx2);
        ++$idx1;
        ++$idx2;
    } else {
        # Unmatched character, advance arr2
        printf(". %s  -- skipping arr2[%d]\n", $char2, $idx2);
        ++$idx2;
    }
}

# Remaining unmatched characters
while ($idx1 < @$arr1) {
    my $char1 = $arr1->[$idx1];
    printf("%s .  -- arr1[%d] is beyond the end of arr2\n", $char1, $idx1);
    $idx1++;
}

The script prints:

A A  -- arr1[0] matches arr2[0]
. C  -- skipping arr2[1]
. G  -- skipping arr2[2]
T T  -- arr1[1] matches arr2[3]
C C  -- arr1[2] matches arr2[4]
. C  -- skipping arr2[5]
. T  -- skipping arr2[6]
G G  -- arr1[3] matches arr2[7]
T T  -- arr1[4] matches arr2[8]
C C  -- arr1[5] matches arr2[9]
G G  -- arr1[6] matches arr2[10]
A .  -- arr1[7] is beyond the end of arr2
G .  -- arr1[8] is beyond the end of arr2
C .  -- arr1[9] is beyond the end of arr2
G .  -- arr1[10] is beyond the end of arr2

score 2 · Accepted Answer

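Nested loops make no sense here. You don't want to loop over anything more than once.

You didn't specify what you want to happen after resynchronizing, so you'll want to start with the following and adapt it to your needs.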

my ($array1, $array2) = @_;

my $idx1 = 0;
my $idx2 = 0;
while ($idx1 < @$array1 && $idx2 < @$array2) {
   if ($array1->[$idx1] eq $array2->[$idx2]) {
      ++$idx1;
      ++$idx2;
   } else {
      ++$idx2;
   }
}

...

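As is, the snippet above leaves $idx1 at the last index it couldn't (eventually) resynchronize. If you instead want to stop as soon as the first resynchronization happens, you want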

my ($array1, $array2) = @_;

my $idx1 = 0;
my $idx2 = 0;
my $mismatched = 0;
while ($idx1 < @$array1 && $idx2 < @$array2) {
   if ($array1->[$idx1] eq $array2->[$idx2]) {
      last if $mismatched;          
      ++$idx1;
      ++$idx2;
   } else {
      ++$mismatched;
      ++$idx2;
   }
}

...
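
To see where the indexes land with the sequences from the question, the second snippet can be wrapped in a sub. This is only a sketch: the name first_resync and the choice to return the index pair are assumptions, not part of the answer above.

```perl
use strict;
use warnings;

# Hypothetical wrapper: returns the index pair at which the loop stops.
sub first_resync {
    my ($array1, $array2) = @_;
    my ($idx1, $idx2, $mismatched) = (0, 0, 0);
    while ($idx1 < @$array1 && $idx2 < @$array2) {
        if ($array1->[$idx1] eq $array2->[$idx2]) {
            last if $mismatched;   # first match after at least one skip
            ++$idx1;
            ++$idx2;
        } else {
            ++$mismatched;
            ++$idx2;
        }
    }
    return ($idx1, $idx2);
}

my ($i, $j) = first_resync([qw(A T C G T C G A G C G)],
                           [qw(A C G T C C T G T C G)]);
print "stopped at arr1[$i], arr2[$j]\n";   # stopped at arr1[1], arr2[3]
```

If the arrays run out before any resynchronization happens, the returned pair is simply wherever the loop ended, so the caller would have to check $j against the length of the second array.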

score 0 · Accepted Answer

If you are guaranteed that array2 is always as long as or longer than array1, it seems you can do this easily with grep. Something like this:

sub align
{
    my ($array1, $array2) = @_;
    my $index = 0;

    return grep
           {
               $array1->[$index] eq $array2->[$_] ? ++$index : 0
           } 0 .. scalar( @$array2 ) - 1;
}

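Basically, the grep says: "give me the list of increasing indexes into array2 that match successive elements of array1."

If you run the code above with this test code, you can see that it returns the expected alignment array: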

my @array1 = qw(A T C G T C G A G C G);
my @array2 = qw(A C G T C C T G T C G);

say join ",", align \@array1, \@array2;

This outputs the expected mapping: 0,3,4,7,8,9,10. The list means that @array1[0 .. 6] corresponds to @array2[0,3,4,7,8,9,10].

(Note: you need use Modern::Perl or similar, for example use feature 'say', to be able to use say.)
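
The correspondence can be checked with array slices. A small sketch, where the variable @mapping (an assumed name) holds the list printed above:

```perl
use strict;
use warnings;

my @array1  = qw(A T C G T C G A G C G);
my @array2  = qw(A C G T C C T G T C G);
my @mapping = (0, 3, 4, 7, 8, 9, 10);   # the list printed by align()

# Matched prefix of @array1 versus the mapped elements of @array2
my $left  = join "", @array1[0 .. $#mapping];
my $right = join "", @array2[@mapping];
print "$left\n$right\n";                # both lines read ATCGTCG
```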

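Now, you haven't really said what output you need from the operation. I assumed you wanted this mapping array. If you only need the number of elements of @array2 that were skipped while aligning it with @array1, you can still use the grep above, but instead of returning the list, just return scalar(@$array2) - $index at the end.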
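
A minimal sketch of that counting variant follows. The name count_skipped is hypothetical, and the extra bounds check on $index is an addition of this sketch, so the sub stays warning-free even if array1 is exhausted before array2.

```perl
use strict;
use warnings;

# Hypothetical variant of align(): counts the elements of @$array2 that
# were skipped instead of returning the matching indexes.
sub count_skipped {
    my ($array1, $array2) = @_;
    my $index = 0;

    # grep in scalar context returns the number of matches; $index ends
    # up with the same value because it is bumped once per match.
    my $matched = grep {
        $index < @$array1 && $array1->[$index] eq $array2->[$_] ? ++$index : 0
    } 0 .. $#$array2;

    return scalar(@$array2) - $index;
}

my @seq1 = qw(A T C G T C G A G C G);
my @seq2 = qw(A C G T C C T G T C G);
print count_skipped(\@seq1, \@seq2), "\n";   # prints 4
```

For the sequences in the question, the four skipped elements are the ones at indexes 1, 2, 5 and 6 of the second array.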

score 0 · Accepted Answer

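A foreach loop won't cut it: we either want to loop while both arrays still have elements available, or iterate over the indexes, which we can then increment as we like: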

EL1: while (defined(my $el1 = shift @array1) and @array2) {
  EL2: while(defined(my $el2 = shift @array2)) {
    ++$count and next EL1 if $el1 eq $el2; # break out of inner loop
  }
}

or

my $j = 0; # index of @array2
for (my $i = 0; $i <= $#array1; $i++) {
  $j++ until $j > $#array2 or $array1[$i] eq $array2[$j];
  last if $j > $#array2;
  $count++;
}

or any combination.

score 0 · Accepted Answer

The conditions are too complicated for a for loop; use a while loop instead:

my ($array1ref, $array2ref) = @_;

my @array1 = @$array1ref;
my @array2= @$array2ref;
my $count = 0;
my ($index, $index2) = (0,0);
# loop while both indexes are inside their arrays
while ($index <= $#array1 && $index2 <= $#array2) {
    if ($array1[$index] eq $array2[$index2]) {
        $count++;
        $index++;
        $index2++;
    } else {
        # advance index2 until we find a match or run off the end of array2
        $index2++ until $index2 > $#array2 or $array1[$index] eq $array2[$index2];
    }
}

score 0 · Accepted Answer

Here is one possibility. It uses indexes to walk both lists.

my @array1 = qw(A T C G T C G A G C G);
my @array2 = qw(A C G T C C T G T C G);

my $count = 0;
my $idx1 = 0;
my $idx2 = 0;

while(($idx1 < scalar @array1) && ($idx2 < scalar @array2)) {
    if($array1[$idx1] eq $array2[$idx2]) {
        print "Match of $array1[$idx1] array1 \@ $idx1 and array2 \@ $idx2\n";
        $idx1++;
        $idx2++;
        $count++;
    } else {
        $idx2++;
    }
}

print "Count = $count\n";

score 0 · Accepted Answer

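As you may know, your problem is called sequence alignment. There are well-established algorithms that do this efficiently, and one such module, Algorithm::NeedlemanWunsch, is available on CPAN. Here is how you could apply it to your problem.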

#!/usr/bin/perl

use Algorithm::NeedlemanWunsch;

my $arr1 = [qw(A T C G T C G A G C G)];
my $arr2 = [qw(A C G T C C T G T C G)];

my $matcher = Algorithm::NeedlemanWunsch->new(sub {@_==0 ? -1 : $_[0] eq $_[1] ? 1 : -2});

my (@align1, @align2);
my $result = $matcher->align($arr1, $arr2,
  {
   align   => sub {unshift @align1, $arr1->[shift]; unshift @align2, $arr2->[shift]},
   shift_a => sub {unshift @align1, $arr1->[shift]; unshift @align2,            '.'},
   shift_b => sub {unshift @align1,            '.'; unshift @align2, $arr2->[shift]},
  });

print join("", @align1), "\n";
print join("", @align2), "\n";

This prints an optimal solution according to the costs we specified in the constructor:

ATCGT.C.GAGCG
A.CGTCCTG.TCG

This is quite different from the approach in the original question, but I think it's worth knowing about.
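
The scoring callback passed to the constructor is dense, so here it is written out as a named sub. This is a sketch: the name score_sub is hypothetical, and reading the no-argument call as the gap penalty is my understanding of the module's documentation, not something stated above.

```perl
use strict;
use warnings;

# Hypothetical named version of the anonymous sub passed to new().
sub score_sub {
    return -1 unless @_;          # called without arguments: gap penalty (assumed)
    my ($x, $y) = @_;
    return $x eq $y ? 1 : -2;     # +1 for a match, -2 for a mismatch
}

print score_sub('A', 'A'), "\n";  # 1
print score_sub('A', 'T'), "\n";  # -2
print score_sub(), "\n";          # -1
```

With these costs, a mismatch scores the same as two gaps, so several alignments can tie for the best total score.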
