perl - perl 通过动态编程在序列对齐中爆炸

Question

我正在使用动态编程（半全局对齐）比较大小为 5500 碱基的参考序列和大小为 3600 的查询序列，实际上我对复杂性和性能知之甚少，并且代码正在爆炸并给我错误“记不清”。知道它在较小的序列上正常工作，我的问题是：这种行为是正常的，或者我可能在代码中有另一个问题？如果正常，有任何提示可以解决这个问题吗？提前致谢。

sub semiGlobal {
    my ( $seq1, $seq2,$MATCH,$MISMATCH,$GAP ) = @_;
    # initialization: first row to 0 ;
    my @matrix;
    $matrix[0][0]{score}   = 0;
    $matrix[0][0]{pointer} = "none";
    for ( my $j = 1 ; $j <= length($seq1) ; $j++ ) {
        $matrix[0][$j]{score}   = 0;
        $matrix[0][$j]{pointer} = "none";
    }

    for ( my $i = 1 ; $i <= length($seq2) ; $i++ ) {
        $matrix[$i][0]{score}   = $GAP * $i;
        $matrix[$i][0]{pointer} = "up";
    }

    # fill
    my $max_i     = 0;
    my $max_j     = 0;
    my $max_score = 0;

    print "seq2: ".length($seq2);
    print "seq1: ".length($seq1);

    for ( my $i = 1 ; $i <= length($seq2) ; $i++ ) {
        for ( my $j = 1 ; $j <= length($seq1) ; $j++ ) {
            my ( $diagonal_score, $left_score, $up_score );
            # calculate match score
            my $letter1 = substr( $seq1, $j - 1, 1 );
            my $letter2 = substr( $seq2, $i - 1, 1 );
            if ( $letter1 eq $letter2 ) {
                $diagonal_score = $matrix[ $i - 1 ][ $j - 1 ]{score} +  $MATCH;
            }
            else {
                $diagonal_score = $matrix[ $i - 1 ][ $j - 1 ]{score} +  $MISMATCH;
            }

            # calculate gap scores
            $up_score   = $matrix[ $i - 1 ][$j]{score} +  $GAP;
            $left_score = $matrix[$i][ $j - 1 ]{score} +  $GAP;

            # choose best score
            if ( $diagonal_score >= $up_score ) {
                if ( $diagonal_score >= $left_score ) {
                    $matrix[$i][$j]{score}   = $diagonal_score;
                    $matrix[$i][$j]{pointer} = "diagonal";
                }
                else {
                    $matrix[$i][$j]{score}   = $left_score;
                    $matrix[$i][$j]{pointer} = "left";
                }
            }
            else {
                if ( $up_score >= $left_score ) {
                    $matrix[$i][$j]{score}   = $up_score;
                    $matrix[$i][$j]{pointer} = "up";
                }
                else {
                    $matrix[$i][$j]{score}   = $left_score;
                    $matrix[$i][$j]{pointer} = "left";
                }
            }

            # set maximum score
            if ( $matrix[$i][$j]{score} > $max_score ) {
                $max_i     = $i;
                $max_j     = $j;
                $max_score = $matrix[$i][$j]{score};
            }
        }
    }

    my $align1 = "";
    my $align2 = "";
    my $j = $max_j;
    my $i = $max_i;

    while (1) {
        if ( $matrix[$i][$j]{pointer} eq "none" ) {
            $stseq1 = $j;
            last;
        }

        if ( $matrix[$i][$j]{pointer} eq "diagonal" ) {
            $align1 .= substr( $seq1, $j - 1, 1 );
            $align2 .= substr( $seq2, $i - 1, 1 );
            $i--;
            $j--;
        }
        elsif ( $matrix[$i][$j]{pointer} eq "left" ) {
            $align1 .= substr( $seq1, $j - 1, 1 );
            $align2 .= "-";
            $j--;
        }
        elsif ( $matrix[$i][$j]{pointer} eq "up" ) {
            $align1 .= "-";
            $align2 .= substr( $seq2, $i - 1, 1 );
            $i--;
        }
    }

    $align1 = reverse $align1;
    $align2 = reverse $align2;
    return ( $align1, $align2, $stseq1 ,$max_j);
}

score 2 · Accepted Answer

可能解决问题的一种方法是将@matrix与文件联系起来。但是，这将大大减慢程序的速度。考虑一下：

sub semiGlobal {

    use Tie::Array::CSV; 

    tie my @matrix, 'Tie::Array::CSV', 'temp.txt'; # Don't forget to add your own error handler.


    my ( $seq1, $seq2,$MATCH,$MISMATCH,$GAP ) = @_;

    # initialization: first row to 0 ;

    $matrix[0][0] = '0 n';
    for ( my $j = 1 ; $j <= length($seq1) ; $j++ ) {
        $matrix[0][$j] = '0 n';
    }

    for ( my $i = 1 ; $i <= length($seq2) ; $i++ ) {

        my $score = $GAP * $i;
        $matrix[$i][0] = join ' ',$score,'u';
    }

    #print Dumper(\@matrix);

    # fill
    my $max_i     = 0;
    my $max_j     = 0;
    my $max_score = 0;

    print "seq2: ".length($seq2)."\n";
    print "seq1: ".length($seq1)."\n";

    for ( my $i = 1 ; $i <= length($seq2) ; $i++ ) {
        for ( my $j = 1 ; $j <= length($seq1) ; $j++ ) {
            my ( $diagonal_score, $left_score, $up_score );

            # calculate match score
            my $letter1 = substr( $seq1, $j - 1, 1 );
            my $letter2 = substr( $seq2, $i - 1, 1 );
            my $score = (split / /, $matrix[ $i - 1 ][ $j - 1 ])[0];
            if ( $letter1 eq $letter2 ) {
                $diagonal_score = $score +  $MATCH;
            }
            else {
                $diagonal_score = $score +  $MISMATCH;
            }

            # calculate gap scores
            $up_score   = (split / /,$matrix[ $i - 1 ][$j])[0] +  $GAP;
            $left_score = (split / /,$matrix[$i][ $j - 1 ])[0] +  $GAP;

            # choose best score
            if ( $diagonal_score >= $up_score ) {
                if ( $diagonal_score >= $left_score ) {
                    $matrix[$i][$j] = join ' ',$diagonal_score,'d';
                }
                else {
                    $matrix[$i][$j] = join ' ', $left_score, 'l';
                }
            }
            else {
                if ( $up_score >= $left_score ) {
                    $matrix[$i][$j] = join ' ', $up_score, 'u';
                }
                else {
                    $matrix[$i][$j] = join ' ', $left_score, 'l';
                }
            }

            # set maximum score
            if ( (split / /, $matrix[$i][$j])[0] > $max_score ) {
                $max_i     = $i;
                $max_j     = $j;
                $max_score = (split / /, $matrix[$i][$j])[0];

            }
        }
    }


    my $align1 = "";
    my $align2 = "";
    my $stseq1;

    my $j = $max_j;
    my $i = $max_i;

    while (1) {
        my $pointer = (split / /, $matrix[$i][$j])[1];
        if ( $pointer eq "n" ) {
            $stseq1 = $j;
            last;
        }

        if ( $pointer eq "d" ) {
            $align1 .= substr( $seq1, $j - 1, 1 );
            $align2 .= substr( $seq2, $i - 1, 1 );
            $i--;
            $j--;
        }
        elsif ( $pointer eq "l" ) {
            $align1 .= substr( $seq1, $j - 1, 1 );
            $align2 .= "-";
            $j--;
        }
        elsif ( $pointer eq "u" ) {
            $align1 .= "-";
            $align2 .= substr( $seq2, $i - 1, 1 );
            $i--;
        }
    }

    $align1 = reverse $align1;
    $align2 = reverse $align2;

    untie @matrix; # Don't forget to add your own error handler.

    unlink 'temp.txt'; # Don't forget to add your own error handler.

    return ( $align1, $align2, $stseq1 ,$max_j);
}

您仍然可以将原始子用于短序列，并切换到此子用于长序列。

score 1 · Accepted Answer

我认为@j_random_hacker 和@Ashalynd 在大多数 Perl 实现中使用该算法是正确的。您正在使用的数据类型将使用计算所需的更多内存。

所以这是“正常的”，因为你应该期望看到这种内存使用情况，因为你是如何在 perl 中编写这个算法的。您可能在使用大量内存的周围代码中遇到其他问题，但是此算法会因大序列而严重影响您的内存。

您可以按照@Ashalynd 的建议通过更改您正在使用的数据类型来解决一些内存问题。您可以尝试将保存分数和指针的哈希更改为数组并将字符串指针更改为整数值。这样的事情可能会给您带来一些好处，同时仍然保持可读性：

use strict;
use warnings;

# define constants for array positions and pointer values
# so the code is still readable.
# (If you have the "Readonly" CPAN module you may want to use it for constants
# instead although none of the downsides of the "constant" pragma apply in this code.)
use constant {
  SCORE => 0,
  POINTER => 1,

  DIAGONAL => 0,
  LEFT => 1,
  UP => 2,
  NONE => 3,
};

...

sub semiGlobal2 {
  my ( $seq1, $seq2,$MATCH,$MISMATCH,$GAP ) = @_;

  # initialization: first row to 0 ;
  my @matrix;

  # score and pointer are now stored in an array
  # using the defined constants as indices
  $matrix[0][0][SCORE]   = 0;

  # pointer value is now a constant integer
  $matrix[0][0][POINTER] = NONE;

  for ( my $j = 1 ; $j <= length($seq1) ; $j++ ) {
      $matrix[0][$j][SCORE]   = 0;
      $matrix[0][$j][POINTER] = NONE;
  }

  for ( my $i = 1 ; $i <= length($seq2) ; $i++ ) {

      $matrix[$i][0][SCORE]   = $GAP * $i;
      $matrix[$i][0][POINTER] = UP;
  }

... # continue to make the appropriate changes throughout the code

但是，当我对此进行测试时，尝试在 5500 字符的随机数据字符串中对齐 3600 字符的字符串时并没有获得巨大的好处。我将我的代码编程为在消耗超过 2GB 内存时中止。原始代码在 23 秒后中止，而使用常量和数组而不是哈希的代码在 32 秒后中止。

如果你真的想使用这个特定的算法，我会检查Algorithm::NeedlemanWunsch的性能。它看起来不是很成熟，但它可能已经解决了您的性能问题。否则，请考虑围绕 C 实现编写Inline或Perl XS包装器

perl - perl 通过动态编程在序列对齐中爆炸

2 回答 2

Related

Reference