perl - Find the longest repeated string based on input in Perl (using a subroutine)

Question

So I'm trying to find the longest repeat of a given pattern. So far my code looks like this and is fairly close, but it doesn't quite give the desired result:

use warnings;
use strict;    

my $DNA;       
$DNA = "ATATCCCACTGTAGATAGATAGAATATATATATATCCCAGCT" ;
print "$DNA\n" ;
print "The longest AT repeat is " . longestRepeat($DNA, "AT") . "\n" ;
print "The longest TAGA repeat is " . longestRepeat($DNA, "TAGA") . "\n" ;
print "The longest C repeat is " . longestRepeat($DNA, "C") . "\n" ;

sub longestRepeat{

  my $someSequence = shift(@_);  # shift off the first  argument from the list
  my $whatBP       = shift(@_);  # shift off the second argument from the list
  my $match = 0;



        if ($whatBP eq "AT"){
            while ($someSequence =~ m/$whatBP/g) {
            $match = $match + 1;
            }
            return $match;

        }
        if ($whatBP eq "TAGA"){
            while ($someSequence =~ m/$whatBP/g) {
            $match = $match + 1;
            }
            return $match;
        }

        if ($whatBP eq "C"){
            while ($someSequence =~ m/$whatBP/g) {
            $match = $match + 1;
            }
            return $match;
        }
}

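All it does right now is count the TOTAL number of AT, TAGA, and C occurrences in the sequence. Instead of giving me just the length of the longest repeat, it adds them all up and gives me the total. I think something is wrong in the while loop, but I'm not sure. Any help would be greatly appreciated.

P.S. It should also display the longest repeat as a string rather than as a number (perhaps using substr here).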

score 2 · Accepted Answer

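Your longestRepeat function doesn't need to check which of the three cases it is handling. In general, when you find yourself writing exactly the same instructions several times, that's a hint that you can factor out the repetition and simplify your program. Consider the following, which I have cleaned up for functionality and commented for purposes of illustration: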

#!/usr/bin/env perl
use warnings;
use strict;    

# no need to declare and define separately; this works fine
# also no need for space before semicolon
my $DNA = "ATATCCCACTGTAGATAGATAGAATATATATATATCCCAGCT";
print "$DNA\n";
print "The longest AT repeat is " . longestRepeat($DNA, "AT") . "\n";
print "The longest TAGA repeat is " . longestRepeat($DNA, "TAGA") . "\n";
print "The longest C repeat is " . longestRepeat($DNA, "C") . "\n";

sub longestRepeat {

  # note that, within a function, @_ is the default argument to shift();
  # hence its absence in the next two lines. (in practice, you're more 
  # likely to see 'shift' in this context without even parentheses, much
  # less the full 'shift(@_)'; be prepared to run into it.)
  my $sequence = shift(); # take the first argument
  my $kmer = shift(); # take the second argument

  # these state variables we'll use to keep track of what we're doing here;
  # $longest_match, a string, will eventually be returned.
  my $longest_matchlen = 0;
  my $longest_match = '';

  # for each match in $sequence of one or more consecutive $kmer repeats...
  while ($sequence =~ m@((?:$kmer)+)@g) {

    # ...get the length of the whole run, stored in $1 by the outer
    # capture group; the inner (?:...) group only groups, so the '+'
    # quantifier can repeat $kmer without overwriting $1 on each
    # repetition (see `man perlre' for more)...
    my $this_matchlen = length($1);

    # ...and if this match is longer than the longest yet found...
    if ($this_matchlen > $longest_matchlen) {

      # ...store this match's length in $longest_matchlen...
      $longest_matchlen = $this_matchlen;

      # ...and store the match itself in $longest_match.
      $longest_match = $1;

    }; # end of the 'if' statement

  }; # end of the 'while' loop

  # at this point, the longest match we found is in $longest_match; if
  # we found no matches, then $longest_match still contains the empty
  # string we assigned up there before the while loop started, which is
  # the correct result in a case where $kmer never appears in $sequence.
  return $longest_match;
};
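
When adapting a loop like the one above, one detail is worth checking: in Perl, a capture group that is itself under a quantifier, as in /($kmer)+/, leaves only the last repetition in $1, whereas wrapping a quantified non-capturing group in an outer capture, /((?:$kmer)+)/, captures the whole run. The following is a minimal sketch of the difference, not part of the original answer; the names @last_only and @whole_runs are made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $dna  = "ATATCCCACTGTAGATAGATAGAATATATATATATCCCAGCT";
my $kmer = "AT";

# Quantified capture group: $1 is overwritten on every repetition,
# so it ends up holding only the final copy of $kmer in each run.
my @last_only;
while ($dna =~ /($kmer)+/g) {
    push @last_only, $1;
}

# Outer capture around a quantified non-capturing group:
# $1 holds the entire run of consecutive repeats.
my @whole_runs;
while ($dna =~ /((?:$kmer)+)/g) {
    push @whole_runs, $1;
}

print "last repetition only: @last_only\n";
print "whole runs:           @whole_runs\n";
```

If $kmer could ever contain regex metacharacters, wrapping it as \Q$kmer\E inside the pattern would make it match literally; for plain DNA letters this makes no difference.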

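You're studying bioinformatics, aren't you? I have some experience teaching Perl to bioinformaticians, and I've found that programming skill and talent in the field are very widely distributed, with an unfortunate hump toward the left side of the graph. That is a polite way of saying that, as a professional programmer, most of the bioinformatics Perl code I've seen has ranged from not very good to quite bad.

I mention this not to insult, but only to back up my very strong recommendation that you add some computer science courses to whatever curriculum you're currently studying. The more you know about the general concepts and habits of thought involved in stating algorithms precisely, the better prepared you'll be for the demands of your field; better prepared than most, in my experience. I'm not a bioinformatician myself, but from working with bioinformaticians it seems to me that a strong programming background may be more useful to a bioinformatician than a strong biology background.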

score 0 · Accepted Answer

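(Pasting this from a duplicate of this question)

Based on the name of your subroutine, I assume you want to find the longest repeated sequence in your sequence.

If so, how about the following: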

sub longest_repeat {

    my ( $sequence, $what ) = @_;

    my @matches = $sequence =~ /((?:$what)+)/g ;  # Store all matches

    my $longest;
    foreach my $match ( @matches ) {  # Could also avoid temp variable :
                                      # for my $match ( $sequence =~ /((?:$what)+)/g )

        $longest //= $match ;         # Initialize
                                      #  (could also do `$longest = $match
                                      #                    unless defined $longest`)

        $longest = $match if length( $longest ) < length( $match );
    }

    return $longest;  # Note: this is undef if there were no matches
}
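
Since a subroutine written this way returns undef when the pattern never occurs, a caller that prints the result under `use warnings` may want a fallback. The sketch below is one possible way to call it, assuming Perl 5.10 or later for the `//` operator; the loop-free condensed body, the "GGG" test pattern, and the '(none)' placeholder are illustrative choices, not part of the original answer.

```perl
use strict;
use warnings;

# Same approach as the subroutine above, condensed; returns undef when
# $what never occurs in $sequence.
sub longest_repeat {
    my ( $sequence, $what ) = @_;
    my $longest;
    for my $match ( $sequence =~ /((?:$what)+)/g ) {
        $longest = $match
            if !defined $longest || length($longest) < length($match);
    }
    return $longest;
}

my $dna = "ATATCCCACTGTAGATAGATAGAATATATATATATCCCAGCT";

for my $kmer ( "AT", "TAGA", "C", "GGG" ) {
    # '//' (defined-or) substitutes a placeholder for undef, which
    # avoids an "uninitialized value" warning when printing.
    my $repeat = longest_repeat( $dna, $kmer ) // '(none)';
    print "The longest $kmer repeat is $repeat\n";
}
```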

If you can follow it, the following version does essentially the same thing with a Schwartzian transform:

sub longest_repeat {

    my ( $sequence, $what ) = @_;                          # Example:
                                                           # --------------------
    my ( $longest ) = map { $_->[0] }                      # 'ATAT' ...
                        sort { $b->[1] <=> $a->[1] }       # ['ATAT',4], ['AT',2]
                          map { [ $_, length($_) ] }       # ['AT',2], ['ATAT',4]
                            $sequence =~ /((?:$what)+)/g ; # ... 'AT', 'ATAT'

    return $longest ;
}

Some may argue that using sort is wasteful because it is O(n log n) rather than O(n), but it gives you some variety to choose from.
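
For comparison, a single pass over the matches can pick the longest one without sorting. The following is a rough sketch using `reduce` from the core List::Util module; the subroutine name longest_repeat_linear is hypothetical and chosen only to distinguish it from the versions above.

```perl
use strict;
use warnings;
use List::Util qw(reduce);

sub longest_repeat_linear {
    my ( $sequence, $what ) = @_;

    # reduce walks the list of matches once, keeping the longer of each
    # pair; on ties the earlier match wins. For an empty list (no
    # matches) it returns undef.
    return reduce { length($a) >= length($b) ? $a : $b }
           $sequence =~ /((?:$what)+)/g;
}

my $dna = "ATATCCCACTGTAGATAGATAGAATATATATATATCCCAGCT";
print "The longest TAGA repeat is ", longest_repeat_linear( $dna, "TAGA" ), "\n";
```

For sequences with only a handful of matches the difference from the sort-based version is unlikely to matter; this mainly illustrates the O(n) alternative mentioned above.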
