I have several files like the sample below, and I am trying to do the numerical analysis described in the image.

[Image: numerical analysis method]

 >File Sample
 attttttttttttttacgatgccgggggatgcggggaaatttccctctctctctcttcttctcgcgcgcg
 aaaaaaaaaaaaaaagcgcggcggcgcggasasasasasasaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

I have to take each substring of size 2, map it to 33 values (one for each of 33 different properties), and then sum those values over a window of size 5.

    my %temp = (
                 aCount => {
                        aa => 2
                 },
                 cCount => {
                        aa => 0
                 }
    );
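
As a concrete illustration of the calculation described above, here is a minimal sketch that scores a single window for one property. The values in %a_count and the sample string are made up for the example, and it assumes the pairs do not overlap.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical scores for one property (not the real table).
my %a_count = (aa => 2, at => 1, ta => 1, tt => 0);

my $window = 5;
my $seq    = 'aaattatttt';                   # 10 bases = 5 pairs = one window
my @pairs  = $seq =~ /../g;                  # aa at ta tt tt
my @values = map { $a_count{$_} } @pairs;    # 2 1 1 0 0
my $total  = sum(@values[0 .. $window - 1]); # 2 + 1 + 1 + 0 + 0

print "$total\n";
```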

My current implementation is as follows:

   while (<FILE>) {
     my $line = $_;
     chomp $line;

     while ($line=~/(.{2})/og) {
        my $subStr = $1;
        if (exists $temp{aCount}{$subStr}) {

          push @{$temp{aCount_array}},$temp{aCount}{$subStr};

          if (scalar(@{$temp{aCount_array}}) == $WINDOW_SIZE) {

                my $sum = eval (join('+',@{$temp{aCount_array}}));
                shift @{$temp{aCount_array}};
                #Similar approach has been taken to other 33 rules
          }

        }

        if (exists $temp{cCount}{$subStr}) {
             #similar approach 
        }

        $line =~s/.{1}//og;
     }
   }

Is there another way to speed up the whole process?

1 Answer

Regular expressions are great, but they can be overkill when all you need are fixed-width substrings. One alternative is substr:

    my $len = length($line);
    for (my $i = 0; $i < $len; $i += 2) {
       my $subStr = substr($line, $i, 2);
       ...
    }

Or unpack:

    foreach my $subStr (unpack "(A2)*", $line) {
       ...
    }
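
To show how the unpack loop might slot into the question's code, here is a rough sketch that also keeps a running total per property instead of calling eval on a joined string for every window. The %table contents, the %state and %sums hashes, and score_line are all hypothetical names and values invented for this example. The real tables would hold the 33 properties.

```perl
use strict;
use warnings;

my $WINDOW_SIZE = 5;

# Hypothetical lookup tables; the real ones would cover all 33 properties.
my %table = (
    aCount => { aa => 2, at => 1, ta => 1 },
    cCount => { cc => 2, ca => 1, ac => 1 },
);

# Per-property state: the most recent values plus their running total.
my %state = map { $_ => { recent => [], total => 0 } } keys %table;
my %sums;    # property name => list of window sums

sub score_line {
    my ($line) = @_;
    for my $pair (unpack '(A2)*', $line) {
        for my $prop (keys %table) {
            my $value = $table{$prop}{$pair};
            next unless defined $value;    # pair not in this table: skip, as the question's code does
            my $st = $state{$prop};
            push @{ $st->{recent} }, $value;
            $st->{total} += $value;
            next if @{ $st->{recent} } < $WINDOW_SIZE;
            push @{ $sums{$prop} }, $st->{total};
            # Drop the oldest value so the window slides by one pair.
            $st->{total} -= shift @{ $st->{recent} };
        }
    }
    return;
}

score_line('aaaaatataaaaat');
print "aCount: @{ $sums{aCount} }\n";
```

Like the arrays in the question's %temp, the state here carries over from one line to the next, so calling score_line once per input line keeps the windows running across line boundaries.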

I don't know how much faster either of these would be than the regular expression, but I know how I would find out.
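
One way to find out is the core Benchmark module. The sketch below times the three extraction approaches against each other with cmpthese. The test string and the by_regex, by_substr and by_unpack helpers are assumptions made for this example, and the results will depend on the Perl version and on real input.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical input: a 70-base line, like those in the sample file.
my $line = 'acgt' x 17 . 'ac';

# Each routine only counts the pairs it extracts.
sub by_regex {
    my $n = 0;
    $n++ while $line =~ /(.{2})/g;
    return $n;
}

sub by_substr {
    my $n   = 0;
    my $len = length $line;
    for (my $i = 0; $i < $len; $i += 2) {
        my $pair = substr $line, $i, 2;
        $n++;
    }
    return $n;
}

sub by_unpack {
    my $n = 0;
    $n++ for unpack '(A2)*', $line;
    return $n;
}

# A negative count asks for at least that many CPU seconds per candidate.
cmpthese(-1, {
    regex  => \&by_regex,
    substr => \&by_substr,
    unpack => \&by_unpack,
});
```

cmpthese prints a table of rates and relative percentages, so the fastest approach for a given input can be read off directly.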

answered 2013-02-06T16:17:33.650