regex - 提高 Perl 搜索文件脚本的性能

Question

我最近注意到，我用 Perl 编写的一个用于 10MB 以下文件的快速脚本已被修改、重新分配并用于 40MB 以上的文本文件，在批处理环境中存在严重的性能问题。

遇到大型文本文件时，这些作业每次运行大约运行 12 小时，我想知道如何提高代码的性能？我是否应该将文件啜饮到内存中，如果我这样做会破坏工作对文件中行号的依赖。任何建设性的想法将不胜感激，我知道这项工作在文件中循环了太多次，但如何减少这种情况？

#!/usr/bin/perl
use strict;
use warnings;

my $filename = "$ARGV[0]"; # This is needed for regular batch use 
my $cancfile = "$ARGV[1]"; # This is needed for regular batch use 
my @num =();
open(FILE, "<", "$filename") || error("Cannot open file ($!)");
while (<FILE>)
{
    push (@num, $.) if (/^P\|/)
}
close FILE;

my $start;
my $end;

my $loop = scalar(@num);
my $counter =1;
my $test;

open (OUTCANC, ">>$cancfile") || error ("Could not open file: ($!)");

#Lets print out the letters minus the CANCEL letters
for ( 1 .. $loop )
{
    $start = shift(@num) if ( ! $start );
    $end = shift(@num);
    my $next = $end;
    $end--;
    my $exclude = "FALSE";

    open(FILE, "<", "$filename") || error("Cannot open file ($!)");
    while (<FILE>)
    {
        my $line = $_;
        $test = $. if ( eof );
        if ( $. == $start && $line =~ /^P\|[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\|1I\|IR\|/)
        {
            print OUTCANC "$line";
            $exclude = "TRUECANC";
            next;
        }
        if ( $. >= $start && $. <= $end && $exclude =~ "TRUECANC")
        {
            print OUTCANC "$line";
        } elsif ( $. >= $start && $. <= $end && $exclude =~ "FALSE"){
            print $_;
        }
    }
    close FILE;
    $end = ++$test if ( $end < $start );
    $start = $next if ($next);
}


#Lets print the last letter in the file

my $exclude = "FALSE";

open(FILE, "<", "$filename") || error("Cannot open file ($!)");
while (<FILE>)
{
    my $line = $_;
    if ( $. == $start && $line =~ /^P\|[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\|1I\|IR\|/)
    {
        $exclude = "TRUECANC";
        next;
    }
    if ( $. >= $start && $. <= $end && $exclude =~ "TRUECANC")
    {
        print OUTCANC "$line";
    } elsif ( $. >= $start && $. <= $end && $exclude =~ "FALSE"){
        print $_;
    }
}
close FILE;
close OUTCANC;


#----------------------------------------------------------------

sub message
{
    my $m = shift or return;
    print("$m\n");
}

sub error
{
    my $e = shift || 'unknown error';
    print("$0: $e\n");
    exit 0;
}

score 2 · Accepted Answer

有些事情可以加快脚本的速度，比如删除不必要的正则表达式使用。

/^P\|/相当于"P|" eq substr $_, 0, 2。
$foo =~ "BAR"可能是-1 != index $foo, "BAR"。

然后是一些重复的代码。将其分解到 sub 本身不会提高性能，但可以更容易地推断脚本的行为。

有很多不必要的字符串化，比如"$filename"-$filename单独就足够了。

但最严重的违规者将是：

for ( 1 .. $loop ) {
  ...
  open FILE, "<", $filename or ...
  while (<FILE>) {
    ...
  }
  ...
}

您只需要读取该文件一次，最好读取到一个数组中。您可以遍历索引：

for ( 1 .. $loop ) {
  ...
  for my $i (0 .. $#file_contents) {
    my $line = $file_contents[$i];
    ... # swap $. for $i, but avoid off-by-one error
  }
  ...
}

磁盘 IO 很慢，所以尽可能缓存！

我还看到您将$exclude变量用作具有值的布尔值FALSE和TRUECANC. 为什么不0and 1，所以您可以直接在条件中使用它？

您可以在 if/elsif 中排除常见测试：

if    (FOO && BAR) { THING_A }
elsif (FOO && BAZ) { THING_B }

应该

if (FOO) {
    if    (BAR) { THING_A }
    elsif (BAZ) { THING_B }
}

该$. == $start && $line =~ /^P\|.../测试可能很愚蠢，因为$start只包含以开头的行数P|- 所以这里的正则表达式可能就足够了。

编辑

如果我正确理解了脚本，那么以下应该会显着提高性能：

#!/usr/bin/perl
use strict;
use warnings;

my ($filename, $cancfile) = @ARGV;
open my $fh, "<", $filename or die "$0: Couldn't open $filename: $!";

my (@num, @lines);
while (<$fh>)
{
    push @lines, $_;
    push @num, $#lines if "P|" eq substr $_, 0, 2;
}

open my $outcanc, ">>", $cancfile or die "$0: Couldn't open $cancfile: $!";

for my $i ( 0 .. $#num )
{
    my $start = $num[$i];
    my $end   = ($num[$i+1] // @lines) - 1;
    # pre v5.10:
    # my $end = (defined $num[$i+1] ? $num[$i+1] : @lines) - 1

    if ($lines[$start] =~ /^P[|][0-9]{9}[|]1I[|]IR[|]/) {
        print {$outcanc} @lines[$start .. $end];
    } else {
        print STDOUT     @lines[$start .. $end];
    }
}

脚本已清理。该文件缓存在一个数组中。只有数组中真正需要的部分被迭代——我们从之前的O(n · m)下降到了O(n ) 。

对于您未来的脚本：证明围绕循环和变异变量的行为并非不可能，但既乏味又烦人。意识到这一点

for (1 .. @num) {
  $start = shift @num unless $next;  # aka "do this only in the first iteration"
  $next = shift @num:
  $end = $next - 1:
  while (<FH>) {
    ...
    $test = $. if eof
    ...
  }
  $end = ++test if $end < $start;
  $start = $next if $next;
}

实际上就是undef在 2nd 中规避一个可能shift需要一些时间。我们可以在循环之后选择行号，而不是在内部循环中测试eof，所以我们不需要$test. 然后我们得到：

$start = shift @num;
for my $i (1 .. @num) {
  $end = $num[$i] - 1:

  while (<FH>) { ... }

  $end = $. + 1 if $end < $start;  # $end < $start only true if not defined $num[$i]
  $start = $num[$i] if $num[$i];
}

向下平移$i1 后，我们将越界问题限制在一个点：

for my $i (0 .. $#num) {
  $start = $num[$i];
  $end = $num[$i+1] - 1; # HERE: $end = -1 if $i == $#num

  while (<FH>) { ... }
}
$end = $. + 1 if $end < $start;

将文件读取替换为数组后（注意，数组索引和行号之间存在差异），我们看到如果我们将那个迭代拉到循环中，可以避免最终的文件读取循环for，因为我们知道总共有多少行。可以这么说，我们做到了

$end = ($num[$i+1] // $last_line_number) - 1;

希望我清理的代码确实等同于原始代码。

regex - 提高 Perl 搜索文件脚本的性能

1 回答 1

编辑

Related

Reference