I recently noticed that a quick Perl script I wrote for files under 10MB has been modified, redeployed, and pointed at text files of 40MB and up, and it now has serious performance problems in our batch environment.
When one of these jobs hits a large text file it runs for roughly 12 hours, and I'd like to know how I can improve the code's performance. Should I slurp the file into memory, or would doing that break the job's dependency on the line numbers in the file? Any constructive ideas would be greatly appreciated; I know the job loops through the file far too many times, but how do I cut that down? (A rough single-pass idea I've been playing with is sketched after the script.)
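For reference, the slurping variant I half-wrote looks like the fragment below (untested, with $filename being the same first argument the script takes). I think the line numbers would survive as array indices, since file line N is just element N - 1, but I'm not convinced it fixes the real problem of the repeated passes:

open(my $fh, "<", $filename) || die "Cannot open $filename ($!)";
my @lines = <$fh>; # slurp: the whole file sits in memory at once
close $fh;

# File line $n is $lines[$n - 1], so the @num bookkeeping in the
# script below could stay as it is:
my @num = grep { $lines[$_ - 1] =~ /^P\|/ } 1 .. @lines;

The full script as it runs today follows.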
#!/usr/bin/perl
use strict;
use warnings;
my $filename = $ARGV[0]; # This is needed for regular batch use
my $cancfile = $ARGV[1]; # This is needed for regular batch use
my @num = ();            # line numbers of the ^P| letter headers
# First pass: record the line number of every letter header (^P|).
open(my $fh, "<", $filename) || error("Cannot open file ($!)");
while (<$fh>)
{
    push(@num, $.) if (/^P\|/);
}
close $fh;
my $start;               # first line of the letter being processed
my $end;                 # last line of the letter being processed
my $loop = scalar(@num); # one pass over the file per letter header
my $test;                # will hold the last line number of the file
open(my $outcanc, ">>", $cancfile) || error("Could not open file: ($!)");
# Let's print out the letters minus the CANCEL letters
for ( 1 .. $loop )
{
    $start = shift(@num) if ( !$start );   # header line of this letter
    $end   = shift(@num) // 0;             # header line of the next letter (0 when none is left)
    my $next = $end;
    $end--;                                # last body line of this letter
    my $exclude = "FALSE";
    # One full pass over the file per letter: this is the expensive part.
    open(my $fh, "<", $filename) || error("Cannot open file ($!)");
    while (<$fh>)
    {
        my $line = $_;
        $test = $. if ( eof );             # remember the file's last line number
        # A cancel letter: its header matches P|<9 digits>|1I|IR|
        if ( $. == $start && $line =~ /^P\|[0-9]{9}\|1I\|IR\|/ )
        {
            print {$outcanc} $line;
            $exclude = "TRUECANC";
            next;
        }
        if ( $. >= $start && $. <= $end && $exclude eq "TRUECANC" )
        {
            print {$outcanc} $line;
        }
        elsif ( $. >= $start && $. <= $end && $exclude eq "FALSE" )
        {
            print $line;
        }
    }
    close $fh;
    $end   = ++$test if ( $end < $start ); # the last letter runs to end of file
    $start = $next if ($next);
}
# Let's print the last letter in the file
my $exclude = "FALSE";
open($fh, "<", $filename) || error("Cannot open file ($!)"); # reuse the first-pass handle
while (<$fh>)
{
    my $line = $_;
    # A cancel header was already written to the cancel file in the loop
    # above, so only mark the letter and skip the header here.
    if ( $. == $start && $line =~ /^P\|[0-9]{9}\|1I\|IR\|/ )
    {
        $exclude = "TRUECANC";
        next;
    }
    if ( $. >= $start && $. <= $end && $exclude eq "TRUECANC" )
    {
        print {$outcanc} $line;
    }
    elsif ( $. >= $start && $. <= $end && $exclude eq "FALSE" )
    {
        print $line;
    }
}
close $fh;
close $outcanc;
#----------------------------------------------------------------
sub message
{
    my $m = shift or return;
    print("$m\n");
}

sub error
{
    my $e = shift || 'unknown error';
    print("$0: $e\n");
    exit 0;
}
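And here is the rough single-pass rewrite I've been sketching, since I know the repeated passes are the real cost. It is untested and leans on two assumptions the script above also makes: every letter begins with a ^P| header line, and a header matching P|<9 digits>|1I|IR| marks a cancel letter. It additionally assumes there are no stray lines before the first header, which I have not verified against our real data:

#!/usr/bin/perl
use strict;
use warnings;

my ($filename, $cancfile) = @ARGV;

open(my $in,   "<",  $filename) || die "Cannot open $filename ($!)";
open(my $canc, ">>", $cancfile) || die "Cannot open $cancfile ($!)";

my $cancel = 0; # is the letter we are currently inside a cancel letter?
while (my $line = <$in>)
{
    # Every ^P| header starts a new letter: decide its destination once,
    # then stream the rest of the letter without re-reading the file.
    if ($line =~ /^P\|/)
    {
        $cancel = ($line =~ /^P\|[0-9]{9}\|1I\|IR\|/) ? 1 : 0;
    }
    print { $cancel ? $canc : \*STDOUT } $line;
}

close $in;
close $canc;

If those assumptions hold, the file is read exactly once instead of once per letter, so the work drops from roughly (letters x lines) to (lines), and the line-number bookkeeping disappears entirely. Is this the right direction?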