perl - Optimizing a Perl script - runs too slowly on 40GB+ files

Question

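I wrote the following Perl script to handle some file processing at work, but it currently runs too slowly to go into production.

I don't know Perl very well (it's not one of my languages), so could someone help me identify and replace the parts of this script that will be slow when processing about 40 million lines?

The input data has this format: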

col1|^|col2|^|col3|!|
col1|^|col2|^|col3|!|
... 40 million of these.

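The date_cols array is computed before this part of the script; it holds the indexes of the columns that contain dates in the pre-conversion format.

This is the part of the script that is executed for every input line. I have cleaned it up a bit and added comments, but let me know if anything else is needed: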

## Read from STDIN until no more lines are available.
while (<STDIN>)
{       
    ## Split by field delimiter
    my @fields = split('\|\^\|', $_, -1);   

    ## Remove the terminating delimiter from the final field so it doesn't
    ## interfere with date processing.
    $fields[-1] = (split('\|!\|', $fields[-1], -1))[0];

    ## Cycle through all column numbers in date_cols and convert date
    ##  to yyyymmdd
    foreach $col (@date_cols)
    {
        if ($fields[$col] ne "")
        {
            $fields[$col] = formatTime($fields[$col]);
        }
    }

    print(join('This is an unprintable ASCII control code', @fields), "\n");
}           

## Format the input time to yyyymmdd from 'Dec 26 2012 12:00AM' like format.
sub formatTime($)
{
    my $col = shift;        

    if (substr($col, 4, 1) eq " ") {
        substr($col, 4, 1) = "0";
    }       
    return substr($col, 7, 4).$months{substr($col, 0, 3)}.substr($col, 4, 2);
}

score 3 · Accepted Answer

If I were writing purely for efficiency, this is how I would write your code:

sub run_loop {
  local $/ = "|!|\n"; # set the input record separator
                      # to the record separator of our problem space
  while (<STDIN>) {
    # remove the record separator
    chomp;

    # Split by field delimiter
    my @fields = split m/\|\^\|/, $_, -1;

    # Cycle through all column numbers in date_cols and convert date
    #  to yyyymmdd
    foreach my $col (@date_cols) {
      if ($fields[$col] ne "") {
        # $fields[$col] = formatTime($fields[$col]);
        my $temp = $fields[$col];
        if (substr($temp, 4, 1) eq " ") {
          substr($temp, 4, 1) = "0";
        }       
        $fields[$col] = substr($temp, 7, 4).$months{substr($temp, 0, 3)}.substr($temp, 4, 2);
      }
    }
    print join("\022", @fields) . "\n";
  }
}

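The optimizations are:

Using chomp to remove the |!|\n string at the end of each record (chomp removes whatever $/ is set to).
Inlining the formatTime sub.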
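
The interaction between $/ and chomp can be tried in isolation. The following is a minimal sketch, not part of the answer's code: read_records is a hypothetical helper name, and the two-record sample string is made up to match the input format shown in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: read records terminated by "|!|\n" from a filehandle
# and return them as an array ref of field lists.
sub read_records {
    my ($fh) = @_;
    local $/ = "|!|\n";              # input record separator, for this scope only
    my @records;
    while (my $line = <$fh>) {
        chomp $line;                 # chomp strips the current value of $/
        push @records, [ split /\|\^\|/, $line, -1 ];
    }
    return \@records;
}

# Demo on an in-memory filehandle holding two made-up records.
my $sample = "a|^|b|^|c|!|\nd|^||^|f|!|\n";
open my $fh, '<', \$sample or die "cannot open in-memory handle: $!";
my $records = read_records($fh);
close $fh;

print join(',', @$_), "\n" for @$records;
```

Because $/ is set with local, the default line-by-line behaviour is restored as soon as the sub returns, so other reads in the program are unaffected.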

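Subroutine calls are quite expensive in Perl. If you must use subs and need them to be as cheap as possible, the &subroutine(@args) syntax disables prototype checking. If the argument list is omitted altogether, as in &subroutine;, the caller's current @_ is visible to the called sub. This can cause bugs or buy some extra performance, so use it judiciously. The goto &subroutine; syntax also exists, but it interferes with return (it is basically a tail call). Don't use it.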
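
To make the shared-@_ behaviour concrete, here is a small sketch. The names pad_day and shared_call are hypothetical and do not appear in the original script; the point is only to show that a sub called as &name; works on the caller's @_, whose elements are aliases of the caller's arguments.

```perl
use strict;
use warnings;

# Hypothetical helper: zero-pads the day in place by writing through $_[0].
sub pad_day {
    substr($_[0], 4, 1) = "0" if substr($_[0], 4, 1) eq " ";
    return;
}

# Calls pad_day with the "&name;" form: no new @_ is built, so pad_day
# sees (and can modify) exactly the arguments that shared_call received.
sub shared_call {
    &pad_day;
    return $_[0];
}

my $date = "Dec  6 2012 12:00AM";
shared_call($date);

# $date itself has changed, because @_ aliases the caller's variables.
print "$date\n";
```

This is the kind of action at a distance the warning above refers to: skipping the argument copy saves work, but the callee can now change data two call levels up.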

Further optimizations could include removing the hash lookup in %months, since hashing is expensive.
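
The answer does not say what would replace the hash. One possibility, sketched below under that assumption, is to find the month with index() on a packed string and read the two-digit number from an array. The names format_time_no_hash, $MONTH_NAMES and @MONTH_NUM are hypothetical, and whether this is actually faster than the hash lookup is not established here; it would need to be measured (for example with the core Benchmark module) on real data.

```perl
use strict;
use warnings;

# All month abbreviations packed together; each one starts at a multiple of 3.
my $MONTH_NAMES = "JanFebMarAprMayJunJulAugSepOctNovDec";
# Precomputed two-digit month numbers: "01" .. "12".
my @MONTH_NUM = map { sprintf '%02d', $_ } 1 .. 12;

# Hypothetical variant of formatTime that avoids %months.
# Input is assumed to look like 'Dec 26 2012 12:00AM'.
sub format_time_no_hash {
    my ($date) = @_;
    my $pos = index($MONTH_NAMES, substr($date, 0, 3));
    # Assumption: an unrecognised month leaves the value untouched.
    return $date if $pos < 0 || $pos % 3;
    my $day = substr($date, 4, 2);
    $day =~ tr/ /0/;                 # ' 6' becomes '06'
    return substr($date, 7, 4) . $MONTH_NUM[$pos / 3] . $day;
}

print format_time_no_hash("Dec 26 2012 12:00AM"), "\n";
print format_time_no_hash("Jan  6 2013 12:00AM"), "\n";
```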

score 2 · Accepted Answer

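You would have to benchmark this against your own data set to compare, but you could use regular expressions for the whole job. (Your very regex-unfriendly field and record separators make this worse than it needs to be!)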

my $i = 0;
our %months = map { $_ => sprintf('%02d', ++$i) } qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

while (<DATA>) {
  s! \|\^\| !\022!xg;  # convert field separator
  s/ \| !\| $ //xg;        # strip record terminator
  s/\b(\w{3}) ( \d|\d\d) (\d{4}) \d\d:\d\d[AP]M\b/${3} . $months{$1} . sprintf('%02d', $2) /eg;
  print;
}

This won't do what you want if one of the fields that is not in @date_cols happens to match the date regex.
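
The caveat is easy to reproduce. In the sketch below, convert_line is a hypothetical wrapper around the same substitution, and the sample line is invented: its first field is a genuine date column, while its second is free text that merely contains something shaped like a date.

```perl
use strict;
use warnings;

my $i = 0;
my %months = map { $_ => sprintf('%02d', ++$i) }
             qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# Hypothetical wrapper: applies the date substitution to a whole line,
# regardless of which column the match is in.
sub convert_line {
    my ($line) = @_;
    $line =~ s/\b(\w{3}) ( \d|\d\d) (\d{4}) \d\d:\d\d[AP]M\b/$3 . $months{$1} . sprintf('%02d', $2)/eg;
    return $line;
}

# Fields are already joined with the \022 output separator.
my $line = join "\022", "Dec 26 2012 12:00AM", "note: Jan  5 2013 09:30PM", "x";
my $out  = convert_line($line);

# Show the fields with a visible separator.
print join(' | ', split /\022/, $out), "\n";
```

Both fields are rewritten, including the free-text one. If that can occur in the real data, the per-column approach from the question is the safer one.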

score 0 · Accepted Answer

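At my job I sometimes need to grep through error logs and the like from 350+ frontends. I use a script template that I call "SMP grep" ;) It's simple:

stat the file to get the file length.
Compute the "chunk length" = file_length / num_processors.
Adjust each chunk's start and end so that they begin and end at a "\n". Just read(), find the "\n" and compute the offset.
fork() num_processors workers, each working on its own chunk.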
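
The answerer's own script is not shown, so the following is only a sketch of those four steps under stated assumptions. The names chunk_ranges, process_range and run_parallel are hypothetical; it uses readline after seek rather than a raw read() to find the next "\n"; and it leaves open how the workers' output is collected, since each worker would need its own output file if the original order matters.

```perl
use strict;
use warnings;

# Split a file into at most $workers byte ranges [start, end) whose
# boundaries fall just after a "\n".
sub chunk_ranges {
    my ($path, $workers) = @_;
    my $size = -s $path;
    return () unless $size;
    open my $fh, '<:raw', $path or die "open $path: $!";
    my $chunk  = int($size / $workers) || 1;
    my @starts = (0);
    for my $k (1 .. $workers - 1) {
        my $pos = $k * $chunk;
        last if $pos >= $size;
        seek $fh, $pos, 0 or die "seek: $!";
        my $rest = <$fh>;            # skip to the end of the current line
        $pos = tell $fh;
        push @starts, $pos if $pos < $size && $pos > $starts[-1];
    }
    close $fh;
    return map { [ $starts[$_], $_ < $#starts ? $starts[$_ + 1] : $size ] }
           0 .. $#starts;
}

# Feed every line that starts inside [start, end) to $callback.
sub process_range {
    my ($path, $start, $end, $callback) = @_;
    open my $fh, '<:raw', $path or die "open $path: $!";
    seek $fh, $start, 0 or die "seek: $!";
    while (tell($fh) < $end) {
        my $line = <$fh>;
        last unless defined $line;
        $callback->($line);
    }
    close $fh;
}

# Fork one worker per range and wait for all of them.
sub run_parallel {
    my ($path, $workers, $callback) = @_;
    my @pids;
    for my $range (chunk_ranges($path, $workers)) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {             # child process
            process_range($path, @$range, $callback);
            exit 0;
        }
        push @pids, $pid;
    }
    waitpid $_, 0 for @pids;
}
```

A call might look like run_parallel('input.dat', 8, sub { ... }), where the file name and worker count are placeholders. For the question's data, the default "\n" boundary also lines up with the "|!|\n" record terminator, provided no field contains an embedded newline.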

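This helps if you use regular expressions in the grep or do other CPU-bound work (which I think is your case). The admins complain that this script eats disk throughput, but on a server with 8 CPUs that is the only bottleneck left =) Also, obviously, if you need to parse a week's worth of data, you can split the work across servers.

If there's interest, I can post the code tomorrow.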
