perl - File::Slurp into a multi-GB scalar - how to split it efficiently?

Question

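I have a multi-GB file to process in Perl. Reading the file line by line takes several minutes; reading it into a scalar via File::Slurp takes a couple of seconds. Good. Now, what is the most efficient way to process each "line" of the scalar? I imagine I should avoid modifying the scalar, e.g. lopping off each successive line as I process it, so that the scalar is not reallocated.

I tried this: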

use File::Slurp;
my $file_ref = read_file( '/tmp/tom_timings/tom_timings_15998', scalar_ref => 1  ) ;

for my $line (split /\n/, $$file_ref) {
    # process line
}

This runs in under a minute: adequate, but not great. Is there a faster way to do this? (I have more memory than God.)

score 6 · Accepted Answer

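split should be very fast unless you start swapping. The only way I can see to speed it up is to write an XS function that looks for LF directly instead of using a regex.

As an aside, you can save a lot of memory (split builds the complete list of lines before the loop starts) by switching to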

while ($$file_ref =~ /\G([^\n]*\n|[^\n]+)/g) {
    my $line = $1;
    # process line
}
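
The same traversal can also be sketched without a regex, using index and substr. This is not from the original answer: each_line is a hypothetical helper name, and whether it beats split or the \G loop on a given perl is an assumption that should be benchmarked. Like the loop above, it never modifies the buffer and holds only one line at a time. Unlike the loop above, it returns lines without the trailing newline.

```perl
use strict;
use warnings;

# Hypothetical helper: call $callback->($line) for each line of $$buf_ref,
# with the trailing LF removed. The buffer is only read, never modified.
sub each_line {
    my ($buf_ref, $callback) = @_;
    my $len = length $$buf_ref;
    my $pos = 0;
    while ($pos < $len) {
        my $nl = index($$buf_ref, "\n", $pos);
        $nl = $len if $nl < 0;    # final line without a trailing LF
        my $line = substr($$buf_ref, $pos, $nl - $pos);
        $callback->($line);
        $pos = $nl + 1;           # continue just past the LF
    }
    return;
}

my $text = "foo\nbar\n\nbaz";
my @seen;
each_line(\$text, sub { push @seen, $_[0] });
print "<$_>\n" for @seen;
```

One difference from split /\n/ is worth keeping in mind: split drops trailing empty fields, whereas this sketch reports every empty line it encounters.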

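Here is the XS function mentioned above. If you don't want the lines chomped, move the newSVpvn_flags line after the if statement that follows it.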

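/* Removes the first line from buf_sv and returns it without its LF.
   Returns undef once the buffer is empty. */
SV* next_line(SV* buf_sv) {
    STRLEN buf_len;
    const char* buf = SvPV_force(buf_sv, buf_len);
    const char* next_line_ptr;
    const char* buf_end;
    SV* rv;

    if (!buf_len)
        return &PL_sv_undef;

    /* Scan forward to the first LF or to the end of the buffer. */
    next_line_ptr = buf;
    buf_end = buf + buf_len;
    while (next_line_ptr != buf_end && *next_line_ptr != '\n')
        ++next_line_ptr;

    /* Copy the line (LF excluded), keeping the buffer's UTF8 flag. */
    rv = newSVpvn_flags(buf, next_line_ptr-buf, SvUTF8(buf_sv) ? SVf_UTF8 : 0);

    /* Step over the LF, if there was one. */
    if (next_line_ptr != buf_end)
        ++next_line_ptr;

    /* Drop the consumed bytes from the front of the buffer. */
    sv_chop(buf_sv, next_line_ptr);
    return rv;  /* Typemap will mortalize */
}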

To test:

use strict;
use warnings;

use Inline C => <<'__EOC__';

SV* next_line(SV* buf_sv) {
    ...
}

__EOC__

my $s = <<'__EOI__';
foo
bar
baz
__EOI__

while (defined($_ = next_line($s))) {
   print "<$_>\n";
}
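
For comparison, the behavior of next_line can be modeled in pure Perl, which is handy for checking the XS version against a reference. This is only a sketch: next_line_pp and its $keep_newline flag are hypothetical names, not part of the answer, and it takes a reference to the buffer rather than the scalar itself. Setting the flag corresponds to moving the newSVpvn_flags line as described above.

```perl
use strict;
use warnings;

# Hypothetical pure-Perl model of next_line(): removes the first line from
# $$buf_ref and returns it, or returns undef once the buffer is empty.
sub next_line_pp {
    my ($buf_ref, $keep_newline) = @_;
    return undef unless length $$buf_ref;

    my $nl      = index($$buf_ref, "\n");
    my $has_nl  = $nl >= 0;
    my $consume = $has_nl ? $nl + 1 : length $$buf_ref;

    # 4-argument substr cuts the consumed characters off the front.
    my $line = substr($$buf_ref, 0, $consume, '');
    chop $line if $has_nl && !$keep_newline;    # drop the LF unless asked to keep it
    return $line;
}

my $s = "foo\nbar\nbaz\n";
my @lines;
while (defined(my $line = next_line_pp(\$s))) {
    push @lines, $line;
}
print "<$_>\n" for @lines;
```

As with the XS function, the buffer is consumed as it is read, so $s is empty once the loop finishes.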
