regex - How do I use File::Map for regex search/replace on a large text file to avoid "Out of memory" errors?

Question

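Update 2: Solved. See below.

I am converting a large txt file from an old DOS-based library program into a more usable format. I have only just started with Perl and managed to put together a script like this: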

BEGIN {undef $/; };
open $in,  '<',  "orig.txt"      or die "Can't read old file: $!"; 
open $out, '>',  "mod.txt"  or die "Can't write new file: $!";
while( <$in> )  
{
$C=s/foo/bar/gm;
print "$C matches replaced.\n"
etc...
print $out $_;
}
close $out;

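It is quite fast, but after a while I always get an "Out of memory" error caused by a lack of RAM/swap space (I am on Win XP with 2GB of RAM and a 1.5GB swap file). After looking around for ways to handle large files, File::Map seemed to me like a good way to avoid this problem. I am having trouble implementing it, though. This is what I have so far: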

#!perl -w
use strict; 
use warnings;
use File::Map qw(map_file);

my $out = 'output.txt';
map_file my $map, 'input.txt', '<';
$map =~ s/foo/bar/gm;

print $out $map;

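But I get the following error: Modification of a read-only value attempted at gott.pl line 8.

Also, I read on the File::Map help page that on non-Unix systems I need to use binmode. How do I do that?

Basically, what I want to do is "load" the file via File::Map and then run code like the following: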

$C=s/foo/bar/gm;
print "$C matches found and replaced.\n"

$C=s/goo/far/gm;
print "$C matches found and replaced.\n"
while(m/complex_condition/gm)
{ 
$C=s/complex/regex/gm;
$run_counter++;
}
print "$C matches replaced. Script looped $run_counter times.\n";
etc...

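I hope I haven't overlooked something too obvious, but the examples on the File::Map help page only show how to read from a mapped file, correct?

EDIT:

To better illustrate what I currently cannot accomplish because I run out of memory, here is an example:

At http://pastebin.com/6Ehnx6xA there is a sample of one of our exported library records (txt format). I am interested in the +Deskriptoren: part starting on line 46. These are subject classifiers, organized in a tree hierarchy.

What I want is to expand each classifier with its complete chain of parent nodes, but only if the parent node is not already present before or after the child node in question. That means turning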

+Deskriptoren
-foo
-Cultural Revolution
-bar

into

+Deskriptoren
-foo
-History
-Modern History
-PRC
-Cultural Revolution
-bar

The regex I currently use relies on lookbehind and lookahead to avoid duplicates, and is therefore somewhat more complicated than s/foo/bar/g;:

s/(?<=\+Deskriptoren:\n)((?:-(?!\QParent-Node\E).+\n)*)-(Child-Node_1|Child-Node_2|...|Child-Node_11)\n((?:-(?!Parent-Node).+\n)*)/${1}-Parent-Node\n-${2}\n${3}/g;

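But it works! Until Perl runs out of memory... :/

So in essence I need a way to do multi-line manipulation on a large file (80MB). Processing time is not an issue. That is why I thought of File::Map. Another option could be to process the file in several steps, with chained Perl scripts calling each other and then terminating, but I would like to keep it in one place as much as possible.

UPDATE 2:

I managed to get it working with Schwelm's code below. My script now calls the following subroutine, which in turn calls two nested subroutines. Example code is at: http://pastebin.com/SQd2f8ZZ

I am still not quite satisfied, because I could not get File::Map to work. Oh well... I guess the line-by-line approach is more efficient anyway.

Thanks, everyone!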

score 7 · Accepted Answer

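When you set $/ (the input record separator) to undefined, you are "slurping" the file: reading its entire contents in one go (this is discussed in perlvar, for example). Hence the out-of-memory problem.

Instead, process the file one line at a time, if you can: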

while (my $line = <$in>){
    # Do stuff.
}
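
As a rough illustration of the line-at-a-time approach, the sketch below applies a single-line substitution while copying input to output and keeps a running count of replacements, much like the $C counter in the question. The helper name replace_by_line and the sample text are made up for this example, and in-memory filehandles stand in for orig.txt and mod.txt so that it runs on its own. It only covers patterns that never span a line break.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): copy $in to $out one
# line at a time, apply a substitution, and return the total number of
# replacements made.
sub replace_by_line {
    my ($in, $out, $from, $to) = @_;
    my $total = 0;
    while (my $line = <$in>) {
        # s///g returns the number of substitutions, or false if there were none.
        $total += ($line =~ s/$from/$to/g) || 0;
        print {$out} $line;
    }
    return $total;
}

# In-memory filehandles stand in for real files in this demo.
my $input = "foo one\nno match\nfoo foo two\n";
open my $in,  '<', \$input     or die "Can't open in-memory input: $!";
open my $out, '>', \my $output or die "Can't open in-memory output: $!";

my $count = replace_by_line($in, $out, qr/foo/, 'bar');
close $out;

print "$count matches replaced.\n";
```

Only one line is held in memory at a time, so memory use should stay flat regardless of file size.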

In situations where the file is small enough and you do slurp it, there is no need for the while loop. The first read gets everything:

{
    local $/ = undef;
    my $file_content = <>;
    # Do stuff with the complete file.
}

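Update

After seeing your massive regex, I would urge you to reconsider your strategy. Tackle this as a parsing problem: process the file one line at a time, storing information about the parser's state as needed. This approach lets you work through the information in simple, easily understood (even testable) steps.

Your current strategy (one might call it the slurp-and-hit-it-with-a-massive-regex strategy) is difficult to understand and maintain (in 3 months, will your regex make immediate sense to you?), difficult to test and debug, and difficult to adjust if you discover unexpected deviations from your initial understanding of the data. In addition, as you have discovered, the strategy is vulnerable to memory limitations (because of the need to slurp the file).

There are many questions on StackOverflow illustrating how to parse text when the meaningful units span multiple lines. Also see this question, where I gave similar advice to another asker.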
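
To make "storing information about the parser's state" a little more concrete, here is a minimal sketch that reads a record line by line and uses a single flag to remember whether it is currently inside a +Deskriptoren: block. The sample record and the field names +Titel: and +Signatur: are invented for illustration. Only +Deskriptoren: and the leading "-" on entries come from the question.

```perl
use strict;
use warnings;

# Hypothetical sample, loosely modeled on the record layout in the question.
my $sample = "+Titel:\n-Some book\n+Deskriptoren:\n-foo\n"
           . "-Cultural Revolution\n-bar\n+Signatur:\n-X 1\n";
open my $in, '<', \$sample or die "Can't open in-memory sample: $!";

my $in_descriptors = 0;   # parser state: inside a +Deskriptoren: block?
my @descriptors;

while (my $line = <$in>) {
    chomp $line;

    if ($line =~ /^\+/) {
        # Every header line starts a new section; remember which kind it is.
        $in_descriptors = ($line eq '+Deskriptoren:') ? 1 : 0;
        next;
    }

    # Entries are collected only while the flag says we are in the right block.
    if ($in_descriptors && $line =~ /^-(.+)/) {
        push @descriptors, $1;
    }
}

print "Found descriptors: @descriptors\n";
```

Each branch does one small job, so it can be tested separately from the rest of the script.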

score 3 · Accepted Answer

Some simple parsing can break the file into manageable chunks. The algorithm is:

1. Read until you see `+Deskriptoren:`
2. Read everything after that until the next `+Foo:` line
3. Munge that bit.
4. Goto 1.

Here is a sketch of the code:

use strict;
use warnings;
use autodie;

my $input_file  = 'orig.txt';
my $output_file = 'mod.txt';

open my $in,  '<', $input_file;
open my $out, '>', $output_file;

while(my $line = <$in>) {
    # Print out everything you don't modify
    # this includes the +Deskriptoren line.
    print $out $line;

    # When the start of a description block is seen, slurp in up to
    # the next section.
    if( $line =~ m{^ \+Deskriptoren: }x ) {
        my($section, $next_line) = _read_to_next_section($in);

        # Print the modified description
        print $out _munge_description($section);

        # And the following header line.
        print $out $next_line;
    }
}

sub _read_to_next_section {
    my $in = shift;

    my $section = '';
    my $line;
    while( $line = <$in> ) {
        last if $line =~ /^ \+ /x;
        $section .= $line;
    }

    # When reading the last section, there might not be a next line
    # resulting in $line being undefined.
    $line = '' if !defined $line;
    return($section, $line);
}

# Note, the +Deskriptoren line is not on $description
sub _munge_description {
    my $description = shift;

    # ... whatever you want to do to the description ...

    return $description;
}

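I haven't tested it, but something like that should do the trick. Compared with treating the whole file as one string (with File::Map or otherwise), it has the advantage that you can deal with each section individually rather than trying to cover every base in one regex. It also lets you develop a more sophisticated parser to deal with things like comments and strings, which could trip up the simple parsing above and would be a huge pain to work into a massive regex.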
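
For illustration only, here is one possible body for the _munge_description placeholder, reproducing the parent-chain expansion shown in the question. The %parents_of table is hypothetical: it holds just the one chain from the question's example, and a real script would need the full hierarchy. The sketch assumes one entry per line in the "-Name" form and drops trailing blank lines in the section.

```perl
use strict;
use warnings;

# Hypothetical lookup table: child descriptor => chain of parents, outermost
# first. The single entry mirrors the example in the question.
my %parents_of = (
    'Cultural Revolution' => [ 'History', 'Modern History', 'PRC' ],
);

sub _munge_description {
    my $description = shift;

    my @lines = split /\n/, $description;
    my %seen  = map { $_ => 1 } @lines;    # entries look like "-History"

    my @out;
    for my $line (@lines) {
        if ($line =~ /^-(.+)/) {
            for my $parent (@{ $parents_of{$1} || [] }) {
                # Skip parents already in the section or already inserted.
                next if $seen{"-$parent"}++;
                push @out, "-$parent";
            }
        }
        push @out, $line;
    }
    return join '', map { "$_\n" } @out;
}

my $expanded = _munge_description("-foo\n-Cultural Revolution\n-bar\n");
my $partial  = _munge_description("-History\n-Cultural Revolution\n");
print $expanded;
```

Because each section is a short string, the presence check becomes a plain hash lookup instead of the lookbehind and lookahead assertions in the original regex.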

score 1 · Accepted Answer

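You are using mode <, which is read-only. If you want to modify the contents, you need read-write access, so you should be using +<.

If you are on Windows and need binary mode, then you should open the file separately, set binary mode on the file handle, and then map from that handle.

I also noticed that you have an input file and an output file. If you use File::Map, you are changing the file in place... that is, you cannot open one file for reading and change the contents of a different file. You need to copy the file, then modify the copy. I have done so below.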

use strict;
use warnings;

use File::Map qw(map_handle unmap);
use File::Copy;

copy("input.txt", "output.txt") or die "Cannot copy input.txt to output.txt: $!\n";

open my $fh, '+<', "output.txt"
    or die "Cannot open output.txt in r/w mode: $!\n";

binmode($fh);

map_handle my $contents, $fh, '+<';

my $n_changes = ( $contents =~ s/from/to/gm );

unmap($contents);
close($fh);

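The documentation for File::Map is not very clear about how errors are signaled, but judging from the source, checking whether $contents is undefined seems like a good guess.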
