perl - How do I read a large file with different line separators?

Question

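I have two very large XML files that use different line endings. File A has CR LF at the end of each XML record. File B has only CR at the end of each XML record.

To read file B correctly, I have to set the built-in Perl variable $/ to "\r". But if I run the same script on file A, it does not read the file line by line and instead reads it as a single line.

How can I make the script work with text files that use any of these line endings? In the code below, the script reads the XML data and then uses a regular expression to split records on a specific XML end tag (such as </record>). Finally, it writes the requested records to a file.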

open my $file_handle, '+<', $inputFile or die $!;
local $/ = "\r";
while(my $line = <$file_handle>) { #read file line-by-line. Does not load whole file into memory.
    $current_line = $line;

    if ($spliceAmount > $recordCounter) { #if the splice amount hasn't been reached yet
        push (@setofRecords,$current_line); #start adding each line to the set of records array
        if ($current_line =~ m|$recordSeparator|) { #check for the node to splice on
            $recordCounter ++; #if the record separator was found (end of that record) then increment the record counter
        }
    } 
    #don't close the file because we need to read the last line

}
$current_line =~/(\<\/\w+\>$)/;
$endTag = $1;
print "\n\n";
print "End Tag: $endTag \n\n";

close $file_handle;

score 1 · Accepted Answer

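Although you may not strictly need one for this task, in theory you should parse XML with an XML parser. I recommend XML::LibXML, or perhaps XML::Simple to start with.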
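A parser also sidesteps the line-ending problem, because the XML 1.0 specification requires parsers to normalize both CR LF and a lone CR to LF before handing text to the application. As a rough sketch of how this could look for a large file, the pull-style XML::LibXML::Reader interface can walk the document one element at a time. The helper name first_records and the element name passed to it are assumptions made for illustration; they do not come from the question.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical helper: return the first $limit <$tag> elements of $xml_file
# as strings, without loading the whole document into memory.
sub first_records {
    my ($xml_file, $tag, $limit) = @_;

    my $reader = XML::LibXML::Reader->new(location => $xml_file)
        or die "Cannot read $xml_file\n";

    my @records;
    # nextElement() advances to the next element with the given name
    # and returns false once the document is exhausted.
    while (@records < $limit && $reader->nextElement($tag)) {
        # readOuterXml() serializes the current element, tags included.
        push @records, $reader->readOuterXml;
    }
    return @records;
}
```

Something like first_records($inputFile, 'record', $spliceAmount) would then stand in for the line-by-line loop, assuming the records are sibling elements named record.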

score 0 · Accepted Answer

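If the file is not too large to hold in memory, you can slurp the whole thing into a scalar and then split it into lines with a suitably flexible regular expression. For example,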

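local $/ = undef;                    # slurp mode: read the whole file at once
my $data = <$file_handle>;
my @lines = split /(?<=\n)|(?<=\r)(?!\n)/, $data;
foreach my $line (@lines) {
    ...
}

The lookbehind assertions (?<=...) match without consuming anything, so each line keeps its terminator, just as it would when read with the regular <> operator. The (?!\n) keeps a CR LF pair from being split in two. Note that an atomic group (?>...) would not do this, because it consumes the characters it matches. If you would rather discard the terminators, you can save yourself a step by passing /\r\n|\r|\n/ to split.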
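For files that are too large to slurp, the same terminator pattern can be applied to fixed-size chunks instead. The following is a minimal sketch and not part of the original answer; the helper name each_line and its $chunk_size parameter are hypothetical. It discards the terminators and holds back a CR that ends a chunk, since that CR may be the first half of a CR LF pair.

```perl
use strict;
use warnings;

# Hypothetical helper: call $callback->($line) for every line in $path,
# treating CR LF, a lone CR and a lone LF all as line terminators.
# $chunk_size is an assumed tuning knob; lines are passed as undecoded bytes.
sub each_line {
    my ($path, $callback, $chunk_size) = @_;
    $chunk_size ||= 64 * 1024;

    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $buffer = '';

    while (read($fh, my $chunk, $chunk_size)) {
        $buffer .= $chunk;
        # Strip complete lines off the front of the buffer. A CR at the very
        # end of the buffer is left alone (?!\z) until the next chunk shows
        # whether an LF follows it.
        while ($buffer =~ s/\A([^\r\n]*)(?:\r\n|\r(?!\z)|\n)//) {
            my $line = $1;    # copy before the callback can clobber $1
            $callback->($line);
        }
    }
    close $fh;

    # Whatever remains is the last line: either unterminated or ending in
    # the CR that was held back.
    if (length $buffer) {
        $buffer =~ s/\r\z//;
        $callback->($buffer);
    }
    return;
}
```

A call such as each_line($inputFile, sub { push @setofRecords, $_[0] }) would then replace the while loop in the question, and memory use stays bounded by the chunk size plus the longest line.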
