perl - Using the flip-flop operator to extract text between the SAME delimiters

Question

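In the past, when I had different START and END markers, I was able to use the flip-flop operator to extract text. This time I am having a lot of trouble, because my source file has no distinct delimiters: the START and END of the flip-flop are the same. I want the flip-flop to become true when a line begins with a year (yyyy) and to keep pushing $_ onto an array until another line begins with yyyy. The problem is that the flip-flop is then false at my next START.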

while (<SOURCEFILE>) {
  print if (/^2017/ ... /^2017/) 
}

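Running the above on the given source data misses the second multi-line section of the file, which I also need to match. Perhaps the flip-flop, which I thought was the best way to parse a multi-line file, does not work in this case? What I want is to start matching at the first line that begins with a date and keep matching up to the line before the next line that begins with a date.

Sample data: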

2017 message 1
Text
Text

Text

2017 message 2
more text
more text

more text

2017 message 3
yet more text
yet more text

yet more text

But I get:

2017 message 1
Text
Text

Text

2017 message 2
2017 message 3
yet more text
yet more text

yet more text

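...the content of message 2 is missing...

I cannot rely on blank lines or on a different END delimiter in the source data. What I want is to print every message (actually push @myarray, $_ and then test for a match), but here I lose the lines below message 2 because the flip-flop has been set to false. Is there a way to handle this with a flip-flop, or do I need to use something else? Thanks in advance to anyone who can help or advise.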

score 2 · Accepted Answer

Here is one way to do it:

use Modern::Perl;
use Data::Dumper;
my $part = -1;
my $parts;
while(<DATA>) {
    chomp;
    if (/^2017/ .. 1==0) {
        $part++ if /^2017/;
        push @{$parts->[$part]}, $_;
    }
}
say Dumper$parts;

__DATA__
2017 message 1
Text
Text

Text

2017 message 2
more text
more text

more text

2017 message 3
yet more text
yet more text

yet more text

Output:

$VAR1 = [
          [
            '2017 message 1',
            'Text',
            'Text',
            '',
            'Text',
            ''
          ],
          [
            '2017 message 2',
            'more text',
            'more text',
            '',
            'more text',
            ''
          ],
          [
            '2017 message 3',
            'yet more text',
            'yet more text',
            '',
            'yet more text'
          ]
        ];
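
Note that the right-hand operand `1==0` is always false, so the flip-flop above never switches off once it has switched on; its only job is to skip any lines before the first header. As a rough sketch of the same grouping in core Perl without the flip-flop (the name `group_messages` is my own, not something from the answer):

```perl
use strict;
use warnings;

# Hypothetical helper: group lines into records, one per /^2017/ header.
sub group_messages {
    my @lines = @_;
    my @parts;
    for my $line (@lines) {
        # A header line opens a new record.
        push @parts, [] if $line =~ /^2017/;
        # Lines seen before the first header are skipped.
        next unless @parts;
        push @{ $parts[-1] }, $line;
    }
    return \@parts;
}

my $parts = group_messages(
    '2017 message 1', 'Text', '',
    '2017 message 2', 'more text',
);
printf "%d records\n", scalar @$parts;    # prints "2 records"
```

Pushing onto `$parts[-1]` avoids the separate `$part` counter, at the cost of assuming the lines are already chomped and held in memory.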

score 1 · Accepted Answer

I don't know how to do this with a flip-flop. I tried it a year ago. But I did the same thing with some plain logic.

my $line_concat = "";   # start empty so the ne "" tests never compare undef
my $f = 0;
while (<DATA>) {
    if(/^2017/ && !$f) {
        $f = 1;
    }

    if (/^2017/) {
        print "$line_concat\n" if $line_concat ne "";
        $line_concat = "";
    }

    $line_concat .= $_ if $f;
}

print $line_concat if $line_concat ne "";

score 1 · Accepted Answer

As you have found, a flip-flop with matching delimiters doesn't work very well.
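
To see why, here is a small sketch that reproduces the behaviour from the question on a shortened sample. The helper name `flip_flop_select` and the four-line sample are made up for illustration. With `...`, the second header satisfies the END condition, so the range switches off and the body of message 2 is never selected:

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines selected by the question's flip-flop.
# The operator keeps its state between calls, so results are only
# predictable when the previous call left the range switched off.
sub flip_flop_select {
    my @selected;
    for (@_) {
        push @selected, $_ if /^2017/ ... /^2017/;
    }
    return @selected;
}

my @lines = ('2017 message 1', 'Text', '2017 message 2', 'more text');
# Prints the first three lines; "more text" is dropped.
print "$_\n" for flip_flop_select(@lines);
```

Using `..` instead of `...` would not help: the END condition would then be tested on the header line itself, so only the header lines would be selected.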

Have you considered setting $/?

For example:

#!/usr/bin/env perl
use strict;
use warnings; 

local $/ = "2017 message";
my $count;

while ( <DATA> ) {

    print "\nStart of block:", ++$count, "\n";

    print;

    print "\nEnd of block:", $count, "\n";
}

__DATA__
2017 message 1
Text
Text

Text

2017 message 2
more text
more text

more text

2017 message 3
yet more text
yet more text

yet more text

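It isn't perfect, though, because it splits the file on the delimiter - which means there is a "bit" before the first delimiter (so you get 4 blocks). You can splice things back together with judicious use of 'chomp', which removes $/ from the end of the current block: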

#!/usr/bin/env perl
use strict;
use warnings; 

local $/ = "2017 message";
my $count;

while ( <DATA> ) {
    #remove '2017 message'
    chomp;
    #check for empty (first) block
    next unless /\S/;
    print "\nStart of block:", ++$count, "\n";
    #re add '2017 message'
    print $/;
    print;

    print "\nEnd of block:", $count, "\n";
}
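
If you want the blocks back as a list rather than printed, the same idea can be wrapped in a function. This is only a sketch: `split_blocks` is a name I made up, it reads from an in-memory filehandle, and it assumes the input begins with the separator and that the separator never appears in the middle of a message.

```perl
use strict;
use warnings;

# Hypothetical helper: split text into blocks that each start with $sep.
sub split_blocks {
    my ($text, $sep) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    local $/ = $sep;                  # restored when the sub returns
    my @blocks;
    while (my $block = <$fh>) {
        chomp $block;                 # strips the trailing separator, if present
        next unless $block =~ /\S/;   # skip the empty chunk before the first separator
        push @blocks, $sep . $block;  # put the separator back at the front
    }
    close $fh;
    return @blocks;
}

my $text = "2017 message 1\nText\n\n2017 message 2\nmore text\n";
my @blocks = split_blocks($text, '2017 message');
print scalar(@blocks), " blocks\n";   # prints "2 blocks"
```

Because `$/` is localised inside the sub, the rest of the program keeps reading line by line.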

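Alternatively, how about an array-of-arrays style structure (a hash of arrays here), where you update the "target key" each time you hit a message?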

#!/usr/bin/env perl
use strict;
use warnings; 

use Data::Dumper;

my %messages; 
my $message_id;
while ( <DATA> ) {
   chomp;
   if ( m/2017 message (\d+)/ ) { $message_id = $1 }; 
   push @{ $messages{$message_id} }, $_; 
}

print Dumper \%messages;

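Note - I have used a hash rather than an array, because it is more robust when the message numbers are not contiguous from zero. (And an array used this way would have an empty "zeroth" element.)

Note - it will also give you "empty" '' elements for your blank lines. You can filter these out if you want.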
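
One possible way to do that filtering, as a sketch (the helper name `strip_blank_lines` is hypothetical, and it treats whitespace-only lines as blank too):

```perl
use strict;
use warnings;

# Hypothetical helper: drop empty or whitespace-only lines from every message.
sub strip_blank_lines {
    my ($messages) = @_;
    my %filtered;
    for my $id (keys %$messages) {
        $filtered{$id} = [ grep { /\S/ } @{ $messages->{$id} } ];
    }
    return \%filtered;
}

my %messages = ( 1 => [ '2017 message 1', 'Text', '', 'Text', '' ] );
my $clean = strip_blank_lines(\%messages);
print join('|', @{ $clean->{1} }), "\n";    # prints "2017 message 1|Text|Text"
```

It returns a new hash and leaves the original untouched, in case the blank lines matter elsewhere.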

score 1 · Accepted Answer

All you need is a buffer that accumulates lines until you find a line matching /^20\d\d[ ]/ or reach the end of the file.

my $in = 0;
my @buf;
while (<>) {
   if ($in && /^20\d\d[ ]/) {
      process(@buf);
      @buf = ();
      $in = 0;
   }

   push @buf, $_ if $in ||= /^2017[ ]/;
}

process(@buf) if $in;

We can rearrange the code so that records are processed in only one place, which allows process to be inlined.

my $in = 0;
my @buf;
while (1) {
   $_ = <>;

   if ($in && (!defined($_) || /^20\d\d[ ]/)) {
      process(@buf);
      @buf = ();
      $in = 0;
   }

   last if !defined($_);

   push @buf, $_ if $in ||= /^2017[ ]/;
}
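
Neither snippet defines `process`. As a sketch of how the second version might be driven, here it is wrapped in a function that takes the filehandle and a callback; `each_record` and the sample text are my own assumptions, not part of the answer.

```perl
use strict;
use warnings;

# Hypothetical helper: call $process->(@record) once per record read from $fh.
# A record starts at a /^2017[ ]/ line and ends before the next /^20\d\d[ ]/ line.
sub each_record {
    my ($fh, $process) = @_;
    my $in = 0;
    my @buf;
    while (1) {
        my $line = <$fh>;
        # Flush the buffer at end of file or when another dated line starts.
        if ($in && (!defined($line) || $line =~ /^20\d\d[ ]/)) {
            $process->(@buf);
            @buf = ();
            $in = 0;
        }
        last if !defined($line);
        push @buf, $line if $in ||= $line =~ /^2017[ ]/;
    }
    return;
}

my $text = "2016 old\nskip\n2017 message 1\nText\n2017 message 2\nmore text\n";
open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
my @records;
each_record($fh, sub { push @records, [@_] });
print scalar(@records), " records\n";    # prints "2 records"
```

The 2016 lines are skipped because only a line starting with `2017 ` can open a record, while any `20\d\d ` line can close one.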
