regex - Perl regular expression for extracting multi-line blocks

Question

I have text like this:

00:00 stuff
00:01 more stuff
multi line
  and going
00:02 still 
    have

So there is no end-of-block marker, only the start of a new block.

I want to recursively get all of the blocks:

1 = 00:00 stuff
2 = 00:01 more stuff
multi line
  and going

etc.

The code below gives me only this:

$VAR1 = '00:00';
$VAR2 = '';
$VAR3 = '00:01';
$VAR4 = '';
$VAR5 = '00:02';
$VAR6 = '';

What am I doing wrong?

use Data::Dumper;

my $text = '00:00 stuff
00:01 more stuff
multi line
 and going
00:02 still 
have
    ';
my @array = $text =~ m/^([0-9]{2}:[0-9]{2})(.*?)/gms;
print Dumper(@array);

score 4 · Accepted Answer

Version 5.10.0 introduced named capture groups, which are useful for matching nontrivial patterns.

(?'NAME'pattern)
(?<NAME>pattern)

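A named capture group. Identical in every respect to normal capturing parentheses () but for the additional fact that the group can be referred to by name in various regular expression constructs (like \g{NAME}) and can be accessed by name after a successful match via %+ or %-. See perlvar for more details on the %+ and %- hashes.

If multiple distinct capture groups have the same name, then $+{NAME} will refer to the leftmost defined group in the match.

The forms (?'NAME'pattern) and (?<NAME>pattern) are equivalent.

Named capture groups allow us to name subpatterns within the regex, as below.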

use 5.10.0;  # named capture buffers

my $block_pattern = qr/
  (?<time>(?&_time)) (?&_sp) (?<desc>(?&_desc))

  (?(DEFINE)
    # timestamp at logical beginning-of-line
    (?<_time> (?m:^) [0-9][0-9]:[0-9][0-9])

    # runs of spaces or tabs
    (?<_sp> [ \t]+)

    # description is everything through the end of the record
    (?<_desc>
      # s switch makes . match newline too
      (?s: .+?)

      # terminate before optional whitespace (which we remove) followed
      # by either end-of-string or the start of another block
      (?= (?&_sp)? (?: $ | (?&_time)))
    )
  )
/x;

Use it like this:

my $text = '00:00 stuff
00:01 more stuff
multi line
 and going
00:02 still
have
    ';

while ($text =~ /$block_pattern/g) {
  print "time=[$+{time}]\n",
        "desc=[[[\n",
        $+{desc},
        "]]]\n\n";
}

Output:

$ ./blocks-demo
time=[00:00]
desc=[[[
stuff
]]]

time=[00:01]
desc=[[[
more stuff
multi line
 and going
]]]

time=[00:02]
desc=[[[
still
have
]]]
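
To see the named-subpattern mechanism on its own, here is a much smaller sketch. The names $clock, _two and %seen and the sample string are made up for illustration; only (?<NAME>...), (?&NAME), (?(DEFINE)...) and %+ come from Perl itself. The DEFINE group is never matched directly; it only holds a subpattern that the main pattern calls by name.

```perl
use 5.010;    # named captures, (?&NAME) and (?(DEFINE)...) need 5.10 or later
use strict;
use warnings;

my $clock = qr/
  (?<hour>(?&_two)) : (?<minute>(?&_two))

  (?(DEFINE)
    # hypothetical helper subpattern: exactly two digits
    (?<_two> [0-9][0-9])
  )
/x;

my %seen;
if ("at 09:41 sharp" =~ $clock) {
  # %+ exposes the named groups of the last successful match
  %seen = (hour => $+{hour}, minute => $+{minute});
  say "hour=$seen{hour} minute=$seen{minute}";
}
```

The captures are placed around the (?&_two) calls rather than inside _two, the same layout the block pattern above uses for time and desc.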

score 3 · Accepted Answer

This should do the trick. The beginning of the next \d\d:\d\d is treated as the end of the block.

use strict;

my $Str = '00:00 stuff
00:01 more stuff
multi line
  and going
00:02 still 
    have
00:03 still 
    have' ;

my @Blocks = ($Str =~ m#(\d\d:\d\d.+?(?:(?=\d\d:\d\d)|$))#gs);

print join "--\n", @Blocks;
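
One thing to watch for: the look-ahead accepts a \d\d:\d\d anywhere, so a time mentioned in the middle of a description also ends a block. The sketch below is a possible variation, not part of the answer above. It assumes every real block starts at the beginning of a line, and the sample string is invented to show the difference.

```perl
use strict;
use warnings;

# Hypothetical sample: the second block mentions a time in the middle of a line.
my $str = "00:00 stuff\n00:01 meet at 12:30\nmulti line\n00:02 still\n    have";

# Look-ahead for any \d\d:\d\d, as in the pattern above.
my @loose = ($str =~ m#(\d\d:\d\d.+?(?:(?=\d\d:\d\d)|$))#gs);

# Variation: /m lets ^ match at the start of every line, so only a
# timestamp at the start of a line opens a new block. \z is the very
# end of the string and is not affected by /m.
my @anchored = ($str =~ m#^(\d\d:\d\d.*?)(?=^\d\d:\d\d|\z)#gms);

print scalar(@loose), " blocks without anchoring\n";
print scalar(@anchored), " blocks with anchoring\n";
print join("--\n", @anchored), "\n";
```

With this sample, the loose pattern splits the second block at 12:30 and reports four blocks, while the anchored variation reports three.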

score 0 · Accepted Answer

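Your problem is that .*? is non-greedy in the same way that .* is greedy. When nothing forces it to match more, it matches as little as possible, which in this case is the empty string.

So you need something after the non-greedy match to anchor your capture. I came up with this regex: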

my @array = $text =~ m/\n?([0-9]{2}:[0-9]{2}.*?)(?=\n[0-9]{2}:|$)/gs;

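As you can see, I removed the /m option so that $ in the look-ahead assertion matches only at the end of the string rather than at the end of every line.

You might also consider this solution: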
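
To make both points concrete, here is a small comparison that can be run as is. The array names and the sample string are only for illustration. The first pattern is the one from the question, the second adds the look-ahead but keeps /m, and the third is the pattern shown above.

```perl
use strict;
use warnings;

my $text = "00:00 stuff\n00:01 more stuff\nmulti line\n and going\n00:02 still\nhave\n";

# Pattern from the question: nothing follows .*?, so the empty
# string already satisfies it.
my @lazy = $text =~ m/^([0-9]{2}:[0-9]{2})(.*?)/gms;

# Look-ahead added but /m kept: $ also matches before every newline,
# so each capture stops at the end of its first line.
my @with_m = $text =~ m/^([0-9]{2}:[0-9]{2}.*?)(?=\n[0-9]{2}:|$)/gms;

# Without /m: $ matches only at the end of the string (or just
# before a trailing newline), so whole blocks are captured.
my @without_m = $text =~ m/\n?([0-9]{2}:[0-9]{2}.*?)(?=\n[0-9]{2}:|$)/gs;

print "lazy:      [", join("|", @lazy), "]\n";
print "with /m:   [", join("|", @with_m), "]\n";
print "without m: [", join("|", @without_m), "]\n";
```

The second array keeps only the first line of each block, which is why /m has to go once $ is used as the end marker.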

my @array = split /(?=[0-9]{2}:[0-9]{2})/, $text;
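
If you go the split route, each chunk still starts with its timestamp, so a second small match can separate the time from the description. The following is one way that might look. It assumes timestamps only start a block at the beginning of a line, and the names @chunks and @records are made up for this sketch.

```perl
use strict;
use warnings;

my $text = "00:00 stuff\n00:01 more stuff\nmulti line\n and going\n00:02 still\nhave\n";

# Split at the zero-width position in front of each timestamp that
# begins a line; nothing is consumed, so no text is lost.
my @chunks = split /^(?=[0-9]{2}:[0-9]{2})/m, $text;

my @records;
for my $chunk (@chunks) {
  # time, then spaces or tabs, then the description without trailing whitespace
  my ($time, $desc) = $chunk =~ /\A([0-9]{2}:[0-9]{2})[ \t]+(.*?)\s*\z/s
    or next;    # skip anything that does not look like a block
  push @records, { time => $time, desc => $desc };
}

print "$_->{time} => [$_->{desc}]\n" for @records;
```

A zero-length match at the very start of the string does not produce an empty leading field, so the first chunk is the 00:00 block.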
