regex - Why do I get different backtracking with these Raku regexes?

Question

I'm getting unexpected backtracking from the + quantifier in a Raku regex.

In this regex:

'abc' ~~ m/(\w+) {say $0}  <?{ $0.substr(*-1) eq 'b' }>/;

say $0;

I get the expected result:

｢abc｣  # inner say
｢ab｣   # inner say

｢ab｣   # final say

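That is, the (greedy) + quantifier grabs all the letters and then the condition fails. After that, it starts backtracking by giving up the last letter it matched, until the condition evaluates to true.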
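The backtracking order described above can be recorded instead of printed. This is a minimal sketch that reuses the regex from the question; the `@attempts` array is a hypothetical name introduced only for illustration, and the expected values in the comments are taken from the output reported in the question.

```raku
# @attempts (hypothetical name) records what $0 holds each time the
# regex engine reaches the code block, i.e. once per attempt.
my @attempts;

'abc' ~~ m/ (\w+) { @attempts.push(~$0) } <?{ $0.substr(*-1) eq 'b' }> /;

say @attempts;   # per the question's output: [abc ab]
say ~$0;         # per the question's output: ab
```

Each backtracking step shortens what `(\w+)` matched by one character, so `$0` is re-captured as a whole on every attempt.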

However, backtracking does not seem to work the same way when I put the quantifier outside the capture group:

'abc' ~~ m/[(\w)]+ {say $0}  <?{ $0.tail eq 'b' }>/;

say $0;

Result:

[｢a｣ ｢b｣ ｢c｣]  # inner say
[｢a｣ ｢b｣ ｢c｣]  # why this extra inner say? Shouldn't this backtrack to [｢a｣ ｢b｣]?
[｢a｣ ｢b｣ ｢c｣]  # why this extra inner say? Shouldn't this backtrack to [｢a｣ ｢b｣]?
[｢b｣ ｢c｣]      # Since we could not successfully backtrack, we go on matching by increasing the position
[｢b｣ ｢c｣]      # Previous conditional fails. We get this extra inner say
[｢c｣]          # Since we could not successfully backtrack, we go on matching by increasing the position

Nil            # final say, no match because we could not find a final 'b'

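Is this behaviour expected? If so, why do the two regexes work differently? Is it possible to mimic the first regex while still keeping the quantifier outside the capture group?

NOTE:

Using the lazy quantifier "solves" the problem. This is expected, because the difference seems to arise during backtracking, and the lazy quantifier does not backtrack here.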

'abc' ~~ m/[(\w)]+? {say $0}  <?{ $0.tail eq 'b' }>/;

[｢a｣]
[｢a｣ ｢b｣]

[｢a｣ ｢b｣]

However, for performance reasons I would rather use a greedy quantifier (the example in this question is a simplification).

score 7 · Accepted Answer

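I don't think the problem is with backtracking. Rather, it looks like the intermediate $0 that is exposed retains the captures from the previous iteration. Consider this expression: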

'abc' ~~ m/[(\w)]+ {say "Match:",$/.Str,";\tCapture:",$0}  <?{ False }>/;

This is the output:

Match:abc;  Capture:[｢a｣ ｢b｣ ｢c｣]
Match:ab;   Capture:[｢a｣ ｢b｣ ｢c｣]
Match:a;    Capture:[｢a｣ ｢b｣ ｢c｣]
Match:bc;   Capture:[｢b｣ ｢c｣]
Match:b;    Capture:[｢b｣ ｢c｣]
Match:c;    Capture:[｢c｣]

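As you can see, the matches come in the proper order: abc, ab, a, and so on. But the capture array for the ab match is also [｢a｣ ｢b｣ ｢c｣]. I suspect this is a bug.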
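To check whether a given Rakudo build shows this behaviour, the comparison can be automated. The sketch below is an assumption-laden diagnostic, not an official test: `@stale` is a hypothetical name, and the check relies on the fact that in this particular regex every capture is exactly one character, so the number of captures should equal the length of the text matched so far.

```raku
# @stale (hypothetical name) collects every attempt at which the number of
# captures in $0 differs from the number of characters matched so far.
# The always-false assertion forces the engine through every backtracking step.
my @stale;

'abc' ~~ m/ [ (\w) ]+
            { @stale.push($/.Str) if $0.elems != $/.Str.chars }
            <?{ False }>
          /;

say @stale
    ?? "captures lag behind the match for: @stale[]"
    !! "captures track the match on this Rakudo";
```

With the output shown above, the mismatching attempts would be ab, a and b; a build without this behaviour should report none.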

For your case, there are a couple of approaches.

Just use $/ for the condition check:

'abc' ~~ m/[(\w)]+  <?{ $/.Str.substr(*-1) eq 'b' }>/;

Or, additionally capture the quantified group as well:
```
'abc' ~~ m/([(\w)]+) <?{ $0[0][*-1] eq 'b' }>/;
```
Here $0 matches the outer group, $0[0] matches the first inner group, and $0[0][*-1] matches the last character matched in this iteration.
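As a quick way to confirm that the first approach keeps the quantifier outside the capture group and still stops at the right place, the match result can be inspected directly. This is a minimal sketch: `$m` is a hypothetical variable name, and `ends-with` is swapped in for the `.substr(*-1) eq 'b'` test used above (it performs the same check on the text matched so far).

```raku
# The condition looks only at the text matched so far ($/.Str),
# so it does not depend on the contents of the $0 capture array.
my $m = 'abc' ~~ m/ [ (\w) ]+ <?{ $/.Str.ends-with('b') }> /;

say $m.Str;   # ab
```

Because the check relies on the matched text rather than on `$0`, the greedy quantifier can backtrack from abc to ab without the condition ever seeing stale captures.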
