regex - 这个递归正则表达式究竟是如何工作的？

Question

这是对这个问题的跟进。

看看这个模式：

(o(?1)?o)

它匹配任何o长度为 2 ⁿ且 n ≥ 1的序列。
它可以工作，请参阅 regex101.com（添加单词边界以便更好地演示）。
问题是：为什么？

在下文中，字符串的描述（匹配与否）将只是一个粗体数字或描述长度的粗体术语，如2 ⁿ。

分解（添加空格）：

( o (?1)? o )
(           ) # Capture group 1
  o       o   # Matches an o each at the start and the end of the group
              # -> the pattern matches from the outside to the inside.
    (?1)?     # Again the regex of group 1, or nothing.
              # -> Again one 'o' at the start and one at the end. Or nothing.

我不明白为什么这不匹配2n，而是2 ⁿ，因为我会将模式描述为 *an undefined number of o o，相互堆叠。

可视化：

没有递归，2是匹配的：

oo

一次递归，4是匹配：

o  o
 oo

到目前为止，很容易。

两次递归。显然是错误的，因为模式与6不匹配：

o    o
 o  o
  oo

但为什么？它似乎符合模式。

我得出的结论是，这不仅仅是重复的普通模式，否则6将不得不匹配。

但根据regular-expressions.info：

(?P<name>[abc])(?1)(?P>name)像 dos 一样匹配三个字母(?P<name>[abc])[abc][abc]。

和

[abc])(?1){3}[...] 相当于([abc])[abc]{3}

所以它似乎只是简单地重新匹配正则表达式代码，而没有关于捕获组的先前匹配的信息。

有人可以解释并想象为什么这个模式匹配2 ⁿ而没有别的吗？

编辑：

评论中提到过：

我怀疑在自身内部引用捕获组实际上是受支持的情况。

regular-expressions.info 确实提到了该技术：

如果您在它调用的组内发出呼叫，您将拥有一个递归捕获组。

score 4 · Accepted Answer

您正确理解递归。单词边界在这里让你感到困惑。围绕模式要求正\b则表达式引擎仅在字符串前后没有单词字符时才匹配该字符串。

看看递归是如何进行的：

( o      (?1)?         o )  => oo

(?1)然后替换为(o(?1)?o)：

( o   (?>o(?1)?o)?     o )  => oo or oooo

再说一遍：

(o (?>o(?>o(?1)?o)?o)?  o) => oo, oooo, oooooo

请参阅没有单词边界的正则表达式演示。

为什么要(?>...)在上面的示例中添加？ PHP 递归正则表达式中的每个递归级别都是原子的，与 Perl 不同，一旦前一个级别失败，引擎就不会返回到下一个级别。

添加单词边界时，第一个o和最后一个o匹配的字符之前/之后不能有任何其他单词字符。所以，那就ooo 不匹配了。

请参阅逐步解释的递归正则表达式和Word Boundary:\b at rexegg.com。

为什么不oooooo作为一个整体匹配，而是作为ooooand oo？

同样，每个递归级别都是原子的。oooooo是这样匹配的：

(o(?1)?o)匹配第一个o
(?1)?得到扩展，模式现在是，它与输入(o(?>o(?1)?o)?o)中的第二个匹配o
它一直持续(o(?>o(?>o(?>o(?>o(?>o(?>o(?1)?o)?o)?o)?o)?o)?o)?o)到不再与输入匹配，回溯发生，我们进入第 6 级，
整个第 6 个递归级别也失败了，因为它无法匹配必要数量的os
这一直持续到可以匹配必要数量的os 的水平。

请参阅正则表达式调试器：

score 2 · Accepted Answer

这或多或少是对 Wiktors 答案的跟进——即使在删除了单词边界之后，我也很难弄清楚为什么oooooo(6) 匹配为ooooand oo，而ooooooo(7) 匹配为oooooo。

以下是它的详细工作原理：

当扩展递归模式时，内部递归是原子的。使用我们的模式，我们可以将其展开为

(?>o(?>o(?>o(?>o(?>oo)?o)?o)?o)?o)

（在实际模式中，这会再次展开，但这不会改变解释）

以下是字符串的匹配方式 - 首先oooooo(6)

(?>o(?>o(?>o(?>o(?>oo)?o)?o)?o)?o)
o   |ooooo                          <- first o gets matched by first atomic group
o   o   |oooo                       <- second o accordingly
o   o   o   |ooo                    <- third o accordingly
o   o   o   o   |oo                 <- fourth o accordingly
o   o   o   o   oo|                 <- fifth/sixth o by the innermost atomic group
                     ^              <- there is no more o to match, so backtracking starts - innermost ag is not matched, cursor positioned after 4th character
o   o   o   o   xx   o   |o         <- fifth o matches, fourth ag is successfully matched (thus no backtracking into it)
o   o   o   o   xx   o   o|         <- sixth o matches, third ag is successfully matched (thus no backtracking into it)
                           ^        <- no more o, backtracking again - third ag can't be backtracked in, so backtracking into second ag (with matching 3rd 0 times)
o   o                      |oo<oo   <- third and fourth o close second and first atomic group -> match returned  (4 os)

现在ooooooo(7)

(?>o(?>o(?>o(?>o(?>oo)?o)?o)?o)?o)    
o   |oooooo                         <- first o gets matched by first atomic group
o   o   |ooooo                      <- second o accordingly
o   o   o   |oooo                   <- third o accordingly
o   o   o   o   |ooo                <- fourth o accordingly
o   o   o   o   oo|o                <- fifth/sixth o by the innermost atomic group
o   o   o   o   oo  o|              <- fourth ag is matched successfully (thus no backtracking into it)
                         ^          <- no more o, so backtracking starts here, no backtracking into fourth ag, try again 3rd
o   o   o                |ooo<o     <- 3rd ag can be closed, as well as second and first -> match returned (6 os)

regex - 这个递归正则表达式究竟是如何工作的？

2 回答 2

Related

Reference