.net - 将 PCRE 递归正则表达式模式转换为 .NET 平衡组定义

Question

PCRE 有一个叫做递归模式的特性，可以用来匹配嵌套的子组。例如，考虑“语法”

Q -> \w | '[' A ';' Q* ','? Q* ']' | '<' A '>'
A -> (Q | ',')*
// to match ^A$.

它可以在 PCRE 中使用模式完成

^((?:,|(\w|\[(?1);(?2)*,?(?2)*\]|<(?1)>))*)$

（示例测试用例：http ://www.ideone.com/L4lHE ）

应该匹配：

abcdefg abc,def,ghi abc,,,def ,,,,,, [abc;] [a,bc;] sss[abc;d] as[abc;d,e] [abc;d,e][fgh;j,k] <abc> [<a>b;<c,d>,<e,f>] <a,b,c> <a,bb,c> <,,,> <> <><> <>,<> a<<<<>>><a>> <<<<<>>>><><<<>>>> <z>[a;b] <z[a;b]> [[;];] [,;,] [;[;]] [<[;]>;<[;][;,<[;,]>]>]

不应该匹配：

<a bc> <abc<de> [a<b;c>;d,e] [a] <<<<<>>>><><<<>>>>> <<<<<>>>><><<<>>> [abc;def;] [[;],] [;,,] [abc;d,e,f] [<[;]>;<[;][;,<[;,]>]]> <z[a;b>]

.NET 中没有递归模式。相反，它为基于堆栈的操作提供平衡组，以匹配简单的嵌套模式。

是否可以将上述 PCRE 模式转换为 .NET Regex 样式？

_{（是的，我知道最好不要为此使用正则表达式。这只是一个理论问题。）}

参考

score 13 · Accepted Answer

递归模式的 .Net 替代方案是堆栈。这里的挑战是我们需要用堆栈来表达它的语法。
这是这样做的一种方法：

语法略有不同的符号

首先，我们需要语法规则（比如问题中的A和Q）。
我们有一个堆栈。堆栈只能包含规则。
在每一步，我们从堆栈中弹出当前状态，匹配我们需要匹配的内容，并将进一步的规则推入堆栈。当我们完成一个状态时，我们不会推送任何东西并返回到之前的状态。

这种表示法介于CFG和Pushdown automaton之间，我们将规则推送到堆栈中。

例子：

让我们从一个简单的例子开始： a ⁿ b ⁿ。这种语言的常用语法是：

S -> aSb | ε

我们可以改写它以适应符号：

# Start with <push S>
<pop S> -> "a" <push B> <push S> | ε
<pop B> -> "b"

用一句话来说：

我们从S堆栈开始。
当我们从堆栈中弹出S时，我们可以：
- 什么都不匹配，或者...
- 匹配“a”，但我们必须将状态推B送到堆栈。这是我们将匹配“b”的承诺。接下来，我们还推动S，以便我们可以根据需要继续匹配“a”。
当我们匹配了足够多的“a”时，我们开始B从堆栈中弹出 s，并为每个匹配一个“b”。
完成后，我们匹配了偶数个“a”和“b”。

或者，更松散地说：

当我们在案例 S 中时，匹配“a”并按下 B，然后按下 S，或者什么都不匹配。
当我们在案例 B 中时，匹配“b”。

在所有情况下，我们都会从堆栈中弹出当前状态。

用 .Net 正则表达式编写模式

我们需要以某种方式代表不同的状态。我们不能选择'1' '2' '3'或'a' 'b' 'c'，因为这些在我们的输入字符串中可能不可用 - 我们只能匹配字符串中存在的内容。
一种选择是对我们的状态进行编号（在上面的示例中，S状态编号为 0，B状态为 1）。
对于状态 S我们可以将字符推入堆栈。为方便起见，我们将从字符串的开头推送第一个字符。同样，我们不在乎这些字符是什么，只关心有多少。

推送状态

在 .Net 中，如果我们想将字符串的前 5 个字符压入堆栈，我们可以这样写：

(?<=(?=(?<StateId>.{5}))\A.*)

这有点令人费解：

(?<=…\A.*)是一直到字符串开头的lookbehind。
当我们开始时，向前看：(?=…). 这样做是为了让我们可以匹配超出当前位置 - 如果我们在位置2，我们前面没有 5 个字符。因此，我们正在回顾并展望未来。
(?<StateId>.{5})将 5 个字符压入堆栈，表示在某些时候我们需要匹配状态 5。

流行状态

根据我们的符号，我们总是从堆栈中弹出顶部状态。这很容易：(?<-StateId>).
但在我们这样做之前，我们想知道那是什么状态——或者它捕获了多少个字符。更具体地说，我们需要明确检查每个状态，例如switch/case块。因此，要检查堆栈当前是否保持状态 5：

(?<=(?=.{5}(?<=\A\k<StateId>))\A.*)

再一次，(?<=…\A.*)一直到开始。
现在我们前进(?=.{5}…)五个字符...
并使用另一个lookbehind(?<=\A\k<StateId>)来检查堆栈是否真的有5 个字符。

这有一个明显的缺点——当字符串太短时，我们无法表示大状态的数量。这个问题有解决方案：

无论如何，语言中的短词数量是最终的，因此我们可以手动添加它们。
使用比单个堆栈更复杂的东西——我们可以使用多个堆栈，每个堆栈都有零个或一个字符，有效地代表我们的状态（最后有一个示例）。

结果

我们的 a ⁿ b ⁿ模式如下所示：

\A
# Push State A, Index = 0
(?<StateId>)
(?:
    (?:
        (?:
            # When In State A, Index = 0
            (?<=(?=.{0}(?<=\A\k<StateId>))\A.*)
            (?<-StateId>)
            (?:
                # Push State B, Index = 1
                (?<=(?=(?<StateId>.{1}))\A.*)
                a
                # Push State A, Index = 0
                (?<StateId>)
                |

            )
        )
        |
        (?:
            # When In State B, Index = 1
            (?<=(?=.{1}(?<=\A\k<StateId>))\A.*)
            (?<-StateId>)
            b
        )
        |\Z
    ){2}
)+
\Z
# Assert state stack is empty
(?(StateId)(?!))

Regex Storm 上的工作示例

笔记：

请注意，状态周围的量词是(?:(?:…){2})+- 即（状态计数）×（输入长度）。我还为\Z. 让我们不要深入探讨，但它是 .Net 引擎中烦人的优化的一种解决方法。
同样可以写成(?<A>a)+(?<-A>b)+(?(A)(?!))——这只是一个练习。

回答问题

问题的语法可以重写为：

# Start with <push A>
<pop A> -> <push A> ( @"," | <push Q> ) | ε
<pop Q> -> \w
           | "<" <push Q2Close> <push A>
           | "[" <push Q1Close> <push QStar> <push Q1Comma> <push QStar> <push Q1Semicolon> <push A>
<pop Q2Close> -> ">"
<pop QStar> -> <push QStar> <push Q> | ε 
<pop Q1Comma> -> ","?
<pop Q1Semicolon> -> ";"
<pop Q1Close> -> "]"

图案：

\A
# Push State A, Index = 0
(?<StateId>)
(?:
    (?:
        (?:
            # When In State A, Index = 0
            (?<=(?=.{0}(?<=\A\k<StateId>))\A.*)
            (?<-StateId>)
            (?:
                # Push State A, Index = 0
                (?<StateId>)
                (?:
                    ,
                    |
                    # Push State Q, Index = 1
                    (?<=(?=(?<StateId>.{1}))\A.*)
                )
                |

            )
        )
        |
        (?:
            # When In State Q, Index = 1
            (?<=(?=.{1}(?<=\A\k<StateId>))\A.*)
            (?<-StateId>)
            (?:
                \w
                |
                <
                # Push State Q2Close, Index = 2
                (?<=(?=(?<StateId>.{2}))\A.*)
                # Push State A, Index = 0
                (?<StateId>)
                |
                \[
                # Push State Q1Close, Index = 6
                (?<=(?=(?<StateId>.{6}))\A.*)
                # Push State QStar, Index = 3
                (?<=(?=(?<StateId>.{3}))\A.*)
                # Push State Q1Comma, Index = 4
                (?<=(?=(?<StateId>.{4}))\A.*)
                # Push State QStar, Index = 3
                (?<=(?=(?<StateId>.{3}))\A.*)
                # Push State Q1Semicolon, Index = 5
                (?<=(?=(?<StateId>.{5}))\A.*)
                # Push State A, Index = 0
                (?<StateId>)
            )
        )
        |
        (?:
            # When In State Q2Close, Index = 2
            (?<=(?=.{2}(?<=\A\k<StateId>))\A.*)
            (?<-StateId>)
            >
        )
        |
        (?:
            # When In State QStar, Index = 3
            (?<=(?=.{3}(?<=\A\k<StateId>))\A.*)
            (?<-StateId>)
            (?:
                # Push State QStar, Index = 3
                (?<=(?=(?<StateId>.{3}))\A.*)
                # Push State Q, Index = 1
                (?<=(?=(?<StateId>.{1}))\A.*)
                |

            )
        )
        |
        (?:
            # When In State Q1Comma, Index = 4
            (?<=(?=.{4}(?<=\A\k<StateId>))\A.*)
            (?<-StateId>)
            ,?
        )
        |
        (?:
            # When In State Q1Semicolon, Index = 5
            (?<=(?=.{5}(?<=\A\k<StateId>))\A.*)
            (?<-StateId>)
            ;
        )
        |
        (?:
            # When In State Q1Close, Index = 6
            (?<=(?=.{6}(?<=\A\k<StateId>))\A.*)
            (?<-StateId>)
            \]
        )
        |\Z
    ){7}
)+
\Z
# Assert state stack is empty
(?(StateId)(?!))

^{可悲的是，它太长了，无法放入一个 url，所以没有在线示例。}

如果我们使用一个或零个字符的“二进制”堆栈，它看起来像这样：https ://gist.github.com/kobi/8012361

这是通过所有测试的模式的屏幕截图：http: //i.stack.imgur.com/IW2xr.png

奖金

.Net 引擎可以做的不仅仅是平衡 - 它还可以捕获Aor的每个实例Q。这需要对模式稍作修改：https ://gist.github.com/kobi/8156968 。
请注意，我们已将组Start和A和添加Q到模式中，但它们对流没有影响，它们纯粹用于捕获。

结果：例如，对于 string "[<a>b;<c,d>,<e,f>]"，我们可以得到这些Captures：

Group A
    (0-17) [<a>b;<c,d>,<e,f>]
    (1-4) <a>b
    (2-2) a
    (7-9) c,d
    (13-15) e,f
Group Q
    (0-17) [<a>b;<c,d>,<e,f>]
    (1-3) <a>
    (2-2) a
    (4-4) b
    (6-10) <c,d>
    (7-7) c
    (9-9) d
    (12-16) <e,f>
    (13-13) e
    (15-15) f

开放式问题

每个语法都可以转换为状态堆栈表示法吗？
（状态计数）×（输入长度）是否足以匹配所有单词？还有什么公式可以起作用？

源代码

用于生成这些模式和所有测试用例的代码可以在https://github.com/kobi/RecreationalRegex上找到

score 10 · Accepted Answer

The answer is _^(probably) Yes.

The technique is much more complex than the (?1) recursive call, but the result is almost 1-to-1 with the rules of the grammar - I worked in a such methodical way I can easily see it scripted. Basically, you match block-by-block, and use the stack to keep track of where you are. This is an almost working solution:

^(?:
    (\w(?<Q>)) # Q1
    |
    (<(?<Angle>))  #Q2 - start <
    |
    (\>(?<-Angle>)(?<-A>)?(?<Q>))  #Q2 - end >, match Q
    |
    (\[(?<Block>))  # Q3 start - [
    |
    (;(?<Semi-Block>)(?<-A>)?)  #Q3 - ; after [
    |
    (\](?<-Semi>)(?<-Q>)*(?<Q>))  #Q3 ] after ;, match Q
    |
    ((,|(?<-Q>))*(?<A>))   #Match an A group
)*$
# Post Conditions
(?(Angle)(?!))
(?(Block)(?!))
(?(Semi)(?!))

It is missing the part of allowing commas in Q->[A;Q*,?Q*], and for some reason allows [A;A], so it matches [;,,] and [abc;d,e,f]. Rest of the strings match the same as the test cases.
Another minor point is an issue with pushing to the stack with an empty capture - it doesn't. A accepts Ø, so I had to use (?<-A>)? to check if it captured.

The whole regex should look like this, but again, it is useless with the bug there.

Why it isn't working?

There is not way of synchronizing the stacks: if I push (?<A>) and (?<B>), I can pop them in any order. That is why this pattern cannot differentiate <z[a;b>] from <z[a;b]>... we need one stack for all.
This can be solved for simple cases, but here we have something much more complicate - A whole Q or A, not just "<" or "[".