python - How can I handle nested brackets with a regular expression?

Question

I've come up with a regex string that parses the given text into 3 categories:

inside parentheses
inside square brackets
neither.

Like this:

\[.+?\]|\(.+?\)|[\w+ ?]+

My intention is to split on the outermost delimiters only. So, given a(b[c]d)e, the split should be:

a || (b[c]d) || e

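It works fine with parentheses inside square brackets, or square brackets inside parentheses, but it breaks down when square brackets are nested inside square brackets or parentheses inside parentheses. For example, a[b[c]d]e is split as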

a || [b[c] || d || ] || e.

Is there any way to handle this with regex alone, without resorting to code that counts the open/close brackets? Thanks!

score 10 · Accepted Answer

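Standard¹ regular expressions are not powerful enough to match arbitrarily nested structures like this. The best approach is probably to traverse the string and keep track of the opening/closing bracket pairs.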

¹ I said standard, but not all regular expression engines are standard. In Perl, for instance, you can do this with a recursive regular expression. For example:

$str = "[hello [world]] abc [123] [xyz jkl]";

my @matches = $str =~ /[^\[\]\s]+ | \[ (?: (?R) | [^\[\]]+ )+ \] /gx;

foreach (@matches) {
    print "$_\n";
}

Output:

[hello [world]]
abc
[123]
[xyz jkl]

EDIT: I see you're using Python; check out pyparsing.
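The traversal approach mentioned at the start of this answer could look roughly like the following in Python. This is only a minimal sketch: the function name `split_top_level` and the choice to raise `ValueError` on unbalanced input are assumptions made for illustration, not something the question asked for.

```python
def split_top_level(text):
    """Split text into top-level chunks: bracketed groups and the plain runs between them."""
    pairs = {"(": ")", "[": "]"}
    closers = set(pairs.values())
    chunks = []
    stack = []   # closers we still expect, innermost last
    start = 0    # index where the current chunk begins
    for i, ch in enumerate(text):
        if ch in pairs:
            # An opener at depth 0 ends the plain run before it, if any.
            if not stack and i > start:
                chunks.append(text[start:i])
                start = i
            stack.append(pairs[ch])
        elif ch in closers:
            # A closer must match the most recently opened bracket.
            if not stack or stack.pop() != ch:
                raise ValueError(f"unbalanced {ch!r} at index {i}")
            # Back at depth 0: the bracketed group is complete.
            if not stack:
                chunks.append(text[start:i + 1])
                start = i + 1
    if stack:
        raise ValueError("unclosed bracket")
    if start < len(text):
        chunks.append(text[start:])
    return chunks


print(split_top_level("a(b[c]d)e"))   # ['a', '(b[c]d)', 'e']
print(split_top_level("a[b[c]d]e"))   # ['a', '[b[c]d]', 'e']
```

Using a stack of expected closers rather than a single counter also catches mismatched pairs such as `(]`, at the cost of a few extra lines.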

score 1 · Accepted Answer

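Well, once you abandon the idea that parsing nested expressions has to work at unlimited depth, you can use regular expressions just fine by specifying a maximum depth in advance. Here's how: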

import re

def nested_matcher(n):
    # poor man's matched paren scanning, gives up after n+1 levels.
    # Matches any string with balanced parens or brackets inside; add
    # the outer parens yourself if needed.  Nongreedy.  Does not
    # distinguish parens and brackets as that would cause the
    # expression to grow exponentially rather than linearly in size.
    return "[^][()]*?(?:[([]"*n+"[^][()]*?"+"[])][^][()]*?)*?"*n

# Either a run of non-bracket characters, or one bracketed group
# whose contents may nest up to 10 further levels.
p = re.compile('[^][()]+|[([]' + nested_matcher(10) + '[])]')
print(p.findall('a(b[c]d)e'))
print(p.findall('a[b[c]d]e'))
print(p.findall('[hello [world]] abc [123] [xyz jkl]'))

This will output

['a', '(b[c]d)', 'e']
['a', '[b[c]d]', 'e']
['[hello [world]]', ' abc ', '[123]', ' ', '[xyz jkl]']
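To see the depth limit in action, here is a small sketch that builds the same pattern for two different values of n and runs both on a string nested three levels deep. `make_pattern` is a hypothetical helper name introduced only for this illustration, and `nested_matcher` is repeated so the snippet runs on its own.

```python
import re

def nested_matcher(n):
    # Same construction as above, repeated so this snippet is self-contained.
    return "[^][()]*?(?:[([]" * n + "[^][()]*?" + "[])][^][()]*?)*?" * n

def make_pattern(n):
    # Hypothetical helper: wraps nested_matcher(n) in the outer brackets.
    return re.compile("[^][()]+|[([]" + nested_matcher(n) + "[])]")

text = "a[b[c[d]]]e"  # three levels of nesting

deep = make_pattern(2).findall(text)     # n=2 covers three levels in total
shallow = make_pattern(1).findall(text)  # n=1 covers only two levels

print(deep)     # ['a', '[b[c[d]]]', 'e']
print(shallow)  # the outermost group is no longer returned in one piece
```

So the maximum depth has to be chosen with the deepest expected input in mind; since the pattern grows linearly with n, a generous value is affordable.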
