regex - Regular expression: text after the last open parenthesis

Question

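I know a little about RegEx, but at the moment this is far beyond my abilities.

I need help finding the text/expression immediately after the last opening parenthesis that has no matching closing parenthesis.

It is for the CallTip of an open-source software in development (Object Pascal).

Below are some examples: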

------------------------------------
Text                  I need
------------------------------------
aaa(xxx               xxx
aaa(xxx,              xxx
aaa(xxx, yyy          xxx
aaa(y=bbb(xxx)        y=bbb(xxx)
aaa(y <- bbb(xxx)     y <- bbb(xxx)
aaa(bbb(ccc(xxx       xxx
aaa(bbb(x), ccc(xxx   xxx
aaa(bbb(x), ccc(x)    bbb(x)
aaa(bbb(x), ccc(x),   bbb(x)
aaa(?, bbb(??         ??
aaa(bbb(x), ccc(x))   ''
aaa(x)                ''
aaa(bbb(              ''
------------------------------------

For all of the text above, the RegEx proposed by @Bohemian
(?<=\()(?=([^()]*\([^()]*\))*[^()]*$).*?(?=[ ,]|$)(?! <-)(?<! <-)
matches all cases.

For the cases below (which I found when implementing the RegEx in the software), it does not:
------------------------------------
New text              I need
------------------------------------
aaa(bbb(x, y)         bbb(x, y)
aaa(bbb(x, y, z)      bbb(x, y, z)
------------------------------------

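Is it possible to write a regular expression (PCRE) for these cases?

In a previous post (Regular expression: word before the last opening parenthesis), Alan Moore (many thanks again) helped me find the text before the last opening parenthesis with the regex below: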

\w+(?=\((?:[^()]*\([^()]*\))*[^()]*$)

However, I was unable to make the appropriate adjustment to match what comes immediately after it.

Can anyone help?

score 6 · Accepted Answer

This is similar to this question. And since you are using PCRE, there is actually a solution using the recursion syntax.

/
(?(DEFINE)                # define a named capture for later convenience
  (?P<parenthesized>      # define the group "parenthesized" which matches a
                          # substring which contains correctly nested
                          # parentheses (it does not have to be enclosed in
                          # parentheses though)
    [^()]*                # match arbitrarily many non-parenthesis characters
    (?:                   # start non capturing group
      [(]                 # match a literal opening (
      (?P>parenthesized)  # recursively call this "parenthesized" subpattern
                          # i.e. make sure that the contents of these literal ()
                          # are also correctly parenthesized
      [)]                 # match a literal closing )
      [^()]*              # match more non-parenthesis characters
    )*                    # repeat
  )                       # end of "parenthesized" pattern
)                         # end of DEFINE sequence

# Now the actual pattern begins

(?<=[(])                  # ensure that there is a literal ( left of the start
                          # of the match
(?P>parenthesized)?       # match correctly parenthesized substring
$                         # ensure that we've reached the end of the input
/x                        # activate free-spacing mode

The gist of this pattern is obviously the parenthesized subpattern. I should maybe elaborate a bit on that. Its structure is this:

(normal* (?:special normal*)*)

where normal is [^()] and special is [(](?P>parenthesized)[)]. This technique is called "unrolling the loop". It is used to match anything that has the structure

nnnsnnsnnnnsnnsnn

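where n is matched by normal and s is matched by special.

In this particular case, things are a bit more complicated, because we are also using recursion. (?P>parenthesized) recursively uses the parenthesized pattern (which it is itself a part of). You can think of the (?P>...) syntax as a bit like a backreference, except that the engine does not try to match what the group ... matched, but instead applies its subpattern again.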
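
To make the recursion a bit more concrete, here is a rough Python sketch that mirrors the normal* (?:special normal*)* structure as a hand-written recursive function. It is only an illustration of the idea, not how PCRE works internally. The names match_parenthesized and after_last_unmatched are made up for this sketch, and it is plain Python because the built-in re module has no recursion syntax.

```python
def match_parenthesized(s, i=0):
    """Return the index just past the longest correctly nested run starting at s[i]."""
    n = len(s)
    # normal*: consume characters that are not parentheses
    while i < n and s[i] not in "()":
        i += 1
    # (?: special normal* )*
    while i < n and s[i] == "(":
        # special: "(" + recursive call + ")"
        j = match_parenthesized(s, i + 1)
        if j >= n or s[j] != ")":
            break  # this "(" is never closed, so it is not part of the run
        i = j + 1
        # normal*
        while i < n and s[i] not in "()":
            i += 1
    return i


def after_last_unmatched(s):
    """Mimic (?<=[(])(?P>parenthesized)?$ by trying each start position left to right."""
    for start in range(1, len(s) + 1):
        if s[start - 1] == "(" and match_parenthesized(s, start) == len(s):
            return s[start:]
    return None  # every parenthesis is matched


print(after_last_unmatched("aaa(y=bbb(xxx)"))
print(after_last_unmatched("aaa(bbb(x), ccc(xxx"))
```

Like the pattern, this returns everything after the last unmatched parenthesis, including any commas, and reports no match for fully balanced input.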

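Also note that my pattern will not give you an empty string for correctly parenthesized input, but will fail instead. You could fix this by leaving out the lookbehind. The lookbehind is actually not necessary, because the engine will always return the leftmost match.

EDIT: Judging from two of your examples, you don't actually want everything after the last unmatched parenthesis, but only everything up to the first comma. You can take my result and split on , or try Bohemian's answer.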

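Further reading:

PCRE subpatterns (including named groups)
PCRE recursion
"Unrolling the loop" was introduced by Jeffrey Friedl in his book Mastering Regular Expressions, but I think the post I linked above gives a good overview.
Using (?(DEFINE)...) is actually abusing another feature called conditional patterns. The PCRE man page explains how it works. Just search the page for "Defining subpatterns for use by reference only".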

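EDIT: I noticed that you mentioned in your question that you are using Object Pascal. In that case you are probably not actually using PCRE, which means there is no support for recursion. Then there can be no complete regex solution to the problem. If we impose a limitation such as "there can only be one more nesting level after the last unmatched parenthesis" (as in all of your examples), then we can come up with a solution. Again, I'll use "unrolling the loop" to match substrings of the form xxx(xxx)xxx(xxx)xxx.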

(?<=[(])         # make sure we start after an opening (
(?=              # lookahead checks that the parenthesis is not matched
  [^()]*([(][^()]*[)][^()]*)*
                 # this matches an arbitrarily long chain of parenthesized
                 # substring, but allows only one nesting level
  $              # make sure we can reach the end of the string like this
)                # end of lookahead
[^(),]*([(][^()]*[)][^(),]*)*
                 # now actually match the desired part. this is the same
                 # as the lookahead, except we do not allow for commas
                 # outside of parentheses now, so that you only get the
                 # first comma-separated part
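
As a quick sanity check, the pattern above can be tried in Python's re module, which supports lookarounds even though it lacks recursion. This is only a sketch: the helper name first_open_argument is made up here, and the inputs are taken from the tables in the question.

```python
import re

# The one-nesting-level pattern from above, in free-spacing mode.
ONE_LEVEL = re.compile(r"""
    (?<=[(])                         # start right after an opening (
    (?=                              # the ( must stay unmatched up to the end
      [^()]*([(][^()]*[)][^()]*)*
      $
    )
    [^(),]*([(][^()]*[)][^(),]*)*    # first comma-separated part
""", re.VERBOSE)


def first_open_argument(text):
    """Return the matched text, or None if the pattern does not match."""
    m = ONE_LEVEL.search(text)
    return m.group(0) if m else None


for sample in ["aaa(xxx, yyy", "aaa(bbb(x), ccc(x),", "aaa(bbb(x, y)", "aaa(x)"]:
    print(repr(sample), "->", repr(first_open_argument(sample)))
```

For fully balanced input such as aaa(x) the pattern does not match at all, so the helper returns None rather than an empty string.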

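If you ever add an input example like aaa(xxx(yyy()) where you want to match xxx(yyy()), then this approach will not match it. In fact, no regex that does not use recursion can handle arbitrary nesting levels.

Since your regex flavor does not support recursion, you are probably better off not using regex at all. Even though my last regex matches all of your current input examples, it is really convoluted and maybe not worth the trouble. How about this instead: walk the string character by character and maintain a stack of parenthesis positions. Then the following pseudocode gives you everything after the last unmatched (: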

while you can read another character from the string
    if that character is "(", push the current position onto the stack
    if that character is ")", pop a position from the stack
# you've reached the end of the string now
if the stack is empty, there is no match
else the top of the stack is the position of the last unmatched parenthesis;
     take a substring from there to the end of the string
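
The pseudocode above could look roughly like this in Python. This is a sketch, not the asker's Object Pascal code: the function name is made up, a stray ")" with an empty stack is simply ignored (an assumption, since the pseudocode does not say), and the substring starts one character past the parenthesis so that the "(" itself is not included.

```python
def after_last_unmatched_paren(text):
    """Return the text after the last unmatched "(", or None if there is none."""
    stack = []  # positions of "(" that have not been closed yet
    for pos, ch in enumerate(text):
        if ch == "(":
            stack.append(pos)
        elif ch == ")" and stack:  # ignore a ")" that has nothing to close
            stack.pop()
    if not stack:
        return None  # every "(" was closed: no match
    # the top of the stack is the last unmatched "("
    return text[stack[-1] + 1:]


print(after_last_unmatched_paren("aaa(bbb(x), ccc(xxx"))
print(after_last_unmatched_paren("aaa(xxx, yyy"))
```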

Then, to get everything up to the first unnested comma, you can walk through that result again:

nestingLevel = 0
while you can read another character from the string
    if that character is "," and nestingLevel == 0, stop
    if that character is "(" increment nestingLevel
    if that character is ")" decrement nestingLevel
take a substring from the beginning of the string to the position at which
  you left the loop
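
A possible Python version of this second loop is sketched below. The function name up_to_first_top_level_comma is hypothetical, and the input is assumed to be the result of the first pass, that is, the text after the last unmatched parenthesis.

```python
def up_to_first_top_level_comma(text):
    """Return the part of text before the first comma that is not inside parentheses."""
    nesting_level = 0
    for pos, ch in enumerate(text):
        if ch == "," and nesting_level == 0:
            return text[:pos]  # stop at the first unnested comma
        if ch == "(":
            nesting_level += 1
        elif ch == ")":
            nesting_level -= 1
    return text  # no unnested comma: keep everything


print(up_to_first_top_level_comma("bbb(x, y), ccc(x)"))
print(up_to_first_top_level_comma("xxx, yyy"))
```

Because the comma test only fires at nesting level 0, commas inside bbb(x, y) are skipped, which covers the two extra cases from the question.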

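These two short loops will be much easier for anyone else to understand in the future, and they are a lot more flexible than a regex solution (at least one without recursion).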

score 1 · Accepted Answer

Use a lookahead:

(?<=\()(?=([^()]*\([^()]*\))*[^()]*$).*?(\(.*?\))?(?=[ ,]|$)(?! <-)(?<! <-)

See this running on rubular, passing all test cases posted in the question.
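
For readers without access to rubular, the same pattern can also be tried with Python's re module, which supports the lookarounds used here. This is just a sketch: the helper name call_tip_argument is made up, and only a handful of the question's inputs are shown.

```python
import re

# The pattern from this answer, copied unchanged.
PATTERN = re.compile(
    r"(?<=\()(?=([^()]*\([^()]*\))*[^()]*$).*?(\(.*?\))?(?=[ ,]|$)(?! <-)(?<! <-)"
)


def call_tip_argument(text):
    """Return the matched text, or None if the pattern does not match."""
    m = PATTERN.search(text)
    return m.group(0) if m else None


for sample in ["aaa(xxx, yyy", "aaa(y <- bbb(xxx)", "aaa(bbb(x, y)", "aaa(bbb(x, y, z)"]:
    print(repr(sample), "->", repr(call_tip_argument(sample)))
```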
