
1. About Python regex 2019.02.21

The regex module is a third-party alternative to Python's built-in re module. The release I'm using is from Feb 21, 2019. You can consult it here: https://pypi.org/project/regex/

It is meant as an eventual replacement for the re module. For now, you need to install it manually with pip install regex and import the regex module instead of re.

 

2. New regex feature

The coolest feature about the newest version is Recursive patterns. Read more about it here: https://bitbucket.org/mrabarnett/mrab-regex/issues/27

This feature makes it possible to find matching parentheses ( .. ) or curly brackets { .. }. The following webpage explains how to do that: https://www.regular-expressions.info/recurse.html#balanced I quote:

The main purpose of recursion is to match balanced constructs or nested constructs. The generic regex is b(?:m|(?R))*e where b is what begins the construct, m is what can occur in the middle of the construct, and e is what can occur at the end of the construct. For correct results, no two of b, m, and e should be able to match the same text. You can use an atomic group instead of the non-capturing group for improved performance: b(?>m|(?R))*e.  
 
A common real-world use is to match a balanced set of parentheses. \((?>[^()]|(?R))*\) matches a single pair of parentheses with any text in between, including an unlimited number of parentheses, as long as they are all properly paired.
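
As a minimal sketch of the quoted pattern in the regex module (the sample string is made up for illustration, and regex has to be installed with pip install regex):

```python
import regex  # third-party module: pip install regex

# b = "(", m = any single character that is not a parenthesis, e = ")"
balanced = regex.compile(r"\((?>[^()]|(?R))*\)")

# Hypothetical sample input, not taken from the quoted page
sample = "f(a, (b + c) * (d)) and g(x)"

# finditer yields one match per outermost balanced pair
matches = [m.group() for m in balanced.finditer(sample)]
print(matches)  # ['(a, (b + c) * (d))', '(x)']
```

Here (?R) re-enters the whole pattern, which works because the whole pattern is exactly one balanced pair.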

 

3. My question

I'm experimenting with matching curly brackets { .. }. So I simply apply the regex from the webpage above, but I replace ( by {. That gives me the following regex:

{(?>[^{}]|(?R))*}

I try it on https://regex101.com and get beautiful results(*):

[screenshot of the regex101 match results]

I want to take it one step further, and find a specific set of matching curly braces, like so:

MEMORY\s*{(?>[^{}]|(?R))*}

The result is great:

[screenshot of the regex101 match results]

But when I try

SECTIONS\s*{(?>[^{}]|(?R))*}

Nothing is found: no match at all. The only difference between the MEMORY{..} and SECTIONS{..} parts is that the latter contains nested curly braces, so the problem must lie there. But I don't know how to fix it.


* Note 1:
On https://regex101.com, you can select the flavor of the regex. Usually I select Python, but this time I selected PCRE(PHP) because the Python flavor on regex101 doesn't support the features of the new regex module yet.  
To confirm the results, I also tried it in a simple Python session in my terminal, with commands like:
import regex
p = regex.compile(r"...")
text = """ ... """
p.findall(text)

* Note 2:
The text I use for testing is:

MEMORY
{
    /* Foobar */
    consectetur adipiscing elit,
    sed do eiusmod tempor incididunt
}
Lorem ipsum dolor sit amet,

SECTIONS
{
    ut labore et dolore magna aliqua.
    /* Foobar */
        FOO
        {
            /* Foobar */
            Ut enim ad minim veniam,
            quis nostrud exercitation ullamco
        }

        BAR
        {
            /* Foobar */
            laboris nisi
            ut
        }
    aliquip ex ea commodo consequat.
}
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

1 Answer

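You are recursing the whole pattern with the (?R) construct, while you only want to recurse the {...} subpattern. With (?R), every nested pair of braces would have to be preceded by SECTIONS again, which is why nothing matches. Wrap the brace subpattern in a capturing group and recurse it with a subroutine call, (?1):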

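import regex

# (?1) recurses only into capture group 1, i.e. the {...} part
p = regex.compile(r"SECTIONS\s*({(?>[^{}]|(?1))*})")
for m in p.finditer(text):  # text is the test input from the question
    print(m.group())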

See the Python regex demo online.
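
To see the difference side by side, here is a small self-contained sketch. The one-line input is a made-up stand-in for the question's test text, and the braces are escaped only to keep the pattern unambiguous to read:

```python
import regex  # third-party module: pip install regex

# Hypothetical, shortened input with one level of nesting
text = "SECTIONS { a { b } c } tail"

# (?R) recurses the WHOLE pattern, so the inner "{ b }" would need its own
# "SECTIONS" prefix and the match fails.
whole = regex.compile(r"SECTIONS\s*\{(?>[^{}]|(?R))*\}")

# (?1) recurses only capture group 1, i.e. the balanced {...} part.
sub = regex.compile(r"SECTIONS\s*(\{(?>[^{}]|(?1))*\})")

whole_match = whole.search(text)
sub_match = sub.search(text)

print(whole_match)        # None
print(sub_match.group())  # SECTIONS { a { b } c }
```

Note that once the pattern contains a capturing group, findall returns the contents of that group rather than the whole match, which is why the snippet above uses finditer or search together with group().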

Note that your first pattern has the same issue: it will stop matching as soon as you add nested curly braces inside the MEMORY block. Fix it as MEMORY\s*({(?>[^{}]|(?1))*}).
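
Since both patterns differ only in the keyword, one possible way to avoid repeating the fix is a small helper. This is only a sketch: block_pattern and the group name braces are names I made up, the sample text is invented, and it assumes the regex module's named subroutine call (?&name):

```python
import regex  # third-party module: pip install regex


def block_pattern(keyword):
    """Hypothetical helper: match `keyword` followed by a balanced {...} block."""
    return regex.compile(
        regex.escape(keyword)
        # (?&braces) recurses only into the named group, not the whole pattern
        + r"\s*(?P<braces>\{(?>[^{}]|(?&braces))*\})"
    )


# Made-up sample: here MEMORY contains a nested block as well
text = """MEMORY
{
    RAM { origin = 0 }
}
SECTIONS
{
    FOO { a }
    BAR { b }
}
"""

memory = block_pattern("MEMORY").search(text)
sections = block_pattern("SECTIONS").search(text)

print(memory.group("braces"))
print(sections.group("braces"))
```

The named group reads a little more clearly than (?1) and keeps working if other capturing groups are added in front of it.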

Answered 2019-03-07T14:25:52.403