regex - Regular expression greedy match not working as expected

Question

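I have a very basic regular expression and I just can't figure out why it isn't working, so this is a two-part question: why doesn't my current version work, and what is the correct expression?

The rules are pretty simple:

There must be at least 3 characters.
If the % character is the first character, there must be at least 4 characters.

So the following cases should work out as follows: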

AB - fail
ABC - pass
ABCDEFG - pass
% - fail
%AB - fail
%ABC - pass
%ABCDEFG - pass
%%AB - pass

The expression I am using is:

^%?\S{3}

Which to me means:

^ - start of the string
%? - greedy check for 0 or 1 % character
\S{3} - 3 non-whitespace characters

The problem is, the %? for some reason is not doing a greedy check. It's not eating the % character if it exists so the '%AB' case is passing which I think should be failing. Why is the %? not eating the % character?

Someone please show me the light :)

Edit: The answer I used was Dav's below: ^(%\S{3}|[^%\s]\S{2}). It was really a two-part answer, though, and Alan's is the one that made me understand why. I didn't use his version, ^(?>%?)\S{3}, because although it worked, it did not work in the JavaScript implementation. Both are great answers and a lot of help.

score 9 · Accepted Answer

Regex will always try to match the whole pattern if it can - "greedy" doesn't mean "will always grab the character if it exists", but instead means "will always grab the character if it exists and a match can be made with it grabbed".

Instead, what you probably want is something like this:

^(%\S{3}|[^%\s]\S{2})

This matches either a % followed by 3 non-whitespace characters, or a non-%, non-whitespace character followed by 2 more.
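
As a quick check, here is a minimal sketch that runs this pattern against the cases listed in the question. Python's re module is an assumption on my part (the thread is not about Python), and the passes helper and CASES table are names made up for the illustration.

```python
import re

# Pattern copied from the answer above.
PATTERN = re.compile(r"^(%\S{3}|[^%\s]\S{2})")

# Test strings and expected outcomes, taken from the question.
CASES = {
    "AB": False,
    "ABC": True,
    "ABCDEFG": True,
    "%": False,
    "%AB": False,
    "%ABC": True,
    "%ABCDEFG": True,
    "%%AB": True,
}

def passes(text: str) -> bool:
    """Return True if the pattern matches at the start of text."""
    return PATTERN.search(text) is not None

for text, expected in CASES.items():
    got = "pass" if passes(text) else "fail"
    want = "pass" if expected else "fail"
    print(f"{text!r}: {got} (expected {want})")
```

Every case should agree with the expected column, including %AB, which the original ^%?\S{3} wrongly accepts.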

score 9 · Answer

The word for the behavior you described isn't greedy, it's possessive. Normal, greedy quantifiers match as much as they can originally, but back off if necessary to allow the whole regex to match (I like to think of them as greedy but accommodating). That's what's happening to you: the %? originally matches the leading percent sign, but if there aren't enough characters left for an overall match, it gives up the percent sign and lets \S{3} match it instead.
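
To make the give-back visible, here is a small sketch in Python (an assumed flavor, chosen only for illustration). It is the pattern from the question with two capture groups added, which is my addition, so you can see which part of the regex ended up with the percent sign. The split_match helper is a hypothetical name.

```python
import re

# The question's pattern, with capture groups added purely for inspection.
pattern = re.compile(r"^(%?)(\S{3})")

def split_match(text):
    """Return (what %? matched, what \\S{3} matched), or None if no match."""
    m = pattern.search(text)
    return m.groups() if m else None

print(split_match("%ABC"))  # ('%', 'ABC')  -> %? keeps the percent sign
print(split_match("%AB"))   # ('', '%AB')   -> %? gave it up to \S{3}
print(split_match("AB"))    # None          -> too short either way
```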

Some regex flavors (including Java and PHP) support possessive quantifiers, which never back off, even if that causes the overall match to fail. .NET doesn't have those, but it has the next best thing: atomic groups. Whatever you put inside an atomic group acts like a separate regex--it either matches at the position where it's applied or it doesn't, but it never goes back and tries to match more or less than it originally did just because the rest of the regex is failing (that is, the regex engine never backtracks into the atomic group). Here's how you would use it for your problem:

^(?>%?)\S{3}

If the string starts with a percent sign, the (?>%?) matches it, and if there aren't enough characters left for \S{3} to match, the regex fails.
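
For flavors without atomic groups, a commonly used workaround is to emulate one with a lookahead plus a backreference. The sketch below is an illustration in Python, not part of the original answer; it relies on the engine not backtracking into a lookahead once it has succeeded, which holds for Python's re module. (Python 3.11 and later also accept (?>%?) directly.) The names ATOMIC_LIKE and passes are made up for the example.

```python
import re

# Emulated atomic group: the lookahead captures an optional '%', and \1 then
# consumes exactly what was captured. The engine does not backtrack into a
# completed lookahead, so the '%' is never handed back to \S{3}.
ATOMIC_LIKE = re.compile(r"^(?=(%?))\1\S{3}")

def passes(text: str) -> bool:
    """Return True if the emulated-atomic pattern matches at the start of text."""
    return ATOMIC_LIKE.search(text) is not None

for text in ["AB", "ABC", "ABCDEFG", "%", "%AB", "%ABC", "%ABCDEFG", "%%AB"]:
    print(f"{text!r}: {'pass' if passes(text) else 'fail'}")
```

With this pattern %AB fails, because once the lookahead has committed to the percent sign only two characters remain for \S{3}.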

Note that atomic groups (or possessive quantifiers) are not necessary to solve this problem, as @Dav demonstrated. But they're very powerful tools which can easily make the difference between impossible and possible, or too damn slow and slick as can be.

score 1 · Answer

I always love to look at RE questions to see how much time people spend on them to "Save time"

str.len() >= (str[0] == '%' ? 4 : 3)

Although in real life I'd be more explicit, I just wrote it that way because for some reason some people consider code brevity an advantage (I'd call it an anti-advantage, but that's not a popular opinion right now)
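
A more explicit version of the same idea might look like the sketch below, written in Python as an assumption since the one-liner above is language-neutral pseudocode. Like the one-liner, it only checks length and does not reject whitespace the way \S does; is_valid is a hypothetical name.

```python
def is_valid(text: str) -> bool:
    """Length-only check: 4+ characters if text starts with '%', otherwise 3+."""
    # startswith is safe on an empty string, unlike indexing text[0].
    required = 4 if text.startswith("%") else 3
    return len(text) >= required

for text in ["AB", "ABC", "%", "%AB", "%ABC", "%%AB"]:
    print(f"{text!r}: {'pass' if is_valid(text) else 'fail'}")
```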

score 0 · Answer

Try the regex modified a little based on Dav's original one:

^(%\S{3,}|[^%\s]\S{2,})

with the regex option "^ and $ match at line breaks" on.
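
As a rough illustration of that option, here is a sketch using Python's re.MULTILINE flag, which I am assuming corresponds to "^ and $ match at line breaks"; the answer does not name a flavor. The sample text and the valid_lines name are made up.

```python
import re

# With MULTILINE, ^ matches at the start of every line, not just the string.
pattern = re.compile(r"^(%\S{3,}|[^%\s]\S{2,})", re.MULTILINE)

# One candidate per line (hypothetical sample input).
text = "AB\nABC\n%AB\n%ABCDEFG"

# findall returns the contents of the single capture group for each match.
valid_lines = pattern.findall(text)
print(valid_lines)  # ['ABC', '%ABCDEFG']
```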
