
I have a very simple regular expression, something like:

HOHO.*?_HO_

With this test string...

fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev

  • I expected it to match only _HOHO___HO_ (the shortest match, non-greedy).
  • Instead, it matches _HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_ (the longest match, which looks greedy).

Why? How can I make it match the shortest match?

Adding or removing the ? gives the same result.
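To make the observation reproducible, here is a minimal sketch that runs the pattern exactly as written above, with and without the ?. The variable names are only illustrative. Note that the literal pattern starts at HOHO, so the match it reports does not include the leading underscore.

```javascript
const questionInput = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev";

// Lazy quantifier, as in the question.
const lazyMatch = questionInput.match(/HOHO.*?_HO_/);

// Same pattern with the ? removed (greedy quantifier).
const greedyMatch = questionInput.match(/HOHO.*_HO_/);

console.log(lazyMatch[0]);                    // HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_
console.log(lazyMatch.index);                 // 6 (the first HOHO in the string)
console.log(lazyMatch[0] === greedyMatch[0]); // true: only one _HO_ exists, so both end there
```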

Edit - a better test string that shows why [^HOHO] doesn't work: fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO_H_O_H_O_HO_fbguye


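All I can think of is that it might be matching multiple times - but there is only one match for _HO_, so I don't understand why it doesn't take the shortest match that ends in _HO_ and discard the rest.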

I've looked through every question I could find with a title like "non-greedy regex is greedy", but they all seem to have some other problem.


3 Answers


I found a solution with the help of "Regex lazy vs greedy chaos".

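In the regex engine that JavaScript uses (an NFA engine, I believe), non-greedy only gives you the shortest match going left to right - from the first left-hand match to the nearest right-hand match.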

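If one right-hand match has many candidate left-hand matches, the engine always starts from the first one it reaches (which actually gives the longest match).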

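Essentially, it walks through the string one character at a time, asking "Is there a match starting at this character? If so, take the shortest one and finish. If not, move to the next character and repeat." I expected it to ask "Is there a match anywhere in this string? If so, take the shortest of all of them."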
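The two strategies described above can be sketched roughly as follows. Both functions are hypothetical helpers written for illustration, not part of any regex API; they use the sticky flag (y) to force an attempt at one specific start index, which only approximates what the engine does internally.

```javascript
// Hypothetical helper: try each start index in turn and return the first
// match found. This mirrors the "first position wins" behaviour.
function leftmostMatch(source, text) {
  const anchored = new RegExp(source, "y"); // sticky: match only at lastIndex
  for (let start = 0; start <= text.length; start++) {
    anchored.lastIndex = start;
    const found = anchored.exec(text);
    if (found !== null) {
      return { start, match: found[0] };
    }
  }
  return null;
}

// Hypothetical helper: try every start index and keep the shortest match.
// This is the behaviour the question expected, not what the engine does.
function shortestOverall(source, text) {
  const anchored = new RegExp(source, "y");
  let best = null;
  for (let start = 0; start <= text.length; start++) {
    anchored.lastIndex = start;
    const found = anchored.exec(text);
    if (found !== null && (best === null || found[0].length < best.match.length)) {
      best = { start, match: found[0] };
    }
  }
  return best;
}

const scanInput = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev";
console.log(leftmostMatch("HOHO.*?_HO_", scanInput).start);   // 6
console.log(shortestOverall("HOHO.*?_HO_", scanInput).match); // HOHO___HO_
```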


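You can approximate a regex that is non-greedy in both directions by replacing the . with a negation meaning "not the left-hand match". Negating a string like this takes a negative lookahead and a non-capturing group, but it is as simple as dropping the string into (?:(?!).). For example, (?:(?!HOHO).)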

For example, the equivalent of HOHO.*?_HO_ that is non-greedy on both the left and the right is:

HOHO(?:(?!HOHO).)*?_HO_

So the regex engine is basically going through each character like this:

  • HOHO - Does this match the left side?
  • (?:(?!HOHO).)* - If so, can I reach the right-hand side without any repeats of the left side?
  • _HO_ - If so, grab everything until the right-hand match
  • ? modifier on * or + - If there are multiple right-hand matches, choose the nearest one
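As a quick check, this sketch runs the lookahead pattern against both test strings from the question, and contrasts it with the character-class attempt [^HOHO], which only excludes the single characters H and O. Variable names are illustrative.

```javascript
const firstInput = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev";
const secondInput = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO_H_O_H_O_HO_fbguye";

// Each consumed character must not be the start of another HOHO.
const tempered = /HOHO(?:(?!HOHO).)*?_HO_/;

console.log(firstInput.match(tempered)[0]);  // HOHO___HO_
console.log(secondInput.match(tempered)[0]); // HOHO_H_O_H_O_HO_

// [^HOHO] is a character class equal to [^HO], so it rejects any lone H or O
// between the two sides and finds nothing in the second string.
console.log(secondInput.match(/HOHO[^HOHO]*?_HO_/)); // null
```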
answered 2014-12-09T18:15:29.760

Why does it match the whole string?

This is because regular-expression pattern matching is done by finding the first position in the string at which a match is possible. Since a match is possible starting at the first character of the string, shorter matches starting at subsequent characters are never even considered.

Example:
Consider the regular expression /a+?b/ and the test string "aaaaaaaaab". Applied to that string, it matches the whole string, not just the last a and the b, because the first position in the string where a match is possible is the first a.

So, if you want to match ab in aaaaaaaaab, use a negated character class based regex rather than a lazy dot:

a[^ab]*b

See the regex demo.
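A small sketch of the example above, so both patterns can be compared side by side (the variable names are illustrative):

```javascript
const run = "aaaaaaaaab";

// Lazy quantifier: the match still starts at the first a.
const lazyRun = run.match(/a+?b/);

// Negated character class: an a may only be followed by non-a, non-b
// characters before the b, so earlier start positions fail.
const classRun = run.match(/a[^ab]*b/);

console.log(lazyRun[0] === run);                // true
console.log(classRun[0]);                       // ab
console.log(classRun.index === run.length - 2); // true
```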

Source: Javascript: The Definitive Guide, Sixth Edition, Page Number: 255

answered 2014-12-10T11:06:47.260

The result is non-greedy, because it's the shortest match from the first occurrence of HOHO until _HO_ is reached; the engine traverses the string from left to right and because it doesn't have to backtrack, it won't attempt to shorten anything.

To make it work in the way that's expected here, you need to have a greedy prefix in your expression:

/.*(HOHO.*?_HO_)/

The first memory capture contains the string that you're after; the greedy prefix will try to skip as many characters as possible, so it will match the last occurrence of HOHO first.
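A minimal sketch of the greedy-prefix approach, run against both test strings from the question. The wanted text is read from capture group 1 rather than from the full match. It assumes the input has no line breaks, since . does not match them without the s flag.

```javascript
const prefixed = /.*(HOHO.*?_HO_)/;

const inputA = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev";
const inputB = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO_H_O_H_O_HO_fbguye";

// .* first swallows the whole string, then backs off one character at a
// time until the group can match, so the group starts at the last usable HOHO.
console.log(inputA.match(prefixed)[1]); // HOHO___HO_
console.log(inputB.match(prefixed)[1]); // HOHO_H_O_H_O_HO_
```

Note that the full match (index 0) still begins at the start of the string; only the capture group holds the short result.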

answered 2014-12-10T11:03:41.320