regex - Regex with Non-capturing Group

Question

I am trying to understand Non-capturing groups in Regex.

If I have the following input:

He hit the ball.  Then he ran.  The crowd was cheering!  How did he feel?  I felt so energized!

To extract the first word of each sentence, I tried the match pattern:

^(\w+\b.*?)|[\.!\?]\s+(\w+)

That puts the desired output in the submatch.

Match   $1
He      He  
. Then  Then
. The   The
! How   How
? I     I

But I thought that by using non-capturing groups I should be able to get the words back in the match itself.

I tried:

^(?:\w+\b.*?)|(?:[\.!\?]\s+)(\w+)

and that yielded:

Match   $1
He  
. Then  Then
. The   The
! How   How
? I     I

and ^(?:\w+\b.*?)|(?:[.!\?]\s+)\w+

yielded:

Match
He
. Then
. The
! How
? I

What am I missing?

(I am testing my regex using RegExLib.com, but will then transfer it to VBA).

score 6 · Accepted Answer

A simple example against the string "foo":

(f)(o+)

will produce $1 = 'f' and $2 = 'oo';

(?:f)(o+)

Here, $1 = 'oo' because you have explicitly said not to capture the first group, and there is no second capture group.

For your case, this feels about right:

(?:(\w+).*?[\.\?!] {2}?)

Note that the outermost group is non-capturing, while the inner group (the first word of the sentence) is capturing.
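To see the "foo" example in VBA terms, here is a minimal sketch. It assumes the VBScript.RegExp object is available through late binding, and `DescribeFirstMatch` is a hypothetical helper name, not part of the answer. It returns the full match followed by each numbered submatch.

```vba
'Sketch only: DescribeFirstMatch is a hypothetical helper for illustration.
'Returns the first match and its submatches as one string, or "" if no match.
Function DescribeFirstMatch(ByVal sPattern As String, ByVal sText As String) As String
    Dim oRegex As Object, oMatches As Object, oMatch As Object
    Dim sOut As String, j As Long

    Set oRegex = CreateObject("VBScript.RegExp") 'late binding, no reference needed
    oRegex.Pattern = sPattern
    Set oMatches = oRegex.Execute(sText)
    If oMatches.Count = 0 Then Exit Function

    Set oMatch = oMatches.Item(0)
    sOut = "Match=" & oMatch.Value
    'Only capturing groups show up in SubMatches; (?:...) groups are skipped.
    For j = 0 To oMatch.SubMatches.Count - 1
        sOut = sOut & " $" & (j + 1) & "=" & oMatch.SubMatches(j)
    Next j
    DescribeFirstMatch = sOut
End Function

Sub DemoGroups()
    Debug.Print DescribeFirstMatch("(f)(o+)", "foo")   'Match=foo $1=f $2=oo
    Debug.Print DescribeFirstMatch("(?:f)(o+)", "foo") 'Match=foo $1=oo
End Sub
```

In both cases the match itself is still "foo": a non-capturing group is left out of the numbered submatches, but the text it matched stays in the overall match. That is the same effect as the ". Then" entries in the Match column of the question.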

score 1 · Accepted Answer

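The following builds a non-capturing group for the boundary condition and uses a capturing group for the word that comes after it.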

(?:^|[.?!]\s*)(\w+)

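It isn't clear from your question how you are applying the regex to the text, but your usual "pull out another one until there are no more matches" loop should work.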
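Since the question mentions moving to VBA, one way that loop might look is sketched below. It assumes late binding to VBScript.RegExp, and `FirstWords` is a hypothetical function name. With Global set to True, Execute returns every match at once, so the loop walks the collection and keeps only the first submatch.

```vba
'Sketch only: FirstWords is a hypothetical name, not from the answer.
'Collects the word captured by (\w+) from every match of the pattern.
Function FirstWords(ByVal sText As String) As Collection
    Dim oRegex As Object, oMatch As Object
    Dim result As New Collection

    Set oRegex = CreateObject("VBScript.RegExp") 'late binding, no reference needed
    oRegex.Pattern = "(?:^|[.?!]\s*)(\w+)"
    oRegex.Global = True 'find all matches, not just the first

    For Each oMatch In oRegex.Execute(sText)
        'SubMatches(0) is the word; oMatch.Value would still include the punctuation.
        result.Add oMatch.SubMatches(0)
    Next oMatch

    Set FirstWords = result
End Function

Sub DemoFirstWords()
    Dim w As Variant
    For Each w In FirstWords("He hit the ball.  Then he ran.  The crowd was cheering!  How did he feel?  I felt so energized!")
        Debug.Print w 'He, Then, The, How, I (one per line)
    Next w
End Sub
```

Because `\s*` also allows zero whitespace, this pattern would pick up text directly after a period as well (for example the digits after a decimal point), so `\s+` may be the safer choice for other input.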

score 0 · Accepted Answer

This works and is simple:

([A-Z])\w*

VBA needs these flags set:

Global = True 'Match all occurrences not just first
IgnoreCase = False 'First word of each sentence starts with a capital letter

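Here is some extra hard-won information: since your regex has at least one set of parentheses, you can use Submatches to extract only the values inside the parentheses and ignore the rest, which is very useful. This is the debug output of a function I use to get submatches, run on your string: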

theMatches.Count=5
Match='He'
   Submatch Count=1
   Submatch='H'
Match='Then'
   Submatch Count=1
   Submatch='T'
Match='The'
   Submatch Count=1
   Submatch='T'
Match='How'
   Submatch Count=1
   Submatch='H'
Match='I'
   Submatch Count=1
   Submatch='I'

T

Here is the call to the function that returned the above:

sText = "He hit the ball.  Then he ran.  The crowd was cheering!  How did he feel?  I felt so energized!"
sRegEx = "([A-Z])\w*"
Debug.Print ExecuteRegexCapture(sText, sRegEx, 2, 0) '3rd match, 1st Submatch

Here is the function:

'Returns Submatch specified by the passed zero-based indices:
'iMatch is which match you want,
'iSubmatch is the index within the match of the parenthesis
'containing the desired results.
Function ExecuteRegexCapture(sStringToSearch, sRegEx, iMatch, iSubmatch)
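   'New RegExp needs a reference to "Microsoft VBScript Regular Expressions 5.5"
   '(alternatively use CreateObject("VBScript.RegExp") for late binding).
   Dim oRegex As Object, theMatches As Object
   Dim bDebug As Boolean, i As Long, j As Long
   Set oRegex = New RegExp
   oRegex.Pattern = sRegEx
   oRegex.Global = True 'True = find all matches, not just first
   oRegex.IgnoreCase = False
   oRegex.Multiline = True 'True = ^ and $ also match at the start and end of each line
   bDebug = True 'Set to False to silence the debug output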

   ExecuteRegexCapture = ""

   Set theMatches = oRegex.Execute(sStringToSearch)
   If bDebug Then Debug.Print "theMatches.Count=" & theMatches.Count

   For i = 0 To theMatches.Count - 1
      If bDebug Then Debug.Print "Match='" & theMatches(i) & "'"
      If bDebug Then Debug.Print "   Submatch Count=" & theMatches(i).SubMatches.Count
      For j = 0 To theMatches(i).SubMatches.Count - 1
         If bDebug Then Debug.Print "   Submatch='" & theMatches(i).SubMatches(j) & "'"
      Next j
   Next i

   If bDebug Then Debug.Print ""

   If iMatch < theMatches.Count Then
      If iSubmatch < theMatches(iMatch).SubMatches.Count Then
         ExecuteRegexCapture = theMatches(iMatch).SubMatches(iSubmatch)
      End If
   End If
End Function
