python - Python regex returns an extra capture group for the last matched character

Question

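I am trying to create a regex that takes strings and splits them into three groups: (1) any one of a specific list of words at the start of the string, (2) any one of a specific list of words at the end of the string, and (3) all the letters/spaces between those two matches.

As an example, I will use the following two strings: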

'There was a cat in the house yesterday'
'Did you see a cat in the house today'

I want the strings broken into capture groups so that m.groups() on the match object returns the following for each string, respectively:

('There', ' was a cat in the house ', 'yesterday')
('Did', ' you see a cat in the house ', 'today')

Initially, I came up with the following regex:

r = re.compile('^(There|Did) ( |[A-Za-z])+ (today|yesterday)$')

However, this returns:

('There', 'e', 'yesterday')
('Did', 'e', 'today')

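So it only gives me the last character matched in the middle group. I learned that this does not work because a repeated capture group only returns the last iteration it matched. So I wrapped the middle capture group in another pair of parentheses, like this: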

r = re.compile('^(There|Did) (( |[A-Za-z])+) (today|yesterday)$')

Now it at least captures the whole middle group, but it also returns an extra 'e' character in m.groups(), i.e.:

('There', 'was a cat in the house', 'e', 'yesterday')

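...I suspect this has something to do with backtracking, but I can't figure out why it happens. Can someone explain why I get this result, and how I can get the result I want?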

score 1 · Accepted Answer

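You can simplify your current regex and get the correct behavior by replacing the contents of the middle capture group with the . (dot) operator, which matches any character, and then using the * (asterisk) operator to repeat it: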

import re

s1 = 'There was a cat in the house yesterday'
s2 = 'Did you see a cat in the house today'

x = re.compile("(There|Did)(.*)(today|yesterday)")
g1 = x.search(s1).groups()
g2 = x.search(s2).groups()

print(g1)
print(g2)

This produces the following output (note that the middle group keeps the spaces on both sides):

('There', ' was a cat in the house ', 'yesterday')
('Did', ' you see a cat in the house ', 'today')
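One thing to keep in mind with this approach: `.*` accepts any character, and the pattern is not anchored, so it is more permissive than the letters-and-spaces rule in the question. The sketch below compares it with a stricter variant; the `strict` pattern and the sample string containing a digit are assumptions made up for illustration, not part of the original answer.

```python
import re

# Pattern from the answer above: the middle group matches any characters.
loose = re.compile(r'(There|Did)(.*)(today|yesterday)')
# Illustrative stricter variant: anchored, middle group limited to letters and spaces.
strict = re.compile(r'^(There|Did)([A-Za-z ]+)(today|yesterday)$')

s = 'There were 3 cats in the house today'  # contains a digit

loose_match = loose.search(s)
strict_match = strict.search(s)

print(loose_match.groups())  # ('There', ' were 3 cats in the house ', 'today')
print(strict_match)          # None
```

If digits or punctuation in the middle are acceptable, the simpler `.*` version is fine; otherwise the character class is the safer choice.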

score 1 · Accepted Answer

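A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations, or use a non-capturing group instead if you're not interested in the data.

Source: https://regex101.com/

Here is a reworked version that behaves as expected: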

^(There|Did) ([ A-Za-z]+) (today|yesterday)$
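To make the quoted explanation concrete, here is a minimal sketch comparing the question's first pattern with a version that wraps a non-capturing group `(?:...)` in a single capturing group. The variable names are just for illustration.

```python
import re

s = 'There was a cat in the house yesterday'

# Repeated capturing group: each iteration overwrites the previous one,
# so only the last matched character is kept.
repeated = re.compile(r'^(There|Did) ( |[A-Za-z])+ (today|yesterday)$')

# Non-capturing inner group, repeated inside one capturing group:
# the whole repeated span is captured and no extra group appears.
wrapped = re.compile(r'^(There|Did) ((?: |[A-Za-z])+) (today|yesterday)$')

repeated_groups = repeated.search(s).groups()
wrapped_groups = wrapped.search(s).groups()

print(repeated_groups)  # ('There', 'e', 'yesterday')
print(wrapped_groups)   # ('There', 'was a cat in the house', 'yesterday')
```

The extra 'e' in the question comes from the inner parentheses still being a capturing group; it is not caused by backtracking.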

score 1 · Accepted Answer

 r = re.compile('^(There|Did) (( |[A-Za-z])+) (today|yesterday)$')
                               ^ ^        ^

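You have some unnecessary things in there (marked above). Take those out and include the space in the middle group's character class: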

r = re.compile('^(There|Did) ([A-Za-z ]+) (today|yesterday)$')
                                     ^ space

Example:

>>> r = re.compile('^(There|Did) ([A-Za-z ]+) (today|yesterday)$')
>>> r.search('There was a a cat in the hosue yesterday').groups()
('There', 'was a a cat in the hosue', 'yesterday')

Also, if you want the spaces to be part of the middle (second) group, take out the spaces between the capture groups.
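A minimal sketch of that last suggestion, using the two strings from the question. With the literal spaces between the groups removed, the character class absorbs them, which gives the output the question asked for. The `results` list is only there for illustration.

```python
import re

# No literal spaces between the groups: the middle group captures them.
r = re.compile(r'^(There|Did)([A-Za-z ]+)(today|yesterday)$')

strings = [
    'There was a cat in the house yesterday',
    'Did you see a cat in the house today',
]

results = [r.search(s).groups() for s in strings]
for groups in results:
    print(groups)
# ('There', ' was a cat in the house ', 'yesterday')
# ('Did', ' you see a cat in the house ', 'today')
```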
