python - Empty groups during Python regular expression evaluation

Question

I am learning regular expressions in Python. I'd like to thank Jerry for his initial help with this problem. I tested this regular expression:

(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?[,;]\s*(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?

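On http://regex101.com/ it finds what I am looking for: the four words before a comma in a sentence and the four words after it. It does not break if there are only two or three words before the comma at the start of the sentence. The test sentence I am using is: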

waiting for coffee, waiting for coffee and the charitable crumb.

The regex now returns:

[('waiting', 'for', 'coffee', '', 'waiting', 'for', 'coffee', 'and')]

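I don't quite understand why the fourth member of the result is empty. In this case I want the regex to return only the 3 words before the comma and the 4 after it, but if there are four words before the comma I want it to return all four. I know regular expressions differ between languages; is this something I am missing in Python?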

score 4 · Accepted Answer

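You have optional groups:

(\b\w+\b)?

The question mark makes each group an optional match. Python, however, always returns every group in the pattern, and any group that did not match anything gets an empty value (usually None):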

>>> import re
>>> example = 'waiting for coffee, waiting for coffee and the charitable crumb.'
>>> pattern = re.compile(r'(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?[,;]\s*(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?')
>>> pattern.search(example).groups()
('waiting', 'for', 'coffee', None, 'waiting', 'for', 'coffee', 'and')

Note the None in the output: that is the 4th word group before the comma, which matches nothing because there are only 3 words to match. You must have used .findall(), which always returns strings, so the pattern group that didn't match is represented as an empty string instead.
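To see the difference in isolation, here is a minimal sketch using a much shorter, hypothetical two-group pattern (not the one from the question). It shows .groups() reporting None for the group that did not take part in the match, .groups(default='') substituting an empty string, and .findall() producing the empty string on its own:

```python
import re

# Hypothetical pattern for illustration: an optional word on each side of a comma.
pattern = re.compile(r'(\w+)?,(\w+)?')
text = 'coffee,'

match = pattern.search(text)
groups_default = match.groups()            # unmatched group is reported as None
groups_empty = match.groups(default='')    # same groups, with '' in place of None
found = pattern.findall(text)              # findall reports unmatched groups as ''

print(groups_default)  # ('coffee', None)
print(groups_empty)    # ('coffee', '')
print(found)           # [('coffee', '')]
```

With .search() you can therefore tell "did not match" (None) apart from "matched an empty string", which .findall() cannot show.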

Remove the question marks, and your pattern won't match your input example until you add that required 4th word before the comma:

>>> pattern_required = re.compile(r'(\b\w+\b)[^a-z]*(\b\w+\b)[^a-z]*(\b\w+\b)[^a-z]*(\b\w+\b)[,;]\s*(\b\w+\b)[^a-z]*(\b\w+\b)[^a-z]*(\b\w+\b)[^a-z]*(\b\w+\b)')
>>> pattern_required.findall(example)
[]
>>> pattern_required.findall('Not ' + example)
[('Not', 'waiting', 'for', 'coffee', 'waiting', 'for', 'coffee', 'and')]

If you need to match between 2 and 4 words, but do not want empty groups, you'll have to make one group match multiple words. You cannot have a variable number of groups; regular expressions do not work like that.

Matching multiple words in one group:

>>> pattern_variable = re.compile(r'(\b\w+\b)[^a-z]*((?:\b\w+\b[^a-z]*){1,3})[,;]\s*(\b\w+\b)[^a-z]*(\b\w+\b)[^a-z]*(\b\w+\b)[^a-z]*(\b\w+\b)')
>>> pattern_variable.findall(example)
[('waiting', 'for coffee', 'waiting', 'for', 'coffee', 'and')]
>>> pattern_variable.findall('Not ' + example)
[('Not', 'waiting for coffee', 'waiting', 'for', 'coffee', 'and')]

Here the (?:...) syntax creates a non-capturing group, one that does not produce output in the .findall() list; used here so we can put a quantifier on it. {1,3} tells the regular expression we want the preceding group to be matched between 1 and 3 times.

Note the output; the second group contains a variable number of words (between 1 and 3).
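If separate words are still wanted in the end, one possible follow-up is to split the multi-word group after matching. The sketch below reuses pattern_variable from above; the names first, middle, before and after are only illustrative, and re.findall(r'\w+', ...) is used instead of str.split() so that any punctuation inside the group is dropped:

```python
import re

example = 'waiting for coffee, waiting for coffee and the charitable crumb.'
pattern_variable = re.compile(
    r'(\b\w+\b)[^a-z]*((?:\b\w+\b[^a-z]*){1,3})[,;]\s*'
    r'(\b\w+\b)[^a-z]*(\b\w+\b)[^a-z]*(\b\w+\b)[^a-z]*(\b\w+\b)'
)

match = pattern_variable.search(example)
# Group 1 is the first word, group 2 holds 1 to 3 words, the rest follow the comma.
first, middle, *after = match.groups()
before = [first] + re.findall(r'\w+', middle)

print(before)  # ['waiting', 'for', 'coffee']
print(after)   # ['waiting', 'for', 'coffee', 'and']
```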

score 2 · Accepted Answer

Since you've already got an answer on how to sort out your regex, I'd point out that in Python this kind of thing is usually easier to do, and more readable, with the built-in string methods, e.g.:

s = 'waiting for coffee, waiting for coffee and the charitable crumb.'
before, after = map(str.split, s.partition(',')[::2])
print(before[-4:], after[:4])
# ['waiting', 'for', 'coffee'] ['waiting', 'for', 'coffee', 'and']
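A variation on the same idea, as a rough sketch: the regex in the question also accepts a semicolon as the separator, and str.split() leaves punctuation attached to words (for example 'crumb.'). The helper below, whose name and count parameter are made up for illustration, splits once on the first comma or semicolon and then extracts words with \w+:

```python
import re

def words_around_separator(sentence, count=4):
    """Return up to `count` words before and after the first ',' or ';'."""
    parts = re.split(r'[,;]', sentence, maxsplit=1)
    if len(parts) < 2:
        # No separator found: nothing to report on either side.
        return [], []
    before, after = (re.findall(r'\w+', part) for part in parts)
    return before[-count:], after[:count]

s = 'waiting for coffee, waiting for coffee and the charitable crumb.'
print(words_around_separator(s))
# (['waiting', 'for', 'coffee'], ['waiting', 'for', 'coffee', 'and'])
```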

score 0 · Accepted Answer

When you already have a regex that long and convoluted, I strongly suggest you don't try to fix your problem by adding more regex. It will only end in tears. If you want to get rid of that empty group, I would consider just running:

filter(None, regex_return)

on the result you get back.

For example:

test = ('waiting', 'for', 'coffee', '', 'waiting', 'for', 'coffee', 'and')
# In Python 3, filter() returns a lazy iterator, so wrap it in tuple() to see the result
print(tuple(filter(None, test)))
# ('waiting', 'for', 'coffee', 'waiting', 'for', 'coffee', 'and')

Which I believe does what you want.
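One caveat worth noting: once the empty strings are filtered out of the flat tuple, you can no longer tell which words came before the comma and which came after. A small sketch of one way around that, assuming the eight-group pattern from the question, is to slice each tuple into its first four and last four groups before dropping the empty ones (results is just an illustrative name):

```python
import re

example = 'waiting for coffee, waiting for coffee and the charitable crumb.'
pattern = re.compile(
    r'(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?[,;]\s*'
    r'(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?'
)

results = []
for groups in pattern.findall(example):
    # Groups 1-4 sit before the separator, groups 5-8 after it.
    before = [word for word in groups[:4] if word]
    after = [word for word in groups[4:] if word]
    results.append((before, after))

print(results)
# [(['waiting', 'for', 'coffee'], ['waiting', 'for', 'coffee', 'and'])]
```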
