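You have optional groups:
(\b\w+\b)?
The question mark makes that an optional match. But Python always returns every group in the pattern, and any group that did not match anything gets an empty value (typically None):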
>>> import re
>>> example = 'waiting for coffee, waiting for coffee and the charitable crumb.'
>>> pattern = re.compile(r'(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?[,;]\s*(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?')
>>> pattern.search(example).groups()
('waiting', 'for', 'coffee', None, 'waiting', 'for', 'coffee', 'and')
Note the None in the output; that is the 4th word group before the comma, which matched nothing because there are only 3 words to match. You must have used .findall(), which explicitly returns strings, so the group that didn't match is represented as an empty string instead.
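To see the difference side by side, here is a small sketch. It uses a shortened pattern (only the part up to the comma), which is an illustrative assumption and not your full expression; the behaviour of the unmatched group is the same.

```python
import re

example = 'waiting for coffee, waiting for coffee and the charitable crumb.'

# Shortened, illustrative pattern: up to four optional words before the comma.
pattern = re.compile(
    r'(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?[,;]'
)

match = pattern.search(example)

# .groups() reports an unmatched group as None by default.
search_groups = match.groups()

# .findall() always produces strings, so the unmatched group becomes ''.
findall_first = pattern.findall(example)[0]

# Passing default='' to .groups() gives the same shape as .findall().
groups_as_strings = match.groups(default='')

print(search_groups)
print(findall_first)
```

If you prefer working with match objects but want strings throughout, match.groups(default='') is one way to get there.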
Remove the question marks, and your pattern won't match your input example until you add that required 4th word before the comma:
>>> pattern_required = re.compile(r'(\b\w+\b)[^a-z]*(\b\w+\b)[^a-z]*(\b\w+\b)[^a-z]*(\b\w+\b)[,;]\s*(\b\w+\b)[^a-z]*(\b\w+\b)[^a-z]*(\b\w+\b)[^a-z]*(\b\w+\b)')
>>> pattern_required.findall(example)
[]
>>> pattern_required.findall('Not ' + example)
[('Not', 'waiting', 'for', 'coffee', 'waiting', 'for', 'coffee', 'and')]
If you need to match between 2 and 4 words, but do not want empty groups, you'll have to make one group match multiple words. You cannot have a variable number of groups; regular expressions do not work like that.
Matching multiple words in one group:
>>> pattern_variable = re.compile(r'(\b\w+\b)[^a-z]*((?:\b\w+\b[^a-z]*){1,3})[,;]\s*(\b\w+\b)[^a-z]*(\b\w+\b)[^a-z]*(\b\w+\b)[^a-z]*(\b\w+\b)')
>>> pattern_variable.findall(example)
[('waiting', 'for coffee', 'waiting', 'for', 'coffee', 'and')]
>>> pattern_variable.findall('Not ' + example)
[('Not', 'waiting for coffee', 'waiting', 'for', 'coffee', 'and')]
Here the (?:...) syntax creates a non-capturing group, one that does not produce output in the .findall() list; it is used here so we can put a quantifier on it. {1,3} tells the regular expression we want the preceding group to be matched between 1 and 3 times.
Note the output; the second group contains a variable number of words (between 1 and 3).
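If you still want one word per item afterwards, you can split the multi-word group in a second step. The sketch below assumes that is what you are after; flatten_words is a hypothetical helper name, not something the re module provides.

```python
import re

example = 'waiting for coffee, waiting for coffee and the charitable crumb.'

pattern_variable = re.compile(
    r'(\b\w+\b)[^a-z]*((?:\b\w+\b[^a-z]*){1,3})[,;]\s*'
    r'(\b\w+\b)[^a-z]*(\b\w+\b)[^a-z]*(\b\w+\b)[^a-z]*(\b\w+\b)'
)

def flatten_words(groups):
    """Hypothetical helper: expand every group into its individual words."""
    words = []
    for group in groups:
        # A single-word group yields one word; the multi-word group yields 1 to 3.
        words.extend(re.findall(r'\w+', group))
    return words

short_words = flatten_words(pattern_variable.search(example).groups())
long_words = flatten_words(pattern_variable.search('Not ' + example).groups())

print(short_words)
print(long_words)
```

This keeps the regular expression at a fixed number of groups and moves the variable-length part into ordinary Python, where a list can have any length.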