python - Why does re.findall() find more matches than re.sub()?

Question

Consider the following:

>>> import re
>>> a = "first:second"
>>> re.findall("[^:]*", a)
['first', '', 'second', '']
>>> re.sub("[^:]*", r"(\g<0>)", a)
'(first):(second)'

re.sub()'s behavior makes more sense initially, but I can also understand re.findall()'s behavior. After all, you can match an empty string between first and : that consists only of non-colon characters (exactly zero of them), but why isn't re.sub() behaving the same way?

Shouldn't the result of the last command be (first)():(second)()?

score 9 · Accepted Answer

Your use of * allows empty matches:

'first'   -> matched
':'       -> not in the character class but, as the pattern can be empty due 
             to the *, an empty string is matched -->''
'second'  -> matched
'$'       -> can contain an empty string before,
             an empty string is matched -->''
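To see where each of those matches sits in the string, here is a minimal sketch that uses re.finditer() to list the span of every match. The variable names are only illustrative.

```python
import re

a = "first:second"
# Collect (start, end, text) for every match, including the empty ones.
spans = [(m.start(), m.end(), m.group(0)) for m in re.finditer("[^:]*", a)]
for start, end, text in spans:
    print(start, end, repr(text))
```

The two empty matches are the zero-width spans at index 5 (just before the colon) and at index 12 (the end of the string).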

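To quote the documentation for re.findall():

Empty matches are included in the result unless they touch the beginning of another match.

The reason you don't see the empty matches in the sub result is explained in the documentation for re.sub():

Empty matches for the pattern are replaced only when not adjacent to a previous match.

Try this: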

re.sub('(?:Choucroute garnie)*', '#', 'ornithorynque')

And now this:

print(re.sub('(?:nithorynque)*', '#', 'ornithorynque'))

There are no consecutive # characters.
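The outcome of the second call depends on the Python version. The re.sub() rule quoted above describes the older behavior; since Python 3.7, re.sub() also replaces an empty match that is adjacent to a previous non-empty match. A small sketch that runs both calls and prints the interpreter version alongside the results:

```python
import re
import sys

# The literal never occurs, so every position yields an empty match.
never = re.sub('(?:Choucroute garnie)*', '#', 'ornithorynque')
# 'nithorynque' matches once; the empty match at the very end is adjacent to it.
tail = re.sub('(?:nithorynque)*', '#', 'ornithorynque')

print(sys.version_info[:2])
print(never)  # value: '#o#r#n#i#t#h#o#r#y#n#q#u#e#'
print(tail)   # value: '#o#r#' before Python 3.7, '#o#r##' from 3.7 on
```

On Python 3.7 or later, the question's own re.sub() call likewise returns '(first)():(second)()'.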

score 3 · Accepted Answer

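For whatever reason, the algorithms for handling empty matches differ.

In the case of findall, it works like (an optimized version of) this: for each possible start index 0 <= i <= len(a), if the string matches at i, append the match; and avoid overlapping results with this rule: if there is a match of length m at i, don't look for the next match before i+m. Your example returns ['first', '', 'second', ''] because the empty matches are found immediately after first and second, but not after the colon, since looking for a match starting from that position returns the full string second.

In the case of sub, the difference, as you noticed, is that it explicitly ignores matches of length 0 that occur immediately after another match. While I can see why this might help avoid unexpected behavior in sub, I'm not sure why the difference exists (for example, why findall doesn't use the same rule).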
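The scan described for findall can be written out as a naive sketch. This is not CPython's implementation: naive_findall is a hypothetical helper, it assumes the pattern has no capturing groups, and it has only been checked against the inputs in this thread.

```python
import re

def naive_findall(pattern, string):
    """Naive sketch of the scan described above (not CPython's actual code).

    Assumes the pattern has no capturing groups, so each result is the whole match.
    """
    compiled = re.compile(pattern)
    results = []
    i = 0
    while i <= len(string):
        m = compiled.match(string, i)  # try to match starting exactly at index i
        if m is None:
            i += 1
            continue
        results.append(m.group(0))
        # Skip past the match; an empty match still advances by one character.
        i = m.end() if m.end() > i else i + 1
    return results

print(naive_findall("[^:]*", "first:second"))
```

For the question's input this produces the same list as re.findall(): the empty strings come from index 5 and index 12, while index 6 yields the whole of second.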
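The rule described for sub can be emulated on top of re.finditer(). This is only a sketch: sub_skipping_adjacent_empty is a hypothetical helper, not part of the re module, and it takes the replacement as a function of the match object to keep the code short.

```python
import re

def sub_skipping_adjacent_empty(pattern, repl, string):
    """Sketch of the rule described above; `repl` is a function taking a match.

    Hypothetical helper, not part of the re module.
    """
    pieces = []
    last_end = 0      # end of the text copied so far
    prev_end = None   # end of the previous match, replaced or skipped
    for m in re.finditer(pattern, string):
        adjacent_empty = m.start() == m.end() == prev_end
        prev_end = m.end()
        if adjacent_empty:
            continue  # ignore a zero-length match right after another match
        pieces.append(string[last_end:m.start()])
        pieces.append(repl(m))
        last_end = m.end()
    pieces.append(string[last_end:])
    return "".join(pieces)

result = sub_skipping_adjacent_empty(
    "[^:]*", lambda m: "(" + m.group(0) + ")", "first:second")
print(result)
```

Because the skipping is done explicitly, the helper gives '(first):(second)' regardless of how the interpreter's own re.sub() treats adjacent empty matches.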

score 1 · Accepted Answer

import re
a = "first:second:three"
print(re.findall("[^:]*", a))

returns all substrings that match the pattern; here it gives

>>> 
['first', '', 'second', '', 'three', '']

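sub() is for substitution, and replaces the leftmost non-overlapping occurrences of the pattern with your replacement. For example: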

import re
a = "first:second:three"
print(re.sub("[^:]*", r"smile", a))

gives

>>> 
smile:smile:smile

You can limit the number of occurrences replaced with the fourth argument, count.
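A minimal sketch of count, passed here as a keyword argument. The [^:]+ variant is an assumption added for illustration: it rules out empty matches, so the count=2 result does not depend on how empty matches are handled.

```python
import re

a = "first:second:three"
# count=1 stops after the first replacement.
once = re.sub("[^:]*", "smile", a, count=1)
# With + instead of *, empty matches cannot occur, so count=2 replaces two fields.
twice = re.sub("[^:]+", "smile", a, count=2)
print(once)   # value: 'smile:second:three'
print(twice)  # value: 'smile:smile:three'
```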
