python - Unexpected behavior of a Python regular expression

Question

import re

str1='<a href="/states/florida/433" title="florida"><img alt="florida" src="http://abc.com"'
str2='<a href="/states/florida/433" title="florida">'
pat = re.compile('/states/.*/([^"]+)')
if pat.findall(str2) == pat.findall(str1):
    print("TRUE")
else:
    print("FALSE")

Output: FALSE

Output for str2: 433
Output for str1: abc.com

Can someone explain this?

score 3 · Accepted Answer

Use the reluctant quantifier .*? instead of the greedy quantifier .* and everything will be fine:

pat = re.compile('/states/.*?/([^"]+)')

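By default, quantifiers are greedy: they try to consume as much of the string as possible while still leaving enough for the rest of the pattern to match. Putting ? after a quantifier makes it reluctant, so it consumes as little as possible and stops at the first place where the next part of the pattern can match, which here is the first /.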
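As a minimal sketch of the difference, the snippet below runs both quantifiers against the two strings from the question. The names `greedy` and `lazy` are illustrative and do not appear in the original post.

```python
import re

str1 = '<a href="/states/florida/433" title="florida"><img alt="florida" src="http://abc.com"'
str2 = '<a href="/states/florida/433" title="florida">'

# Illustrative names, not from the original post.
greedy = re.compile(r'/states/.*/([^"]+)')   # .* grabs everything, then backtracks to the LAST usable /
lazy = re.compile(r'/states/.*?/([^"]+)')    # .*? grows one character at a time, up to the FIRST /

print(greedy.findall(str1))  # ['abc.com']
print(greedy.findall(str2))  # ['433']
print(lazy.findall(str1))    # ['433']
print(lazy.findall(str2))    # ['433']
```

With the reluctant version both calls return the same list, so the comparison in the question would print TRUE.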

score 1 · Accepted Answer

On the first string, your regex matches all the way to the end of the string:

<a href="/states/florida/433" title="florida"><img alt="florida" src="http://abc.com
         /states/                                .*                         /([^"]+)

and not

<a href="/states/florida/433" title="florida"><img alt="florida" src="http://abc.com
         /states/  .*   /([^"]+)

Quantifiers are greedy: .* eats as much of the data as it can.
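One way to see exactly what `.*` ate is to wrap it in its own capturing group. This is only a diagnostic sketch: the extra group and the names `greedy_probe` and `lazy_probe` are assumptions added for illustration, not part of the question's pattern.

```python
import re

str1 = '<a href="/states/florida/433" title="florida"><img alt="florida" src="http://abc.com"'

# Hypothetical probe patterns: the first group exists only to expose
# what the quantifier consumed; the second group is the original one.
greedy_probe = re.compile(r'/states/(.*)/([^"]+)')
lazy_probe = re.compile(r'/states/(.*?)/([^"]+)')

greedy_groups = greedy_probe.search(str1).groups()
lazy_groups = lazy_probe.search(str1).groups()

print(greedy_groups)
# ('florida/433" title="florida"><img alt="florida" src="http:/', 'abc.com')
print(lazy_groups)
# ('florida', '433')
```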

score 1 · Accepted Answer

Your regex is working exactly as written:

<a href="/states/florida/433" title="florida"><img alt="florida" src="http://abc.com"
         ^^^^^^^^............................................................^^^^^^^
         /states/                      .*/                                     [^"]+

And:

<a href="/states/florida/433" title="florida">
         ^^^^^^^^........^^^

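If you don't want the match in the first case to run to the end of the string, add the non-greedy modifier ? to the quantifier so that the pattern means "/states/ followed by any number of characters up to the first / character, followed by one or more non-quote characters".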
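The extent of each match can be checked with `group(0)`, which returns the full matched text rather than just the captured group. A small sketch, with `greedy` and `lazy` as illustrative names:

```python
import re

str1 = '<a href="/states/florida/433" title="florida"><img alt="florida" src="http://abc.com"'
str2 = '<a href="/states/florida/433" title="florida">'

# Illustrative names, not from the original post.
greedy = re.compile(r'/states/.*/([^"]+)')
lazy = re.compile(r'/states/.*?/([^"]+)')

# group(0) is the whole match; group(1) is the captured part.
print(greedy.search(str1).group(0))
# /states/florida/433" title="florida"><img alt="florida" src="http://abc.com
print(lazy.search(str1).group(0))   # /states/florida/433
print(lazy.search(str2).group(0))   # /states/florida/433
```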

score 0 · Accepted Answer

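Your pattern is greedy (you can read about greedy and non-greedy regex patterns here: http://docs.python.org/2/library/re.html and here: http://www.itworld.com/nl/perl/01112001). Changing the pattern from

'/states/.*/([^"]+)'

to

'/states/.*?/([^"]+)'

makes the script print TRUE. Here is the complete modified source: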

import re

str1='<a href="/states/florida/433" title="florida"><img alt="florida" src="http://abc.com"'
str2='<a href="/states/florida/433" title="florida">'
# .*? is non-greedy, so it stops at the first / after /states/
pat = re.compile('/states/.*?/([^"]+)')
if pat.findall(str2) == pat.findall(str1):
    print("TRUE")
else:
    print("FALSE")
