Restrict the link pattern to non-quote characters, so the captured href cannot run past its closing quote:
re.compile('<a href="([^"]+?)">See full summary</a>', re.DOTALL | re.IGNORECASE)
This gives:
>>> import re
>>> patt = re.compile('<a href="([^"]+?)">See full summary</a>', re.DOTALL | re.IGNORECASE)
>>> patt.findall('<a href="link">text</a> <a href="correctLink">See full summary</a>')
['correctLink']
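To see why the character class matters, here is a small comparison on the same input. The `loose` pattern is a hypothetical stand-in for a pattern that uses `.+?` in the capture group; it is an assumption for illustration, not something quoted from the question.

```python
import re

html = '<a href="link">text</a> <a href="correctLink">See full summary</a>'

# Hypothetical looser pattern: "." may cross quotes and tag boundaries,
# so the match can start at the first <a> and swallow everything up to
# the wanted link text.
loose = re.compile(r'<a href="(.+?)">See full summary</a>', re.DOTALL | re.IGNORECASE)

# Pattern from the answer: the capture cannot extend past a double quote.
strict = re.compile(r'<a href="([^"]+?)">See full summary</a>', re.DOTALL | re.IGNORECASE)

loose_result = loose.findall(html)
strict_result = strict.findall(html)

print(loose_result)   # capture starts in the first link and spans both tags
print(strict_result)  # only the href of the wanted link
```

Making the quantifier lazy does not help the loose pattern here: the regex engine still starts at the leftmost `<a href="` and expands until the rest of the pattern fits.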
Better still, use a proper HTML parser.
With BeautifulSoup, finding that link is as easy as:
soup.find('a', text='See full summary')['href']
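This matches on the exact link text (newer Beautiful Soup releases spell the argument string=; text= is the older name):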
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<a href="link">text</a> <a href="correctLink">See full summary</a>', 'html.parser')
>>> soup.find('a', text='See full summary')['href']
u'correctLink'
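If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's `html.parser` on Python 3. The class name `LinkByTextFinder` and the helper `find_hrefs` are hypothetical names made up for this example, and the sketch assumes simple, non-nested `<a>` tags.

```python
from html.parser import HTMLParser


class LinkByTextFinder(HTMLParser):
    """Collect href values of <a> tags whose text equals `wanted` exactly."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.hrefs = []
        self._in_link = False  # True while inside an open <a> tag
        self._href = None      # href of the <a> currently open, if any
        self._text = []        # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            # Compare the accumulated link text with the wanted text.
            if "".join(self._text) == self.wanted and self._href is not None:
                self.hrefs.append(self._href)
            self._in_link = False


def find_hrefs(html, wanted):
    """Return the hrefs of all links whose text is exactly `wanted`."""
    finder = LinkByTextFinder(wanted)
    finder.feed(html)
    finder.close()
    return finder.hrefs


sample = '<a href="link">text</a> <a href="correctLink">See full summary</a>'
print(find_hrefs(sample, "See full summary"))
```

This is noticeably more code than the BeautifulSoup one-liner, which is the main reason to prefer BeautifulSoup when it is available.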