python - Matching the HTML greater-than character with a regular expression in Python

Question

I am trying to use re.compile to match a value on a web page.

My page contains the following HTML:

<div id="paginate">
&nbsp;<strong>1</strong>
&nbsp;<a href="http://www.link2.com/">2</a>
&nbsp;<a href="http://www.link3.com/">3</a>
&nbsp;<a href="http://www.link2.com">&gt;</a>
&nbsp;&nbsp;<a href="http://www.link20.com/">Last &rsaquo;</a>
</div>

My regular expression is:

re.compile('<a href="(.+?)">&gt;</a>').findall(html)

This returns:

['http://www.link2.com/">2</a>
&nbsp;<a href="http://www.link3.com">3</a>
&nbsp;<a href="http://www.link2.com/']

I only want the href of the link whose label is the greater-than sign.

Any ideas?

Thanks in advance.

score 2 · Accepted Answer

Just use re.findall():

>>> re.findall('<a href="(.+?)">&gt;</a>', html)
['http://www.link4.com']
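One plausible reason the attempt in the question over-matched: a lazy `.+?` still begins at the first `<a href="` it finds and keeps growing until the rest of the pattern matches. If `.` is allowed to cross line breaks (for example with `re.DOTALL`, or when the page has no newlines between links), the capture spans several anchors. The following is a minimal sketch, assuming the HTML from the question is stored in a variable named `html`. Restricting the group to `[^"]+` keeps the capture inside a single attribute value.

```python
import re

# HTML copied from the question; the variable name `html` is an assumption.
html = '''<div id="paginate">
&nbsp;<strong>1</strong>
&nbsp;<a href="http://www.link2.com/">2</a>
&nbsp;<a href="http://www.link3.com/">3</a>
&nbsp;<a href="http://www.link2.com">&gt;</a>
&nbsp;&nbsp;<a href="http://www.link20.com/">Last &rsaquo;</a>
</div>'''

# With re.DOTALL, "." also matches newlines, so the lazy group starts at the
# first <a href=" and runs across several links before the tail matches.
overshoot = re.findall(r'<a href="(.+?)">&gt;</a>', html, re.DOTALL)

# [^"]+ cannot cross the closing quote, so the capture stays in one href.
next_links = re.findall(r'<a href="([^"]+)">&gt;</a>', html)

print(len(overshoot), overshoot[0].count('\n'))
print(next_links)
```

The character class is slightly stricter than `.+?` and does not depend on whether the document contains newlines, which makes it less sensitive to how the page was fetched or flattened.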

Note that you really should be parsing HTML with an HTML parser and not regex. I suggest BeautifulSoup:

>>> from bs4 import BeautifulSoup as BS
>>> soup = BS(html, 'html.parser')
>>> print(soup.find('a', string='>'))
<a href="http://www.link4.com">&gt;</a>
>>> print(soup.find('a', string='>')['href'])
http://www.link4.com
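If installing BeautifulSoup is not an option, the same parser-based idea can be sketched with the standard library's `html.parser`. This is only an illustration under stated assumptions: the class name `NextLinkFinder` and the shortened sample `html` string are hypothetical and not part of the original answer.

```python
from html.parser import HTMLParser


class NextLinkFinder(HTMLParser):
    """Collect the href of every <a> whose text is exactly '>'."""

    def __init__(self):
        # The default convert_charrefs=True decodes &gt; to '>' in text data.
        super().__init__()
        self.current_href = None
        self.text_parts = []
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.current_href = dict(attrs).get('href')
            self.text_parts = []

    def handle_data(self, data):
        # Only collect text while inside an <a> that has an href.
        if self.current_href is not None:
            self.text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self.current_href is not None:
            if ''.join(self.text_parts).strip() == '>':
                self.hrefs.append(self.current_href)
            self.current_href = None


# Shortened sample based on the question's markup (an assumption).
html = ('<div id="paginate">'
        '&nbsp;<a href="http://www.link3.com/">3</a>'
        '&nbsp;<a href="http://www.link2.com">&gt;</a>'
        '&nbsp;&nbsp;<a href="http://www.link20.com/">Last &rsaquo;</a>'
        '</div>')

finder = NextLinkFinder()
finder.feed(html)
finder.close()
print(finder.hrefs)
```

Because the parser decodes character references before the text is compared, the check is made against the literal `>` rather than the `&gt;` entity, so it does not depend on how the page happens to escape the character.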
