python - How to use the `daringfireball` regex with re.findall()?

Question

I have been using the code below to extract URLs from an HTML page with the daringfireball regex (http://daringfireball.net/2010/07/improved_regex_for_matching_urls), i.e.

(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

The regex works amazingly well, but re.findall() takes a very long time. Is there any way I can get all the URLs in the HTML quickly?

import urllib, re

seed = "http://web.archive.org/web/20100412111652/http://app.singaporeedu.gov.sg/asp/index.asp"

page = urllib.urlopen(seed).read().decode('utf8')
#print page

pattern = r'''(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))'''

match = re.search(pattern,page)
print match.group(0)

matches = re.findall(pattern,page) # this line takes more than 3 mins on my i3 laptop
print matches

score 1 · Accepted Answer

Yes. Don't use a regex at all. Use an HTML parser such as BeautifulSoup. That is what they are for.

>>> from bs4 import BeautifulSoup as BS
>>> import urllib2
>>> seed = "http://web.archive.org/web/20100412111652/http://app.singaporeedu.gov.sg/asp/index.asp"
>>> soup = BS(urllib2.urlopen(seed))
>>> print soup.find_all('a')
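
Note that find_all('a') returns the tag objects rather than the URLs themselves. As a rough sketch of the same parser-based idea, here is a Python 3 version that pulls out only the href values. It swaps in the standard library's html.parser in place of BeautifulSoup so it runs without extra installs; the names LinkCollector and extract_links and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the href values of all <a> tags in the given HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Hypothetical sample input, not taken from the page in the question.
sample = (
    '<p><a href="http://example.com/a">A</a> '
    '<a class="x" href="/b?c=1">B</a> '
    '<a name="top">no href</a></p>'
)
print(extract_links(sample))
```

With BeautifulSoup itself, the equivalent step would be to read the href attribute of each tag returned by find_all('a').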

score 0 · Accepted Answer

Do you just want all the URLs on the page? Wouldn't a simple regex like this be enough?

<a[^>]*href="([^"]+)"[^>]*>
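
A minimal sketch of how that pattern could be used with re.findall() in Python 3. The sample HTML is invented for illustration. The pattern only captures double-quoted href values, so the single-quoted link in the sample is skipped, and because it starts with a bare `<a` it would also match other tags whose names begin with "a" if they carry an href.

```python
import re

# The pattern from the answer above, compiled once for reuse.
HREF_RE = re.compile(r'<a[^>]*href="([^"]+)"[^>]*>')

# Hypothetical sample input, not taken from the page in the question.
sample = (
    '<a href="http://example.com/a">A</a> '
    '<a class="x" href="/b">B</a> '
    "<a href='/single'>C</a>"
)

# With a single capture group, findall() returns the captured strings.
links = HREF_RE.findall(sample)
print(links)
```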
