python - How do I find and extract links from a web page?

Question

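I have a website, e.g. http://site.com

I want to fetch the home page and extract only the links that match a regular expression, e.g. .*somepage.*

Links in the HTML code can take any of these forms: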

<a href="http://site.com/my-somepage">url</a> 
<a href="/my-somepage.html">url</a> 
<a href="my-somepage.htm">url</a>

I need the output in this format:

http://site.com/my-somepage
http://site.com/my-somepage.html
http://site.com/my-somepage.htm

The output URLs must always include the domain name.

What is a fast Python solution?

score 2 · Accepted Answer

You can use lxml.html:

from lxml import html

url = "http://site.com"
doc = html.parse(url).getroot() # download & parse webpage
doc.make_links_absolute(url)
for element, attribute, link, _ in doc.iterlinks():
    if (attribute == 'href' and element.tag == 'a' and
        'somepage' in link): # or e.g., re.search('somepage', link)
        print(link)

Or do the same with beautifulsoup4:

import re
try:
    from urllib2 import urlopen
    from urlparse import urljoin
except ImportError: # Python 3
    from urllib.parse import urljoin
    from urllib.request import urlopen

from bs4 import BeautifulSoup, SoupStrainer # pip install beautifulsoup4

url = "http://site.com"
only_links = SoupStrainer('a', href=re.compile('somepage'))
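# name the parser explicitly to avoid bs4's "no parser was explicitly specified" warning
soup = BeautifulSoup(urlopen(url), 'html.parser', parse_only=only_links)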
urls = [urljoin(url, a['href']) for a in soup(only_links)]
print("\n".join(urls))

score 1 · Accepted Answer

Use an HTML parsing module such as BeautifulSoup.
Some code (only a partial solution):

from bs4 import BeautifulSoup
import re

html = '''<a href="http://site.com/my-somepage">url</a> 
<a href="/my-somepage.html">url</a> 
<a href="my-somepage.htm">url</a>'''
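soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
# keep only <a> tags whose href matches the regular expression
links = soup.find_all('a', {'href': re.compile('.*somepage.*')})
for link in links:
    print(link['href'])  # print() call works on Python 2 and 3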

Output:

http://site.com/my-somepage
/my-somepage.html
my-somepage.htm

From this data you should be able to build the output format you want...
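The hrefs printed above are still a mix of absolute and relative URLs. One way to turn them into the format the question asks for is the standard library's urljoin. The following is a minimal sketch, assuming the page was fetched from http://site.com; the helper name absolute_matching_links and the extra /about.html entry are made up for illustration and are not part of the original answer.

```python
import re
from urllib.parse import urljoin

base_url = "http://site.com"  # assumption: the URL the page was fetched from

# hrefs as they appear in the HTML; "/about.html" is a hypothetical
# non-matching link added to show the filter at work
hrefs = [
    "http://site.com/my-somepage",
    "/my-somepage.html",
    "my-somepage.htm",
    "/about.html",
]

pattern = re.compile(r"somepage")


def absolute_matching_links(base, hrefs, pattern):
    """Keep hrefs matching the pattern and resolve them against base."""
    return [urljoin(base, href) for href in hrefs if pattern.search(href)]


result = absolute_matching_links(base_url, hrefs, pattern)
print("\n".join(result))
```

urljoin leaves already-absolute URLs unchanged, so the same call handles all three link forms from the question.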

score 1 · Accepted Answer

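Scrapy is the simplest way to do what you want. It has a link extraction mechanism built in.

Let me know if you need help writing a spider to crawl the links.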
