I'm trying to learn how to use asyncio to build an asynchronous web crawler. Here's a rough crawler to test out the framework:
import asyncio, aiohttp
from bs4 import BeautifulSoup

@asyncio.coroutine
def fetch(url):
    # Limit concurrent requests with the global semaphore
    with (yield from sem):
        print(url)
        response = yield from aiohttp.request('GET', url)
        response = yield from response.read_and_close()
        return response.decode('utf-8')

@asyncio.coroutine
def get_links(url):
    page = yield from fetch(url)
    soup = BeautifulSoup(page, 'html.parser')
    links = soup.find_all('a', href=True)
    return [link['href'] for link in links if link['href'].find('www') != -1]

@asyncio.coroutine
def crawler(seed, depth, max_depth=3):
    while True:
        if depth > max_depth:
            break
        links = yield from get_links(seed)
        depth += 1
        coros = [asyncio.Task(crawler(link, depth)) for link in links]
        yield from asyncio.gather(*coros)

sem = asyncio.Semaphore(5)
loop = asyncio.get_event_loop()
loop.run_until_complete(crawler("http://www.bloomberg.com", 0))
While asyncio seems to be well documented, aiohttp appears to have very little documentation, so I'm struggling to work some things out on my own. First, is there a way to detect the encoding of a page response? Second, can I ask connections to be kept alive within a session, or is that the default, as it is in requests?
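For the encoding question, the only idea I've had so far is a stdlib-only sketch that parses the charset out of the response's Content-Type header (assuming the server actually sends one; `charset_from_content_type` is just my own helper name, not an aiohttp API):

    from email.message import Message

    def charset_from_content_type(value, default='utf-8'):
        # Reuse the stdlib email parser to handle "text/html; charset=gbk"
        msg = Message()
        msg['Content-Type'] = value
        return msg.get_content_charset() or default

    print(charset_from_content_type('text/html; charset=gbk'))   # gbk
    print(charset_from_content_type('text/html'))                # utf-8

This obviously does nothing for pages that only declare their encoding in a `<meta>` tag, so I'd still like to know if aiohttp has something built in.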