python - asyncio web scraping 101: fetching multiple urls with aiohttp

Question

In an earlier question, one of the authors of aiohttp suggested a nice way to fetch multiple urls with aiohttp, using the new async with syntax from Python 3.5:

import aiohttp
import asyncio

async def fetch(session, url):
    with aiohttp.Timeout(10):
        async with session.get(url) as response:
            return await response.text()

async def fetch_all(session, urls, loop):
    results = await asyncio.wait([loop.create_task(fetch(session, url))
                                  for url in urls])
    return results

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    # breaks because of the first url
    urls = ['http://SDFKHSKHGKLHSKLJHGSDFKSJH.com',
            'http://google.com',
            'http://twitter.com']
    with aiohttp.ClientSession(loop=loop) as session:
        the_results = loop.run_until_complete(
            fetch_all(session, urls, loop))
        # do something with the the_results

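However, when one of the session.get(url) requests fails (as above, because of http://SDFKHSKHGKLHSKLJHGSDFKSJH.com), the error is not handled and the whole thing breaks.

I looked for ways to insert tests on the result of session.get(url), for instance a place for a try ... except ..., or for an if response.status != 200:, but I just don't understand how to work with async with, await and the various objects.

Since async with is still very new, there are not many examples. It would be very helpful to many people if an asyncio wizard could show how to do this. After all, one of the first things most people will want to test with asyncio is fetching multiple resources concurrently.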

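Goal

The goal is that we can inspect the_results and quickly see either:

this url failed (and why: status code, maybe the exception name), or
this url worked, and here is a useful response object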
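
To make that goal concrete, here is a minimal sketch of what one entry of the_results could look like. The names FetchOutcome and describe are hypothetical and are not part of aiohttp or asyncio; the sketch only assumes that each url ends up with either a status code, an exception name, or both missing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FetchOutcome:
    """Hypothetical record describing what happened for one url."""
    url: str
    status: Optional[int] = None   # HTTP status code, if a response arrived
    error: Optional[str] = None    # exception class name, if the request raised
    body: Optional[str] = None     # response text on success

    @property
    def ok(self) -> bool:
        # Treat only "no exception and status 200" as success.
        return self.error is None and self.status == 200


def describe(outcome: FetchOutcome) -> str:
    """Render one line per url: either OK or the reason for the failure."""
    if outcome.ok:
        return '{}: OK ({} chars)'.format(outcome.url, len(outcome.body or ''))
    reason = outcome.error or 'status {}'.format(outcome.status)
    return '{}: FAILED ({})'.format(outcome.url, reason)
```

With a list of such records, a quick look at the results is just a loop that prints describe(outcome) for each one.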

score 25 · Accepted Answer

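I would use gather instead of wait, because it can return exceptions as objects instead of raising them. You can then check each result to see whether it is an instance of some exception.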

import aiohttp
import asyncio

async def fetch(session, url):
    with aiohttp.Timeout(10):
        async with session.get(url) as response:
            return await response.text()

async def fetch_all(session, urls, loop):
    results = await asyncio.gather(
        *[fetch(session, url) for url in urls],
        return_exceptions=True  # default is false, that would raise
    )

    # for testing purposes only
    # gather returns results in the order of coros
    for idx, url in enumerate(urls):
        print('{}: {}'.format(url, 'ERR' if isinstance(results[idx], Exception) else 'OK'))
    return results

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    # breaks because of the first url
    urls = [
        'http://SDFKHSKHGKLHSKLJHGSDFKSJH.com',
        'http://google.com',
        'http://twitter.com']
    with aiohttp.ClientSession(loop=loop) as session:
        the_results = loop.run_until_complete(
            fetch_all(session, urls, loop))

Test:

$ python test.py
http://SDFKHSKHGKLHSKLJHGSDFKSJH.com: ERR
http://google.com: OK
http://twitter.com: OK
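
The same pattern can be tried without any network access. The sketch below replaces the real request with a stand-in coroutine, fake_fetch, which is an assumption made purely for illustration, and then splits the gathered results into successes and failures. split_results is likewise a hypothetical helper. It assumes Python 3.7+ for asyncio.run.

```python
import asyncio


async def fake_fetch(url):
    """Stand-in for a real fetch(): fails for one url, succeeds otherwise."""
    await asyncio.sleep(0)  # yield to the event loop, like real I/O would
    if 'SDFKHSKHGKLHSKLJHGSDFKSJH' in url:
        raise OSError('Name or service not known')
    return '<html>{}</html>'.format(url)


async def fetch_all(urls):
    # return_exceptions=True: exceptions come back as values, in input order.
    results = await asyncio.gather(*(fake_fetch(u) for u in urls),
                                   return_exceptions=True)
    return dict(zip(urls, results))


def split_results(by_url):
    """Separate successes from failures, keeping the exception class name."""
    ok = {u: r for u, r in by_url.items() if not isinstance(r, Exception)}
    failed = {u: type(r).__name__ for u, r in by_url.items()
              if isinstance(r, Exception)}
    return ok, failed


urls = ['http://SDFKHSKHGKLHSKLJHGSDFKSJH.com',
        'http://google.com',
        'http://twitter.com']
ok, failed = split_results(asyncio.run(fetch_all(urls)))
print('ok:', sorted(ok))
print('failed:', failed)
```

Because gather keeps the results in the order of its arguments, zipping them with urls is enough to know which url each result or exception belongs to.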

score 9 · Accepted Answer

I am far from an asyncio expert, but to handle the failure you need to catch a socket error:

import socket

async def fetch(session, url):
    with aiohttp.Timeout(10):
        try:
            async with session.get(url) as response:
                print(response.status == 200)
                return await response.text()
        except socket.error as e:
            print(e.strerror)
Running the code and printing the_results:

Cannot connect to host sdfkhskhgklhskljhgsdfksjh.com:80 ssl:False [Can not connect to sdfkhskhgklhskljhgsdfksjh.com:80 [Name or service not known]]
True
True
({<Task finished coro=<fetch() done, defined at <ipython-input-7-535a26aaaefe>:5> result='<!DOCTYPE ht...y>\n</html>\n'>, <Task finished coro=<fetch() done, defined at <ipython-input-7-535a26aaaefe>:5> result=None>, <Task finished coro=<fetch() done, defined at <ipython-input-7-535a26aaaefe>:5> result='<!doctype ht.../body></html>'>}, set())

You can see that we catch the error, and that the subsequent calls still succeed and return the html.
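
The tuple printed above is what asyncio.wait returns: a set of finished tasks and a set of still-pending ones, in no particular order. As a rough sketch of how to read it, the code below keeps a task-to-url mapping and asks each finished task for its exception or result. fake_fetch is a hypothetical stand-in for the real request so that the example runs offline, and asyncio.create_task and asyncio.run assume Python 3.7+.

```python
import asyncio


async def fake_fetch(url):
    """Stand-in for fetch(): raises for one host, returns text otherwise."""
    await asyncio.sleep(0)
    if 'sdfkhskhgklhskljhgsdfksjh' in url.lower():
        raise OSError('Name or service not known')
    return '<html>{}</html>'.format(url)


async def fetch_all(urls):
    # Keep a task -> url mapping, because wait() returns unordered sets.
    tasks = {asyncio.create_task(fake_fetch(u)): u for u in urls}
    done, pending = await asyncio.wait(list(tasks))
    outcomes = {}
    for task in done:
        exc = task.exception()  # None if the task finished normally
        outcomes[tasks[task]] = exc if exc is not None else task.result()
    return outcomes, pending
```

Calling task.exception() also marks the exception as retrieved, which avoids the "Task exception was never retrieved" warning for failed tasks.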

We should probably catch OSError instead, since socket.error has been a deprecated alias of OSError since Python 3.3:

async def fetch(session, url):
    with aiohttp.Timeout(10):
        try:
            async with session.get(url) as response:
                return await response.text()
        except OSError as e:
            print(e)
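
The alias claim is easy to check with the standard library alone. The following sketch also shows that name-resolution errors fall under the same except clause. classify is a hypothetical helper written only for this demonstration.

```python
import socket

# Since Python 3.3, socket.error is literally the same class as OSError.
assert socket.error is OSError

# Name-resolution failures (socket.gaierror) are OSError subclasses as well,
# so a single `except OSError` clause covers them.
assert issubclass(socket.gaierror, OSError)


def classify(exc):
    """Return the exception's class name if `except OSError` would catch it."""
    try:
        raise exc
    except OSError as e:
        return type(e).__name__
    except Exception:
        return None


print(classify(socket.gaierror(-2, 'Name or service not known')))
print(classify(ValueError('not a socket problem')))
```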

If you also want to check that the response status is 200, put your if inside the try as well; you can use the reason attribute to get more information:

async def fetch(session, url):
    with aiohttp.Timeout(10):
        try:
            async with session.get(url) as response:
                if response.status != 200:
                    print(response.reason)
                return await response.text()
        except OSError as e:
            print(e.strerror)
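
Printing inside fetch works for debugging, but the caller then only sees None for a failed url. One possible variation, sketched here under assumptions, is to return the status and the reason so that the_results carries them. FakeSession and FakeResponse are hypothetical stand-ins that expose only get(), status, reason and text(), so the example runs without aiohttp or a network; the timeout wrapper is left out for brevity.

```python
import asyncio


async def fetch(session, url):
    """Return (url, status, detail): body text on 200, else reason or error."""
    try:
        async with session.get(url) as response:
            if response.status != 200:
                return url, response.status, response.reason
            return url, response.status, await response.text()
    except OSError as e:
        # No response at all (DNS failure, refused connection, ...).
        return url, None, '{}: {}'.format(type(e).__name__, e)


class FakeResponse:
    """Hypothetical stand-in exposing only status, reason and text()."""
    def __init__(self, status, reason, body=''):
        self.status, self.reason, self._body = status, reason, body

    async def text(self):
        return self._body

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc_info):
        return False


class FakeSession:
    """Hypothetical stand-in for a client session; get() needs no network."""
    def __init__(self, responses):
        self._responses = responses

    def get(self, url):
        response = self._responses.get(url)
        if response is None:
            raise OSError('Name or service not known')
        return response
```

Since fetch only touches session.get, response.status, response.reason and response.text(), the same function should also accept a real client session, though that is not exercised here.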
