python - Async trio way to solve Hettinger's example

Question

Raymond Hettinger gave a talk on concurrency in Python, where one of the examples looked like this:

import urllib.request

sites = [
    'https://www.yahoo.com/',
    'http://www.cnn.com',
    'http://www.python.org',
    'http://www.jython.org',
    'http://www.pypy.org',
    'http://www.perl.org',
    'http://www.cisco.com',
    'http://www.facebook.com',
    'http://www.twitter.com',
    'http://www.macrumors.com/',
    'http://arstechnica.com/',
    'http://www.reuters.com/',
    'http://abcnews.go.com/',
    'http://www.cnbc.com/',
]

for url in sites:
    with urllib.request.urlopen(url) as u:
        page = u.read()
        print(url, len(page))

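Essentially, we fetch each of these links and print the number of bytes received. It takes about 20 seconds to run.

Today I found the trio library, which has a very friendly API. But when I tried to use it with this rather basic example, I couldn't get it right.

First attempt (runs in about the same 20 seconds):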

import urllib.request
import trio, time

sites = [
    'https://www.yahoo.com/',
    'http://www.cnn.com',
    'http://www.python.org',
    'http://www.jython.org',
    'http://www.pypy.org',
    'http://www.perl.org',
    'http://www.cisco.com',
    'http://www.facebook.com',
    'http://www.twitter.com',
    'http://www.macrumors.com/',
    'http://arstechnica.com/',
    'http://www.reuters.com/',
    'http://abcnews.go.com/',
    'http://www.cnbc.com/',
]


async def show_len(sites):
    t1 = time.time()
    for url in sites:
        with urllib.request.urlopen(url) as u:
            page = u.read()
            print(url, len(page))
    print("code took to run", time.time() - t1)

if __name__ == "__main__":
    trio.run(show_len, sites)

Second attempt (same speed):

import urllib.request
import trio, time

sites = [
    'https://www.yahoo.com/',
    'http://www.cnn.com',
    'http://www.python.org',
    'http://www.jython.org',
    'http://www.pypy.org',
    'http://www.perl.org',
    'http://www.cisco.com',
    'http://www.facebook.com',
    'http://www.twitter.com',
    'http://www.macrumors.com/',
    'http://arstechnica.com/',
    'http://www.reuters.com/',
    'http://abcnews.go.com/',
    'http://www.cnbc.com/',
]

async def link_user(url):
    with urllib.request.urlopen(url) as u:
        page = u.read()
        print(url, len(page))

async def show_len(sites):
    t1 = time.time()
    for url in sites:
        await link_user(url)
    print("code took to run", time.time() - t1)


if __name__ == "__main__":
    trio.run(show_len, sites)

So how should this example be handled with trio?

score 28 · Accepted Answer

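Two things:

First, the point of async is concurrency. It doesn't magically make things faster; it just provides a toolkit for doing multiple things at the same time (which may be faster than doing them sequentially). If you want things to happen concurrently, you need to ask for it explicitly. In trio, you do this by creating a nursery and then calling its start_soon method. For example: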

async def show_len(sites):
    t1 = time.time()
    async with trio.open_nursery() as nursery:
        for url in sites:
            nursery.start_soon(link_user, url)
    print("code took to run", time.time() - t1)

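But if you make this change and run the code, you'll see that it still isn't any faster. Why not? To answer this, we need to back up a little and understand the basic idea of "async" concurrency. In async code we can have concurrent tasks, but trio only actually runs one of them at any given time. So you can't have two tasks doing something at the same time. But you can have two (or more) tasks sitting and waiting at the same time. In a program like this, most of the time spent on an HTTP request goes to waiting for the response to come back, so concurrent tasks can give a speedup: we start all the tasks, then one runs for a moment to send its request and stops to wait for the response; while it's waiting, the next one runs for a moment, sends its request, and stops to wait; while that one is waiting, the next one runs... you get the idea.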

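Well, actually, in Python everything I've said so far applies to threads too, because the GIL means that even if you have multiple threads, only one of them can actually be running Python code at a time.

The big difference in Python between async concurrency and thread-based concurrency is that with threads, the interpreter can pause any thread at any time and switch to running another one. With async concurrency, we only switch between tasks at specific points that are marked in the source code: that's what the await keyword is for. It shows where a task might be paused to let another task run. The advantage is that it's much easier to reason about your program, because there are far fewer ways for different threads/tasks to be interleaved and accidentally interfere with each other. The downside is that it's possible to write code that doesn't use await in the right places, which means we can't switch to another task. In particular, if we stop and wait for something but don't mark it with await, then our whole program stops, not just the particular task that made the blocking call.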

Now let's look at your example code again:

async def link_user(url):
    with urllib.request.urlopen(url) as u:
        page = u.read()
        print(url, len(page))

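Notice that link_user doesn't use await at all. This is what stops our program from running concurrently: each time link_user is called, it sends the request and then waits for the response, without letting anything else run.

You can see this more easily if you add a print call at the beginning: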

async def link_user(url):
    print("starting to fetch", url)
    with urllib.request.urlopen(url) as u:
        page = u.read()
        print("finished fetching", url, len(page))

It prints something like:

starting to fetch https://www.yahoo.com/
finished fetching https://www.yahoo.com/ 520675
starting to fetch http://www.cnn.com
finished fetching http://www.cnn.com 171329
starting to fetch http://www.python.org
finished fetching http://www.python.org 49239
[... you get the idea ...]

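To avoid this, we need to switch to an HTTP library that's designed to work with trio. Hopefully in the future we'll have familiar options like urllib3 and requests. Until then, your best choice is probably asks.

So here's your code rewritten to run the link_user calls concurrently, and to use an async HTTP library: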

import trio, time
import asks
# Selects the trio backend; recent asks releases may not need this call.
asks.init("trio")

sites = [
    'https://www.yahoo.com/',
    'http://www.cnn.com',
    'http://www.python.org',
    'http://www.jython.org',
    'http://www.pypy.org',
    'http://www.perl.org',
    'http://www.cisco.com',
    'http://www.facebook.com',
    'http://www.twitter.com',
    'http://www.macrumors.com/',
    'http://arstechnica.com/',
    'http://www.reuters.com/',
    'http://abcnews.go.com/',
    'http://www.cnbc.com/',
]

async def link_user(url):
    print("starting to fetch", url)
    r = await asks.get(url)
    print("finished fetching", url, len(r.content))

async def show_len(sites):
    t1 = time.time()
    async with trio.open_nursery() as nursery:
        for url in sites:
            nursery.start_soon(link_user, url)
    print("code took to run", time.time() - t1)


if __name__ == "__main__":
    trio.run(show_len, sites)

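Now this should run faster than the sequential version.

There's more discussion of both of these points in the trio tutorial: https://trio.readthedocs.io/en/latest/tutorial.html#async-functions

You might also find this talk useful: https://www.youtube.com/watch?v=i-R704I8ySE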

score 0 · Accepted Answer

An async example with httpx, which is compatible with both asyncio and trio and has an interface similar to requests.

import trio, time
import httpx

sites = [
    'https://www.yahoo.com/',
    'https://www.cnn.com',
    'https://www.python.org',
    'https://www.jython.org',
    'https://www.pypy.org',
    'https://www.perl.org',
    'https://www.cisco.com',
    'https://www.facebook.com',
    'https://www.twitter.com',
    'https://www.macrumors.com/',
    'https://arstechnica.com/',
    'https://www.reuters.com/',
    'https://abcnews.go.com/',
    'https://www.cnbc.com/',
]


async def link_user(url):
    print("starting to fetch", url)
    async with httpx.AsyncClient() as client:
        r = await client.get(url)
    print("finished fetching", url, len(r.content))

async def show_len(sites):
    t1 = time.time()
    async with trio.open_nursery() as nursery:
        for url in sites:
            nursery.start_soon(link_user, url)
    print("code took to run", time.time() - t1)


if __name__ == "__main__":
    trio.run(show_len, sites)
