
I have code like this.

import urllib2

results = []

for p in range(1,1000):
    result = False
    while result is False:
        ret = urllib2.Request('http://server/?'+str(p))
        try:
            result = process(urllib2.urlopen(ret).read())
        except (urllib2.HTTPError, urllib2.URLError):
            pass
    results.append(result)

I'd like to speed this up by making two or three requests at the same time. Can I use urllib2 for that, and if so, how? If not, which other library should I use? Thanks.


7 Answers


So, it's 2016, we have Python 3.4+ and the built-in asyncio module for asynchronous I/O. We can use aiohttp as an HTTP client to download multiple URLs in parallel.

import asyncio
from aiohttp import ClientSession

async def fetch(url):
    async with ClientSession() as session:
        async with session.get(url) as response:
            return await response.read()

async def run(loop, r):
    url = "http://localhost:8080/{}"
    tasks = []
    for i in range(r):
        task = asyncio.ensure_future(fetch(url.format(i)))
        tasks.append(task)

    responses = await asyncio.gather(*tasks)
    # you now have all response bodies in this variable
    print(responses)

loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run(loop, 4))
loop.run_until_complete(future)

Source: copy-pasted from http://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html

Answered 2016-07-27T20:08:20.840

You can do this with asynchronous I/O.

requests + gevent = grequests

GRequests allows you to use Requests with Gevent to make asynchronous HTTP requests easily.

import grequests

urls = [
    'http://www.heroku.com',
    'http://tablib.org',
    'http://httpbin.org',
    'http://python-requests.org',
    'http://kennethreitz.com'
]

rs = (grequests.get(u) for u in urls)
grequests.map(rs)
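
grequests.map returns the responses in the same order as the input, and (as far as I recall) leaves None in the slot for any request that failed, so to actually keep the downloaded bodies, as in the question's results list, you could do something like this (my addition, not part of the original example):

rs = (grequests.get(u) for u in urls)
responses = grequests.map(rs)
# map() preserves the input order; entries are None for requests that failed
results = [r.content for r in responses if r is not None]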
Answered 2012-09-28T09:30:24.393

Take a look at gevent, a coroutine-based Python networking library that uses greenlet to provide a high-level synchronous API on top of the libevent event loop.

Example:

#!/usr/bin/python
# Copyright (c) 2009 Denis Bilenko. See LICENSE for details.

"""Spawn multiple workers and wait for them to complete"""

urls = ['http://www.google.com', 'http://www.yandex.ru', 'http://www.python.org']

import gevent
from gevent import monkey

# patches stdlib (including socket and ssl modules) to cooperate with other greenlets
monkey.patch_all()

import urllib2


def print_head(url):
    print 'Starting %s' % url
    data = urllib2.urlopen(url).read()
    print '%s: %s bytes: %r' % (url, len(data), data[:50])

jobs = [gevent.spawn(print_head, url) for url in urls]

gevent.joinall(jobs)
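
gevent.spawn returns Greenlet objects, and after joinall each greenlet's return value is available on its value attribute, so collecting the downloaded bodies could look like this (my addition, not part of the original example, reusing the urls list from above):

def fetch(url):
    return urllib2.urlopen(url).read()

jobs = [gevent.spawn(fetch, url) for url in urls]
gevent.joinall(jobs)
# job.value is the function's return value; it stays None if the greenlet raised
results = [job.value for job in jobs if job.value is not None]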
Answered 2010-11-07T22:33:38.397

A 2021 answer using modern async libraries

The 2016 answer is good, but I thought I'd offer another answer that uses httpx instead of aiohttp, since httpx is only a client and supports different async environments. I'm leaving out the OP's for loop, where the URL is built by appending a number to a string, because I feel this makes for a more general answer.

import asyncio
import httpx

# you can have synchronous code here

async def getURL(url):
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        # we could have some synchronous code here too
        # to do CPU bound tasks on what we just fetched for instance
        return response

# more synchronous code can go here

async def main():
    response1, response2 = await asyncio.gather(getURL(url1),getURL(url2))
    # do things with the responses
    # you can also have synchronous code here

asyncio.run(main()) 

Any code after an await inside the async with block runs as soon as that awaited task completes. This is a good place to parse a response without waiting for all of your requests to finish.

Code after asyncio.gather runs once all of the tasks have completed. This is a good place to do work that needs information from all of the requests, which may already have been pre-processed inside the async functions that gather called.
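
For reference, here is a minimal sketch of the same pattern applied to the question's numbered URLs (my addition, with the 'http://server/?N' URLs and the limit of three concurrent requests as assumptions):

import asyncio
import httpx

async def fetch(client, url, semaphore):
    # the semaphore caps how many requests are in flight at the same time
    async with semaphore:
        response = await client.get(url)
        return response.text

async def main():
    semaphore = asyncio.Semaphore(3)
    # reuse a single client for all requests instead of opening one per call
    async with httpx.AsyncClient() as client:
        urls = ['http://server/?' + str(p) for p in range(1, 1000)]
        return await asyncio.gather(*(fetch(client, url, semaphore) for url in urls))

results = asyncio.run(main())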

Answered 2021-10-20T09:10:23.227

I know this question is a bit old, but I thought it might be useful to promote another async solution built on the requests library.

list_of_requests = ['http://moop.com', 'http://doop.com', ...]

from simple_requests import Requests
for response in Requests().swarm(list_of_requests):
    print response.content

The docs are here: http://pythonhosted.org/simple-requests/

Answered 2013-10-21T15:22:30.813

Either figure out threads, or use Twisted (which is asynchronous).
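
For the threads route, a minimal sketch (my addition, keeping the question's urllib2 and its 'http://server/?N' URLs as assumptions) could look like this:

import threading
import Queue
import urllib2

def worker(url_queue, results):
    while True:
        try:
            url = url_queue.get_nowait()
        except Queue.Empty:
            return
        try:
            # list.append is atomic under the GIL, so no extra locking is needed here
            results.append(urllib2.urlopen(url).read())
        except (urllib2.HTTPError, urllib2.URLError):
            pass

url_queue = Queue.Queue()
for p in range(1, 1000):
    url_queue.put('http://server/?' + str(p))

results = []
threads = [threading.Thread(target=worker, args=(url_queue, results))
           for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()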

Answered 2010-11-07T22:05:51.943

Maybe use multiprocessing and split the work across 2 or so processes.

Here is an example (untested):

import multiprocessing
import Queue
import urllib2


NUM_PROCESS = 2
NUM_URL = 1000


class DownloadProcess(multiprocessing.Process):
    """Download Process """

    def __init__(self, urls_queue, result_queue):

        multiprocessing.Process.__init__(self)

        self.urls = urls_queue
        self.result = result_queue

    def run(self):
        while True:
            try:
                url = self.urls.get_nowait()
            except Queue.Empty:
                break

            ret = urllib2.Request(url)

            try:
                result = urllib2.urlopen(ret).read()
            except (urllib2.HTTPError, urllib2.URLError):
                continue  # skip failed downloads instead of putting an undefined result

            self.result.put(result)

        # sentinel so the main process knows this worker is finished
        self.result.put(None)


def main():

    main_url = 'http://server/?%s'

    urls_queue = multiprocessing.Queue()
    for p in range(1, NUM_URL):
        urls_queue.put(main_url % p)

    result_queue = multiprocessing.Queue()

    workers = []
    for i in range(NUM_PROCESS):
        download = DownloadProcess(urls_queue, result_queue)
        download.start()
        workers.append(download)

    # "while result_queue:" would spin forever, since a Queue object is always truthy;
    # instead, read until every worker has sent its None sentinel
    results = []
    finished = 0
    while finished < NUM_PROCESS:
        result = result_queue.get()
        if result is None:
            finished += 1
        else:
            results.append(result)

    for download in workers:
        download.join()

    return results

if __name__ == "__main__":
    results = main()

    for res in results:
        print(res)
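
As a hedged alternative (my addition, not the original answer), multiprocessing.Pool can express the same idea with less queue bookkeeping:

import multiprocessing
import urllib2

def download(url):
    try:
        return urllib2.urlopen(url).read()
    except (urllib2.HTTPError, urllib2.URLError):
        return None  # mark failed downloads instead of retrying

if __name__ == "__main__":
    urls = ['http://server/?%s' % p for p in range(1, 1000)]
    pool = multiprocessing.Pool(processes=2)
    # map() preserves the input order and blocks until every URL is done
    results = [r for r in pool.map(download, urls) if r is not None]
    pool.close()
    pool.join()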
Answered 2010-11-07T22:49:18.877