python - Example of urllib3 and threading in Python

Question

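I'm trying to use urllib3 in simple threads to fetch several wiki pages. The script creates one connection per thread (I don't understand why) and then hangs forever. Any tips, advice, or a simple example of urllib3 with threading would be appreciated.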

import threadpool
from urllib3 import connection_from_url

HTTP_POOL = connection_from_url(url, timeout=10.0, maxsize=10, block=True)

def fetch(url, fields):
  kwargs={'retries':6}
  return HTTP_POOL.get_url(url, fields, **kwargs)

pool = threadpool.ThreadPool(5)
requests = threadpool.makeRequests(fetch, iterable)
[pool.putRequest(req) for req in requests]

@Lennart's script produces this error:

http://en.wikipedia.org/wiki/2010-11_Premier_LeagueTraceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/threadpool.py", line 156, in run
 http://en.wikipedia.org/wiki/List_of_MythBusters_episodeshttp://en.wikipedia.org/wiki/List_of_Top_Gear_episodes http://en.wikipedia.org/wiki/List_of_Unicode_characters    result = request.callable(*request.args, **request.kwds)
  File "crawler.py", line 9, in fetch
    print url, conn.get_url(url)
AttributeError: 'HTTPConnectionPool' object has no attribute 'get_url'
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/threadpool.py", line 156, in run
    result = request.callable(*request.args, **request.kwds)
  File "crawler.py", line 9, in fetch
    print url, conn.get_url(url)
AttributeError: 'HTTPConnectionPool' object has no attribute 'get_url'
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/threadpool.py", line 156, in run
    result = request.callable(*request.args, **request.kwds)
  File "crawler.py", line 9, in fetch
    print url, conn.get_url(url)
AttributeError: 'HTTPConnectionPool' object has no attribute 'get_url'
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/threadpool.py", line 156, in run
    result = request.callable(*request.args, **request.kwds)
  File "crawler.py", line 9, in fetch
    print url, conn.get_url(url)
AttributeError: 'HTTPConnectionPool' object has no attribute 'get_url'

After adding import threadpool; import urllib3 and tpool = threadpool.ThreadPool(4) to @user318904's code, I get this error:

Traceback (most recent call last):
  File "crawler.py", line 21, in <module>
    tpool.map_async(fetch, urls)
AttributeError: ThreadPool instance has no attribute 'map_async'

score 2 · Accepted Answer

Here is my take, a solution using Python 3 and concurrent.futures.ThreadPoolExecutor.

import urllib3
from concurrent.futures import ThreadPoolExecutor

urls = ['http://en.wikipedia.org/wiki/2010-11_Premier_League',
        'http://en.wikipedia.org/wiki/List_of_MythBusters_episodes',
        'http://en.wikipedia.org/wiki/List_of_Top_Gear_episodes',
        'http://en.wikipedia.org/wiki/List_of_Unicode_characters',
        ]

def download(url, cmanager):
    response = cmanager.request('GET', url)
    if response and response.status == 200:
        print("+++++++++ url: " + url)
        print(response.data[:1024])

connection_mgr = urllib3.PoolManager(maxsize=5)
thread_pool = ThreadPoolExecutor(5)
for url in urls:
    thread_pool.submit(download, url, connection_mgr)

Some remarks

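My code is based on a similar example from the Python Cookbook by Beazley and Jones.
I particularly like that you only need one standard module besides urllib3.
The setup is extremely simple, and if you only care about the side effects of download (printing, saving to a file, etc.), no additional effort is needed to join the threads.
If you want something different, ThreadPoolExecutor.submit actually returns whatever download would return, wrapped in a Future.
I found it helpful to align the number of threads in the thread pool with the number of HTTPConnection's in a connection pool (via maxsize). Otherwise you may encounter (harmless) warnings when all threads try to access the same server (as in the example).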
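
The remark about Future objects can be made concrete with a minimal Python 3 sketch. It assumes you want the fetched data back in the main thread rather than printed inside download. The names fetch_all, make_fetcher and WORKERS are hypothetical helpers introduced here, not part of urllib3 or of the answer above. The same number is used for the thread pool and for maxsize, as the last remark suggests.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

WORKERS = 5  # used for both the thread pool and the connection pool


def fetch_all(urls, fetch, max_workers=WORKERS):
    """Run fetch(url) for every URL in a thread pool and collect the results.

    Returns a dict mapping each URL to either the value fetch returned
    or the exception it raised.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # submit() returns a Future immediately; remember which URL it belongs to.
        future_to_url = {executor.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()  # re-raises whatever fetch raised
            except Exception as exc:
                results[url] = exc
    return results


def make_fetcher(pool_size=WORKERS):
    """Build a fetch function backed by one shared urllib3.PoolManager."""
    import urllib3  # third-party; imported here so fetch_all works without it

    manager = urllib3.PoolManager(maxsize=pool_size)

    def fetch(url):
        response = manager.request("GET", url)
        return response.status, response.data[:1024]

    return fetch


if __name__ == "__main__":
    urls = [
        "http://en.wikipedia.org/wiki/2010-11_Premier_League",
        "http://en.wikipedia.org/wiki/List_of_MythBusters_episodes",
    ]
    for url, outcome in fetch_all(urls, make_fetcher()).items():
        print(url, outcome)
```

Because fetch_all takes the fetch function as a parameter, it can be tried with any callable (for example len) before any network access is involved.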

score 1 · Accepted Answer

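Obviously it will create one connection per thread; how else should each thread fetch a page? And you are trying to use the same connection, made from one url, for all urls. That can hardly be what you intended.

This code worked just fine: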

import threadpool
from urllib3 import connection_from_url

def fetch(url):
  kwargs={'retries':6}
  conn = connection_from_url(url, timeout=10.0, maxsize=10, block=True)
  print url, conn.get_url(url)
  print "Done!"

pool = threadpool.ThreadPool(4)
urls = ['http://en.wikipedia.org/wiki/2010-11_Premier_League',
        'http://en.wikipedia.org/wiki/List_of_MythBusters_episodes',
        'http://en.wikipedia.org/wiki/List_of_Top_Gear_episodes',
        'http://en.wikipedia.org/wiki/List_of_Unicode_characters',
        ]
requests = threadpool.makeRequests(fetch, urls)

[pool.putRequest(req) for req in requests]
pool.wait()
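
The traceback quoted in the question shows that get_url is not available on the asker's urllib3 installation, while the ThreadPoolExecutor answer uses request('GET', url). The following is a minimal Python 3 sketch of the same idea under those assumptions: a pool created by connection_from_url is bound to one host, so the URLs are grouped by host and each group shares one pool across threads. The names group_by_host and fetch_status are hypothetical helpers, and using concurrent.futures instead of threadpool is a substitution made for this sketch.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def group_by_host(urls):
    """Map 'scheme://host' to the list of URLs that live on that host."""
    groups = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        groups[f"{parts.scheme}://{parts.netloc}"].append(url)
    return dict(groups)


def fetch_status(pool, url):
    """Fetch one URL through a shared pool and return (url, HTTP status)."""
    # redirect=False: the pool is bound to one host, so a redirect is
    # reported through its status code instead of being followed.
    response = pool.request("GET", url, retries=6, redirect=False)
    return url, response.status


def main():
    from urllib3 import connection_from_url  # third-party; needed only here

    urls = [
        "http://en.wikipedia.org/wiki/2010-11_Premier_League",
        "http://en.wikipedia.org/wiki/List_of_MythBusters_episodes",
        "http://en.wikipedia.org/wiki/List_of_Top_Gear_episodes",
        "http://en.wikipedia.org/wiki/List_of_Unicode_characters",
    ]
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = []
        for host, host_urls in group_by_host(urls).items():
            # One pool per host, shared by every thread fetching from that host.
            pool = connection_from_url(host, timeout=10.0, maxsize=4, block=True)
            futures += [executor.submit(fetch_status, pool, u) for u in host_urls]
        for future in futures:
            print(*future.result())


if __name__ == "__main__":
    main()
```

With maxsize equal to the number of worker threads and block=True, no thread waits for a connection and the pool never opens more than four connections to the host.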

score 1 · Accepted Answer

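Threaded programming is hard, so I wrote workerpool to make exactly what you're doing easier.

More specifically, see the Mass Downloader example.

To do the same thing with urllib3, it would look something like this: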

import urllib3
import workerpool

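# One connection pool for the target host, shared by all worker threads
http_pool = urllib3.connection_from_url("foo", maxsize=3)

def download(url):
    # request("GET", ...) is used because newer urllib3 releases no longer provide get_url()
    r = http_pool.request("GET", url)
    # TODO: Do something with r.data
    print("Downloaded %s" % url)

# Initialize a worker pool, 5 threads in this case
# (a separate name, so the connection pool above is not overwritten)
workers = workerpool.WorkerPool(size=5)

# The ``download`` method will be called with a line from the second
# parameter for each job; strip() removes the trailing newline of each line.
workers.map(download, [line.strip() for line in open("urls.txt")])

# Send shutdown jobs to all threads, and wait until all the jobs have been completed
workers.shutdown()
workers.wait()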

For more sophisticated code, have a look at workerpool.EquippedWorker (and the tests here for example usage). You can make the urllib3 pool the toolbox you pass in.

score -1 · Accepted Answer

I use something like this:

#excluding setup for threadpool etc

upool = urllib3.HTTPConnectionPool('en.wikipedia.org', block=True)

urls = ['/wiki/2010-11_Premier_League',
        '/wiki/List_of_MythBusters_episodes',
        '/wiki/List_of_Top_Gear_episodes',
        '/wiki/List_of_Unicode_characters',
        ]

def fetch(path):
    # add error checking
    return upool.request('GET', path).data

tpool = ThreadPool()

tpool.map_async(fetch, urls)

# either wait on the result object or give map_async a callback function for the results
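
The AttributeError quoted in the question arises because threadpool.ThreadPool has no map_async method. The standard library's multiprocessing.pool.ThreadPool does provide one; whether that is the class this answer had in mind is an assumption. A minimal Python 3 sketch under that assumption follows, with fetch_many as a hypothetical helper name:

```python
from multiprocessing.pool import ThreadPool


def fetch_many(fetch, paths, threads=4, timeout=30.0):
    """Apply fetch to every path on a thread pool; results keep the input order."""
    with ThreadPool(threads) as tpool:
        async_result = tpool.map_async(fetch, paths)
        # get() blocks until every call has finished, or raises
        # multiprocessing.TimeoutError after `timeout` seconds.
        return async_result.get(timeout=timeout)


def main():
    import urllib3  # third-party; needed only for the real download

    upool = urllib3.HTTPConnectionPool("en.wikipedia.org", maxsize=4, block=True)

    def fetch(path):
        # redirect=False: report a redirect status instead of following it
        # to a host this pool is not bound to.
        return upool.request("GET", path, redirect=False).status

    paths = [
        "/wiki/2010-11_Premier_League",
        "/wiki/List_of_MythBusters_episodes",
        "/wiki/List_of_Top_Gear_episodes",
        "/wiki/List_of_Unicode_characters",
    ]
    for path, status in zip(paths, fetch_many(fetch, paths)):
        print(path, status)


if __name__ == "__main__":
    main()
```

Waiting on the result object with get() is the first of the two options the comment above mentions; the alternative is to pass callback= to map_async and handle the result list there.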
