My task is to download 1M+ images from a given list of URLs. What is the recommended way to do that?
After reading Greenlet Vs. Threads I looked into gevent, but I cannot get it to run reliably. I played around with a test set of 100 URLs: sometimes it finishes in 1.5 seconds, but sometimes it takes more than 30 seconds, which is strange because the timeout* per request is 0.1 seconds, so it should never take more than 10 seconds.
*see the code below
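Out of curiosity about the 30-second runs: my guess is that requests' timeout applies to individual socket operations rather than to the total request, so a slow server could stretch a single request far past 0.1 s. Here is a minimal sketch of enforcing a hard wall-clock deadline per request with gevent.Timeout instead (assuming monkey patching is active, as in my code below; fetch_with_deadline is just an illustrative name):

import gevent
import requests

def fetch_with_deadline(url, deadline_seconds=1.0):
    # gevent.Timeout raises in this greenlet once deadline_seconds of
    # wall-clock time have passed, regardless of how requests slices
    # its own timeout over the connect and read operations
    with gevent.Timeout(deadline_seconds):
        return requests.get(url, timeout=0.1).content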
I also looked into grequests, but it seems to have issues with exception handling.
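For reference, this is roughly how I tried to collect errors with grequests, using test_urls and POOL_SIZE from my code below (I am assuming the exception_handler hook behaves as documented; on_error is my own name):

import grequests

def on_error(request, exception):
    # grequests.map calls this for every request that raised
    print 'failed:', request.url, exception

reqs = (grequests.get(url, timeout=0.1) for url in test_urls)
responses = grequests.map(reqs, size=POOL_SIZE, exception_handler=on_error)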
My "requirements" are that I can
- inspect the errors raised while downloading (timeouts, corrupt images, ...),
- monitor the progress of the number of processed images, and
- be as fast as possible (a sketch of what I mean by the first two points follows this list).
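Concretely, this is a rough sketch of the error and progress handling I have in mind, written against the gevent code below (check_tasks is my own name, and the assumption that PIL signals a corrupt image with an IOError is mine):

import requests

def check_tasks(tasks):
    # classify failures and report progress as each greenlet is collected
    done = 0
    for task in tasks:
        try:
            task.get()
        except requests.exceptions.Timeout:
            print 'timeout',
        except IOError:
            # PIL raises IOError when it cannot decode the downloaded bytes
            print 'corrupt',
        except Exception as error:
            print 'error:', error,
        done += 1
        print '(%d/%d)' % (done, len(tasks)),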
from gevent import monkey; monkey.patch_all()
from time import time
import requests
from PIL import Image
import cStringIO
import gevent

POOL_SIZE = 300

def download_image_wrapper(task):
    # unpack (image_url, download_path)
    return download_image(task[0], task[1])

def download_image(image_url, download_path):
    # fetch the raw bytes, decode them with PIL, write the image to disk
    raw_binary_request = requests.get(image_url, timeout=0.1).content
    image = Image.open(cStringIO.StringIO(raw_binary_request))
    image.save(download_path)

def download_images_gevent_spawn(list_of_image_urls, base_folder):
    download_paths = ['/'.join([base_folder, url.split('/')[-1]])
                      for url in list_of_image_urls]
    parameters = [[image_url, download_path] for image_url, download_path in
                  zip(list_of_image_urls, download_paths)]
    # one greenlet per URL; note that POOL_SIZE is never actually used here
    tasks = [gevent.spawn(download_image_wrapper, parameter_tuple)
             for parameter_tuple in parameters]
    for task in tasks:
        try:
            task.get()
        except Exception:
            print 'x',
            continue
        print '.',

test_urls = []  # list of 100 urls
t1 = time()
download_images_gevent_spawn(test_urls, 'download_temp')
print time() - t1
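One thing I noticed while cleaning this up: POOL_SIZE is defined but never used, so the code above spawns a greenlet for every URL at once. Here is a sketch of the bounded variant I am considering, using gevent.pool.Pool and imap_unordered for progress (safe_download is my own wrapper; not tested at 1M scale):

from gevent.pool import Pool

def safe_download(task):
    # catch everything so that one bad URL cannot abort the whole loop
    try:
        download_image_wrapper(task)
        return '.'
    except Exception:
        return 'x'

def download_images_gevent_pool(list_of_image_urls, base_folder):
    download_paths = ['/'.join([base_folder, url.split('/')[-1]])
                      for url in list_of_image_urls]
    pool = Pool(POOL_SIZE)  # at most POOL_SIZE downloads in flight
    done = 0
    # imap_unordered yields a result as soon as any greenlet finishes
    for marker in pool.imap_unordered(safe_download,
                                      zip(list_of_image_urls, download_paths)):
        done += 1
        print marker,  # '.' = ok, 'x' = failed, as above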