python - Python: downloading multiple files quickly

Question

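How can I download a batch of files quickly in Python? urllib.urlretrieve() is slow, and I'm not sure how to approach this.

I have a list of 15-20 files to download, and downloading even one takes a long time. Each file is about 2-4 MB.

I've never done this before and I'm not sure where to start. Should I use threads and download several files at once? Or should I use threads to download pieces of each file, one file at a time? Or should I not use threads at all?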

score 3 · Accepted Answer

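Try the wget module for Python. Here is a code snippet.

import wget  # third-party package: pip install wget

# url is the address of the file; path is where to save it
wget.download(url, out=path)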

score 1 · Accepted Answer

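"urllib.urlretrieve() is very slow"

Really? If you have 15-20 files of 2-4 MB each, I would just line them up and download them one after another. The bottleneck will be the server and your own bandwidth. So IMHO it's hardly worth threading or trying anything clever in this case...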
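
A minimal sketch of the "download them in a row" approach, assuming Python 3 (where urlretrieve lives in urllib.request rather than urllib). The function name download_all and the choice to time each file are illustrative and not part of the answer; the timings are only there to help check whether the server or the connection is the limiting factor.

```python
import os
import time
from urllib.parse import urlsplit
from urllib.request import urlretrieve


def download_all(urls, dest_dir):
    """Download each URL in turn; return a list of (url, local_path, seconds)."""
    os.makedirs(dest_dir, exist_ok=True)
    results = []
    for url in urls:
        # Assumption: every URL ends in a distinct file name, which is
        # reused as the local file name.
        name = os.path.basename(urlsplit(url).path)
        local_path = os.path.join(dest_dir, name)
        start = time.perf_counter()
        urlretrieve(url, local_path)
        results.append((url, local_path, time.perf_counter() - start))
    return results
```

If the per-file times add up to roughly what the connection can deliver, running downloads in parallel is unlikely to help much.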

score 1 · Accepted Answer

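One solution (not Python-specific) is to save the download URLs in a separate file and then fetch them with a download manager such as wget or aria2. You can invoke the download manager from your Python program.

But as @Jon mentioned, that isn't really necessary in your case. urllib.urlretrieve() is enough!

Another option is to use Mechanize to download the files.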
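
The download-manager approach from this answer might look roughly like the sketch below. It assumes Python 3 and that the wget command-line tool is installed and on the PATH; the helper names (write_url_list, build_wget_command, download_with_wget) are made up for illustration. The wget options used are -i (read URLs from a file) and -P (directory to save into).

```python
import subprocess


def write_url_list(urls, list_path):
    """Write one URL per line, the format wget's -i option reads."""
    with open(list_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(urls) + "\n")


def build_wget_command(list_path, dest_dir):
    """Build the wget argument list: -i names the URL file, -P the output directory."""
    return ["wget", "-i", list_path, "-P", dest_dir]


def download_with_wget(urls, list_path, dest_dir):
    """Write the URL list and run wget on it.

    Raises subprocess.CalledProcessError if wget exits with a non-zero status.
    """
    write_url_list(urls, list_path)
    subprocess.run(build_wget_command(list_path, dest_dir), check=True)
```

Passing the arguments as a list avoids going through a shell, so URLs containing characters such as & or ? need no quoting. aria2c also accepts a URL list through its own -i option, so the same pattern should carry over with a different command list.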

score 0 · Accepted Answer

stream.py is a somewhat experimental but cute UI for parallel Python (via threads or processes), based on ideas from dataflow programming. Its examples include a URL retriever:

https://github.com/aht/stream.py/blob/master/example/retrieve_urls.py

Since it's short:

#!/usr/bin/env python

"""
Demonstrate the use of a ThreadPool to simultaneously retrieve web pages.
"""

import urllib2
from stream import ThreadPool

URLs = [
    'http://www.cnn.com/',
    'http://www.bbc.co.uk/',
    'http://www.economist.com/',
    'http://nonexistant.website.at.baddomain/',
    'http://slashdot.org/',
    'http://reddit.com/',
    'http://news.ycombinator.com/',
]

def retrieve(urls, timeout=30):
    for url in urls:
        yield url, urllib2.urlopen(url, timeout=timeout).read()

if __name__ == '__main__':
    retrieved = URLs >> ThreadPool(retrieve, poolsize=4)
    for url, content in retrieved:
        print '%r is %d bytes' % (url, len(content))
    for url, exception in retrieved.failure:
        print '%r failed: %s' % (url, exception)

You would only need to replace urllib2.urlopen(url, timeout=timeout).read() with urlretrieve...
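
For readers on Python 3, a similar thread-pool download can be sketched with the standard library alone. Note that this swaps stream.py's ThreadPool for concurrent.futures.ThreadPoolExecutor, so it is an alternative to the example above rather than a port of it. The names fetch and download_parallel are hypothetical, and the default of 4 workers simply mirrors poolsize=4 in the original.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.parse import urlsplit
from urllib.request import urlretrieve


def fetch(url, dest_dir):
    """Download one URL into dest_dir and return the local path."""
    # Assumption: the last path component of the URL is a usable file name.
    local_path = os.path.join(dest_dir, os.path.basename(urlsplit(url).path))
    urlretrieve(url, local_path)
    return local_path


def download_parallel(urls, dest_dir, workers=4):
    """Download URLs with a small thread pool; return (successes, failures)."""
    os.makedirs(dest_dir, exist_ok=True)
    successes, failures = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to its URL so results can be reported per URL.
        futures = {pool.submit(fetch, url, dest_dir): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                successes[url] = future.result()
            except Exception as exc:  # e.g. URLError for an unreachable host
                failures[url] = exc
    return successes, failures
```

As in the stream.py example, failed URLs are collected separately instead of aborting the whole batch. Threads fit here because the work is dominated by waiting on the network, though the server and the available bandwidth still cap the overall speed.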
