python - Downloading multiple files in parallel? (Linux/Python?)

Question

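I have a large list of remote file locations and the local paths where I want them to end up. Each file is small, but there are a lot of them. I generate this list in Python.

I want to download all of these files as quickly as possible (in parallel) before unpacking and processing them. What is the best library or Linux command-line utility for this? I tried using multiprocessing.pool, but that did not work with the FTP library.

I looked at pycurl, which seemed to be what I wanted, but I could not get it to run on Windows 7 x64.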

score 0 · Accepted Answer

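Try wget, a command-line utility that is installed on most Linux distributions and is also available on Windows through Cygwin.

You could also look at Scrapy, a library/framework written in Python.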
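Since the list of files is generated in Python, one way to use the wget suggestion is to start one wget process per file from a small thread pool. The following is only a minimal sketch under a few assumptions: wget is on the PATH, the list is a sequence of (url, local_path) pairs, and the helper names (build_wget_command, fetch_with_wget, fetch_all) and the default of 8 workers are made up for illustration.

```python
import os
import subprocess
from multiprocessing.pool import ThreadPool


def build_wget_command(url, local_path):
    """Return the argument list for fetching one URL to one local path."""
    # -q silences progress output; -O writes the download to the given file.
    return ["wget", "-q", "-O", local_path, url]


def fetch_with_wget(job):
    """Run wget for a single (url, local_path) pair; return (url, exit code)."""
    url, local_path = job
    # Make sure the target directory exists before wget writes to it.
    os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
    return url, subprocess.call(build_wget_command(url, local_path))


def fetch_all(jobs, workers=8):
    """Download every (url, local_path) pair, `workers` wget processes at a time."""
    pool = ThreadPool(workers)
    try:
        # Threads are enough here: each one just waits for its wget process.
        return pool.map(fetch_with_wget, jobs)
    finally:
        pool.close()
        pool.join()
```

Calling fetch_all(jobs) returns a list of (url, exit_code) pairs, so any entry with a non-zero exit code can be retried or reported.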

score 0 · Accepted Answer

I usually use pscp for this sort of thing, and call it with subprocess.Popen

For example:

import subprocess

# Raw string so the backslashes in the Windows path are kept literally
pscp_command = r'''"c:\program files\putty\pscp.exe" -pw <pwd> -p -scp -unsafe <file location on my linux machine including machine name and login, can use wildcards here> <where you want the files to go on a windows machine>'''
p = subprocess.Popen(pscp_command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
# communicate() reads both pipes and waits for pscp to exit, so no extra wait() is needed
stdout, stderr = p.communicate()

Of course, I am assuming linux --> windows

score 0 · Accepted Answer

If you use the Pool object from the multiprocessing module, urllib2 should be able to handle FTP.

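import urllib2  # Python 2; in Python 3 the equivalent module is urllib.request
from multiprocessing import Pool

def get_url(url):
    # url should start with 'ftp:'
    try:
        return url, urllib2.urlopen(url).read()
    except Exception:
        # add more meaningful exception handling if you need it, e.g. retry once
        return url, None

if __name__ == '__main__':  # needed on Windows, where each worker re-imports this module
    # num_processes and url_list must be defined before this point
    pool = Pool(processes=num_processes)
    # Workers cannot update a dict in the parent process, so each call returns its result
    results = dict(pool.map(get_url, url_list))
    pool.close()
    pool.join()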

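Of course, spawning processes incurs some serious overhead. If you can use a third-party module such as twisted, non-blocking requests will almost certainly be faster

Whether the overhead is a serious problem will depend on the relative size of the download time per file and the network latency.

You could try implementing this with Python threads instead of processes, but it gets a bit trickier. See the answers to this question for using urllib2 safely with threads. You will also need to use multiprocessing.pool.ThreadPool instead of the regular Pool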
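The thread-based variant mentioned above could look roughly like the sketch below. It is written for Python 3, so it uses urllib.request in place of urllib2, and it assumes the input is a list of (url, local_path) pairs. The function names download and download_all and the default of 16 workers are hypothetical choices, not part of the original answer.

```python
import os
import shutil
import urllib.request
from multiprocessing.pool import ThreadPool


def download(job):
    """Fetch one (url, local_path) pair; return (url, None) or (url, exception)."""
    url, local_path = job
    try:
        os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
        # urlopen handles ftp:// as well as http:// URLs.
        with urllib.request.urlopen(url) as response, open(local_path, "wb") as out:
            shutil.copyfileobj(response, out)
        return url, None
    except Exception as exc:
        # Report the failure to the caller instead of aborting the whole batch.
        return url, exc


def download_all(jobs, workers=16):
    """Download all pairs using a pool of threads rather than processes."""
    with ThreadPool(workers) as pool:
        # map() blocks until every job has finished and preserves input order.
        return pool.map(download, jobs)
```

Because the work is I/O-bound, threads avoid the process start-up cost while still overlapping the network waits. Each job writes straight to its own file and returns only a status, so there is no shared state to protect.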

score 0 · Accepted Answer

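I know this is an old post, but there is a perfect Linux utility for this. If you are transferring files from a remote host, lftp is great! I mostly use it to quickly push content to my FTP server, but it works just as well for pulling content with the mirror command. It also has an option to copy a user-defined number of files in parallel. If you want to copy some files from a remote path to a local path, your command line would look something like this: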

lftp
open ftp://user:password@ftp.site.com
cd some/remote/path
lcd some/local/path
mirror --parallel=2

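But be very careful with this command: as with other mirror commands, if you get it wrong you will delete files.

For more options and documentation on lftp, I used this site: http://lftp.yar.ru/lftp-man.html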
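To run the same lftp session from the Python script that generates the file list, the commands can be passed in one go with lftp -c. This is a rough sketch that assumes lftp is on the PATH and that the paths contain no spaces or shell-special characters. The function names build_lftp_command and mirror_down are hypothetical. The sketch deliberately leaves out --reverse, because in lftp that flag makes mirror upload from local to remote, whereas the goal here is a download.

```python
import subprocess


def build_lftp_command(site_url, remote_dir, local_dir, parallel=2):
    """Return the argument list for a parallel remote-to-local mirror."""
    # "mirror SOURCE TARGET" copies the remote directory into the local one;
    # --parallel=N transfers up to N files at the same time.
    script = "open {0}; mirror --parallel={1} {2} {3}".format(
        site_url, parallel, remote_dir, local_dir)
    # -c makes lftp execute the given commands and then exit.
    return ["lftp", "-c", script]


def mirror_down(site_url, remote_dir, local_dir, parallel=2):
    """Run lftp and return its exit code."""
    return subprocess.call(
        build_lftp_command(site_url, remote_dir, local_dir, parallel))
```

For example, mirror_down("ftp://user:password@ftp.site.com", "some/remote/path", "some/local/path") corresponds to the interactive session shown above.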
