ipython - Speeding up parallel data downloads with IPython

Question

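I have a lot (~1000) of small files to download. I wrote a function for this so that I can use map. The download function itself uses requests, which greatly improved stability over urllib2, which gave me a lot of timeouts. However, there is only a slight speedup when running in parallel on, for example, 4 processes compared to running a serial map: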

data = map(get_data, IDs)
data = dview.map_sync(get_data, IDs)

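I am not sure about the following:

Is map_sync the best choice? I considered map_async, but I need the complete list, so that should not make any difference?
What else can be done to speed up the process?
My expectation is that n downloads run at the same time, instead of one after another.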

score 1 · Accepted Answer

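Since your downloads are IO-bound, I would actually recommend a simple ThreadPool over IPython.parallel (caveat: I am the author of IPython.parallel). It is much easier to get started with, and nothing that IPython.parallel does is really beneficial to the case you present.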

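I set up a simple server that responds slowly to test requests.

Test a simple request to my slow server. It just replies to any request for /NUMBER with that number, but the server is artificially slow in handling the request: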

import requests

r = requests.get("http://localhost:8888/10")
r.content

'10'
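
The slow server itself is not shown in this answer. A minimal stand-in, assuming Python 3.7+ and only the standard library, might look like the sketch below. The names `SlowHandler`, `start_slow_server`, and `fetch_number` are hypothetical, and the 0.2 s default delay is an assumption (roughly what 26.2 seconds for 128 serial requests would imply), not the original implementation.

```python
import threading
import time
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class SlowHandler(BaseHTTPRequestHandler):
    """Echo the number in the request path back after an artificial delay."""

    def do_GET(self):
        time.sleep(self.server.delay)  # simulate slow request handling
        body = self.path.lstrip("/").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet during benchmarks


def start_slow_server(host="localhost", port=8888, delay=0.2):
    """Start the server in a background thread and return it."""
    server = ThreadingHTTPServer((host, port), SlowHandler)
    server.delay = delay  # assumed per-request delay in seconds
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_number(server, number):
    """Request /<number> from the running server and return the raw body."""
    host, port = server.server_address[:2]
    conn = HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", "/%d" % number)
        return conn.getresponse().read()
    finally:
        conn.close()
```

Calling `start_slow_server()` once would make `http://localhost:8888/10` answer with `10` after the delay. Each request is handled in its own thread, so concurrent clients are not serialized by the server.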

Our get_data function downloads the URL for a given ID and parses the result (converting the string representation of an int to an int):

def get_data(ID):
    """function for getting data from our slow server"""
    r = requests.get("http://localhost:8888/%i" % ID)
    return int(r.content)

Now test fetching a bunch of data with a thread pool, using varying numbers of concurrent threads:

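import time
from multiprocessing.pool import ThreadPool

IDs = range(128)
for nthreads in [1, 2, 4, 8, 16, 32]:
    pool = ThreadPool(nthreads)
    # time one full pass over all IDs with this many threads
    tic = time.time()
    results = pool.map(get_data, IDs)
    toc = time.time()
    pool.close()  # release the worker threads before the next round
    print("%3i threads: %5.1f seconds" % (nthreads, toc-tic))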


  1 threads:  26.2 seconds
  2 threads:  13.3 seconds
  4 threads:   6.7 seconds
  8 threads:   3.4 seconds
 16 threads:   1.8 seconds
 32 threads:   1.1 seconds
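
To read these timings as scaling behaviour, the following sketch computes the speedup and parallel efficiency of each run relative to the single-thread run. It uses only the numbers reported above; the variable names are illustrative.

```python
# Timings copied from the benchmark output above: threads -> seconds.
timings = {1: 26.2, 2: 13.3, 4: 6.7, 8: 3.4, 16: 1.8, 32: 1.1}

baseline = timings[1]
# Speedup: how many times faster than the single-thread run.
speedup = {n: baseline / t for n, t in timings.items()}
# Efficiency: speedup divided by thread count (1.0 would be perfect scaling).
efficiency = {n: speedup[n] / n for n in timings}

for n in sorted(timings):
    print("%3i threads: %5.1fx speedup, %3.0f%% efficiency"
          % (n, speedup[n], 100 * efficiency[n]))
```

With these numbers, efficiency stays above 90% through 16 threads and drops to about 74% at 32 threads, which is where adding more threads starts to pay off less.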

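You can use this to determine how many threads make sense for your case. You can also easily swap out the ThreadPool for a process-based pool (multiprocessing.Pool) and see whether you get better results.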
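
One way to make that swap easy is to pass the pool class as a parameter. The sketch below is an illustration under stated assumptions: `fake_download` is a hypothetical stand-in for get_data that sleeps instead of touching the network, and `time_pool` is a hypothetical helper, not part of any library.

```python
import time
from multiprocessing.pool import ThreadPool


def fake_download(ID, delay=0.05):
    """Stand-in for get_data: sleep to mimic network latency, then return the ID."""
    time.sleep(delay)
    return ID


def time_pool(pool_cls, nworkers, func, items):
    """Map func over items with a pool of the given class.

    Returns (results, elapsed_seconds). Results keep the order of items.
    """
    pool = pool_cls(nworkers)
    try:
        tic = time.time()
        results = pool.map(func, items)
        toc = time.time()
    finally:
        pool.close()  # no more work will be submitted
        pool.join()   # wait for the workers to exit
    return results, toc - tic
```

For example, `time_pool(ThreadPool, 8, fake_download, range(16))` returns the 16 results and the elapsed time. Passing `multiprocessing.Pool` as `pool_cls` tries processes instead; in that case the mapped function must be picklable (defined at module level), and on platforms that spawn rather than fork worker processes the call should sit under an `if __name__ == "__main__":` guard.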

This example is also available as an IPython Notebook.
