
I am reading hundreds of HDF files and processing the data of each HDF separately. However, this takes a lot of time, since it works on one HDF file at a time. I just stumbled upon http://docs.python.org/library/multiprocessing.html and now I'm wondering how to speed things up using multiprocessing.

So far, this is what I came up with:

import numpy as np
from multiprocessing import Pool

def myhdf(date):
    ii      = dates.index(date)
    year    = date[0:4]
    month   = date[4:6]
    day     = date[6:8]
    rootdir = 'data/mydata/'
    filename = 'no2track'+year+month+day
    records = read_my_hdf(rootdir,filename)
    if records.size:
        results[ii] = np.mean(records)

dates = ['20080105','20080106','20080107','20080108','20080109']
results = np.zeros(len(dates))

pool = Pool(len(dates))
pool.map(myhdf,dates)

However, this is obviously not correct: each worker process gets its own copy of `results`, so the assignments in `myhdf` never reach the parent process. Can you follow my train of thought on what I'm trying to do? What do I need to change?


2 Answers


Try joblib for a friendlier wrapper around multiprocessing:

import numpy as np
from joblib import Parallel, delayed

def myhdf(date):
    # do the work for one date, exactly as in the question
    records = read_my_hdf('data/mydata/', 'no2track' + date[0:4] + date[4:6] + date[6:8])
    return np.mean(records) if records.size else None

# n_jobs=-1 uses all available cores; results come back in input order
results = Parallel(n_jobs=-1)(delayed(myhdf)(d) for d in dates)
Answered 2012-10-25T11:38:03.623

The Pool class's map function works like the standard Python built-in map: it guarantees that results come back in the same order as the inputs. Knowing that, the only other trick is to return results in a consistent way from the worker, then filter them afterwards.

import numpy as np
from multiprocessing import Pool

def myhdf(date):
    year    = date[0:4]
    month   = date[4:6]
    day     = date[6:8]
    rootdir = 'data/mydata/'
    filename = 'no2track'+year+month+day
    records = read_my_hdf(rootdir,filename)
    if records.size:
        return np.mean(records)

dates = ['20080105','20080106','20080107','20080108','20080109']

pool = Pool(len(dates))
results = pool.map(myhdf,dates)
# Filter out the None placeholders for empty files. Test against None
# explicitly: a plain `if result` would also drop a legitimate mean of 0.0.
results = np.array([result for result in results if result is not None])

If you want each result as soon as it is ready, regardless of input order, you can use imap_unordered.

Answered 2012-10-25T09:52:15.313