
I have written a program that can be summarized as follows:

import multiprocessing

def loadHugeData():
    #load it
    return data

def processHugeData(data, res_queue):
    for item in data:
        #process it
        res_queue.put(result)
    res_queue.put("END")

def writeOutput(outFile, res_queue):
    with open(outFile, 'w') as f:
        res=res_queue.get()
        while res!='END':
            f.write(res)
            res=res_queue.get()

res_queue = multiprocessing.Queue()

if __name__ == '__main__':
    data=loadHugeData()
    p = multiprocessing.Process(target=writeOutput, args=(outFile, res_queue))
    p.start()
    processHugeData(data, res_queue)
    p.join()

The real code (writeOutput() in particular) is much more complex. writeOutput() only uses the values it receives as arguments (meaning it does not reference data).

Basically, it loads a huge dataset into memory and processes it. Writing the output is delegated to a sub-process (it actually writes to multiple files, and that takes a lot of time). So every time a data item has been processed, it is sent to the sub-process through res_queue, which in turn writes the results to files as needed.

The sub-process does not need to access, read or modify the data loaded by loadHugeData() in any way. It only needs to use what the main process sends it through res_queue. Which brings me to my problem and question.

It looks to me as though the sub-process gets its own copy of the huge dataset (judging by the memory usage shown in top). Is that true? And if so, how can I avoid it (essentially avoiding using double the memory)?

I am using Python 2.6 and the program runs on Linux.


1 Answer


The multiprocessing module is effectively based on the fork system call, which creates a copy of the current process. Since you load your huge data before you fork (i.e. before you create the multiprocessing.Process), the child process inherits a copy of the data.
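A minimal sketch of that inheritance, assuming Linux and the fork start method (big_data and child are made-up names for illustration):

import multiprocessing

big_data = list(range(10 ** 6))   # loaded at import time, before the fork

def child():
    # With the fork start method (the default on Linux), the child sees the
    # data exactly as it was in the parent at fork time: it inherited a copy.
    print(len(big_data))

if __name__ == '__main__':
    p = multiprocessing.Process(target=child)
    p.start()
    p.join()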

However, if the operating system you are running on implements COW (copy-on-write), there will only actually be one copy of the data in physical memory unless you modify the data in either the parent or child process (both parent and child will share the same physical memory pages, albeit in different virtual address spaces); and even then, additional memory will only be allocated for the changes (in pagesize increments).
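Note that top's RES column counts shared copy-on-write pages against both processes, so it can make the child look as if it held a full private copy even while the pages are still shared. On Linux, a fairer per-process figure is the proportional set size (Pss) from /proc/<pid>/smaps; here is a small sketch of summing it (pss_kb is a made-up helper, not part of the answer):

import os

def pss_kb(pid):
    # Sum the Pss (proportional set size) of all mappings of a process.
    # Pages shared via copy-on-write are divided between the processes
    # sharing them, so this is fairer than top's RES column.
    total = 0
    with open('/proc/%d/smaps' % pid) as f:
        for line in f:
            if line.startswith('Pss:'):
                total += int(line.split()[1])   # values are reported in kB
    return total

print(pss_kb(os.getpid()))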

You can avoid this situation by starting the multiprocessing.Process before you load your huge data. Then the additional memory allocations made when the parent loads the data will not be reflected in the child process.
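Applied to the code in the question, that just means moving the start() of the writer ahead of loadHugeData() (a sketch only; outFile is the same placeholder as in the question):

if __name__ == '__main__':
    res_queue = multiprocessing.Queue()
    # Start the writer while the parent is still small, so the child
    # never inherits the huge dataset.
    p = multiprocessing.Process(target=writeOutput, args=(outFile, res_queue))
    p.start()

    data = loadHugeData()            # loaded only in the parent, after the fork
    processHugeData(data, res_queue)
    p.join()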

Edit: reflecting @Janne Karila's comment in the answer, as it is so relevant: "Note also that every Python object contains a reference count that is modified whenever the object is accessed. So, just reading a data structure can cause COW to copy."
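The reference count lives inside the object itself, so even a read-only access writes to the memory page holding the object. A small illustration (not from the answer):

import sys

data = ['a' * 100 for _ in range(3)]
print(sys.getrefcount(data[0]))   # the count is stored in the object header
item = data[0]                    # a "read-only" access still bumps the count...
print(sys.getrefcount(data[0]))   # ...and that write is enough to make a forked
                                  # child copy the page holding the object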

answered 2013-02-07T11:31:54.627