
Despite the warnings and the general feeling of confusion I got from the mass of questions on this subject, especially on StackOverflow, I parallelized a naive version of an embarrassingly parallel problem (basically read-image-do-stuff-return for a list of many images), returning the resulting NumPy array for each computation and updating a global NumPy array via the callback parameter, and immediately got a 5x speedup on an 8-core machine.

Now, since each callback call needs a lock, I probably didn't get the full 8x, but the result is encouraging.
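For context, the overall shape of what I'm doing looks roughly like the sketch below (process_image, accumulate and the file list are placeholders, not my actual code):

import multiprocessing as mp
import numpy as np

image_paths = ["img_%03d.tif" % i for i in range(100)]   # placeholder list of images
result_array = np.zeros((100, 100), dtype=np.float64)    # global accumulator

def process_image(path):
    # placeholder for the real read-image-do-stuff step
    return np.random.rand(100, 100)

def accumulate(partial):
    # callback: runs in the main process, one call at a time
    global result_array
    result_array += partial

if __name__ == "__main__":
    pool = mp.Pool(processes=8)
    for path in image_paths:
        pool.apply_async(process_image, args=(path,), callback=accumulate)
    pool.close()
    pool.join()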

I'm trying to find out whether this can be improved, or whether it is already a good result. Questions:

  • Are the NumPy arrays I return getting pickled?
  • Are the underlying NumPy buffers copied, or just passed by reference?
  • How can I find out what the bottleneck is? Are there any particularly useful techniques?
  • Can I improve on this, or is this kind of improvement common in such cases?

2 Answers


I've had great success sharing large NumPy arrays between multiple processes (by reference, of course) using the sharedmem module: https://bitbucket.org/cleemesser/numpy-sharedmem. Basically it suppresses the pickling that normally happens when NumPy arrays are passed around. All you have to do is, instead of:

import numpy as np
foo = np.empty(1000000)

do this:

import sharedmem
foo = sharedmem.empty(1000000)

and then you pass foo from one process to another, for example:

q = multiprocessing.Queue()
...
q.put(foo)
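Putting the pieces together, a minimal round trip looks something like this (a sketch only, assuming the by-reference behaviour described above; on Unix the child is forked, so it sees the same shared buffer):

import multiprocessing
import sharedmem

def worker(q):
    arr = q.get()      # a reference to the shared buffer, not a pickled copy
    arr[:] = 42.0      # writes land in memory the parent also sees

if __name__ == "__main__":
    foo = sharedmem.empty(1000000)
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(q,))
    p.start()
    q.put(foo)
    p.join()
    print(foo[0])      # 42.0 if the sharing works as described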

Note, however, that this module has a known risk of leaking memory on an ungraceful program exit, described to some extent here: http://grokbase.com/t/python/python-list/1144s75ps4/multiprocessing-shared-memory-vs-pickled-copies

Hope this helps. I use the module to speed up live image processing on multi-core machines (my project is https://github.com/vmlaker/sherlock).

Answered 2013-05-20T17:43:53.800

Note: This answer is how I ended up solving the issue, but Velimir's answer is more suited if you're doing intense transfers between your processes. I don't, so I didn't need sharedmem.

How I did it

It turns out that the time spent pickling my NumPy arrays was negligible, and I was worrying too much. Essentially, what I'm doing is a MapReduce operation, so this is what I do (a rough sketch follows the list below):

  • First, on Unix systems, any object you instantiate before spawning a process will be present (and copied) in the context of the process if needed. This is called copy-on-write (COW), and is handled automagically by the kernel, so it's pretty fast (and definitely fast enough for my purposes). The docs contained a lot of warnings about objects needing pickling, but here I didn't need that at all for my inputs.

  • Then, I ended up loading my images from the disk, from within each process. Each image is individually processed (mapped) by its own worker, so I neither lock nor send large batches of data, and I don't have any performance loss.

  • Each worker does its own reduction for the mapped images it handles, then sends the result to the main process with a Queue. The usual outputs I get from the reduction function are 32-bit float images with 4 or 5 channels, with sizes close to 5000 x 5000 pixels (~300 or 400MB of memory each).

  • Finally, I retrieve the intermediate reduction outputs from each process, then do a final reduction in the main process.
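In rough outline, the structure is something like the sketch below (load_image, map_one and reduce_two are placeholders for my real functions, and the image size is shrunk so the sketch runs quickly):

import multiprocessing
import numpy as np

N_WORKERS = 6

def load_image(path):
    # placeholder: each worker reads its own images from disk
    return np.random.rand(500, 500, 4).astype(np.float32)

def map_one(img):
    # placeholder per-image processing
    return img * 0.5

def reduce_two(a, b):
    # placeholder associative reduction
    return a + b

def worker(paths, queue):
    partial = None
    for path in paths:
        mapped = map_one(load_image(path))
        partial = mapped if partial is None else reduce_two(partial, mapped)
    queue.put(partial)   # one intermediate result per worker, sent once

if __name__ == "__main__":
    all_paths = ["img_%04d.tif" % i for i in range(60)]   # placeholder list
    chunks = [all_paths[i::N_WORKERS] for i in range(N_WORKERS)]
    queue = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=worker, args=(chunk, queue))
             for chunk in chunks]
    for p in procs:
        p.start()
    result = None
    for _ in procs:                 # drain the queue before joining
        partial = queue.get()
        result = partial if result is None else reduce_two(result, partial)
    for p in procs:
        p.join()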

I'm not seeing any performance loss when transferring my images with a queue, even when they're eating up a few hundred megabytes. I ran that on a 6 core workstation (with HyperThreading, so the OS sees 12 logical cores), and using multiprocessing with 6 cores was 6 times faster than without using multiprocessing.

(Strangely, running it on the full 12 cores wasn't any faster than 6, but I suspect it has to do with the limitations of HyperThreading.)

Profiling

Another of my concerns was profiling and quantifying how much overhead multiprocessing was generating. Here are a few useful techniques I learned:

  • Compared to the built-in time command (at least in my shell), the time executable (/usr/bin/time on Ubuntu) gives much more information, including things such as average RSS, context switches, average %CPU, ... I run it like this to get everything I can:

     $ /usr/bin/time -v python test.py
    
  • Profiling (with %run -p or %prun in IPython) only profiles the main process. You can hook cProfile to every process you spawn and save the individual profiles to the disk, like in this answer.

    I suggest adding a DEBUG_PROFILE flag of some kind that toggles this on/off; you never know when you might need it (see the sketch after this list).

  • Last but not least, you can get some more or less useful information from a syscall profile (mostly to see whether the OS is taking ages transferring heaps of data between the processes), by attaching to one of your running Python processes like:

     $ sudo strace -c -p <python-process-id>
    
Answered 2013-05-21T19:03:37.987