
Despite the warnings and the general feeling of confusion I got from the mass of questions on this subject, especially on StackOverflow, I parallelized a naive version of an embarrassingly parallel problem (basically read-image-do-stuff-return for a list of many images), returning the resulting NumPy array for each computation and updating a global NumPy array via the callback parameter, and immediately got a 5x speedup on an 8-core machine.

Now, since each callback call needs a lock, I probably didn't get the full 8x, but the result is encouraging.
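For context, the overall shape of what I'm doing looks roughly like the sketch below (process_image, accumulate and the file list are placeholders, not my actual code):

import multiprocessing as mp
import numpy as np

image_paths = ["img_%03d.tif" % i for i in range(100)]   # placeholder list of images
result_array = np.zeros((100, 100), dtype=np.float64)    # global accumulator

def process_image(path):
    # placeholder for the real read-image-do-stuff step
    return np.random.rand(100, 100)

def accumulate(partial):
    # callback: runs in the main process, one call at a time
    global result_array
    result_array += partial

if __name__ == "__main__":
    pool = mp.Pool(processes=8)
    for path in image_paths:
        pool.apply_async(process_image, args=(path,), callback=accumulate)
    pool.close()
    pool.join()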

I'm trying to find out whether this can be improved, or whether it is already a good result. Questions:

  • Are the NumPy arrays I return getting pickled?
  • Are the underlying NumPy buffers copied, or just passed by reference?
  • How can I find out what the bottleneck is? Are there any particularly useful techniques?
  • Can I improve on this, or is this kind of improvement common in such cases?

2 Answers


I've had great success sharing large NumPy arrays between multiple processes (by reference, of course) using the sharedmem module: https://bitbucket.org/cleemesser/numpy-sharedmem. Basically it suppresses the pickling that normally happens when NumPy arrays are passed around. All you have to do is, instead of:

import numpy as np
foo = np.empty(1000000)

do this:

import sharedmem
foo = sharedmem.empty(1000000)

and then you pass foo from one process to another, for example:

q = multiprocessing.Queue()
...
q.put(foo)
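Putting the pieces together, a minimal round trip looks something like this (a sketch only, assuming the by-reference behaviour described above; on Unix the child is forked, so it sees the same shared buffer):

import multiprocessing
import sharedmem

def worker(q):
    arr = q.get()      # a reference to the shared buffer, not a pickled copy
    arr[:] = 42.0      # writes land in memory the parent also sees

if __name__ == "__main__":
    foo = sharedmem.empty(1000000)
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(q,))
    p.start()
    q.put(foo)
    p.join()
    print(foo[0])      # 42.0 if the sharing works as described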

Note, however, that this module has a known risk of leaking memory on an ungraceful program exit, described to some extent here: http://grokbase.com/t/python/python-list/1144s75ps4/multiprocessing-shared-memory-vs-pickled-copies

Hope this helps. I use the module to speed up live image processing on multi-core machines (my project is https://github.com/vmlaker/sherlock).

Answered 2013-05-20T17:43:53.800

Note: This answer is how I ended up solving the issue, but Velimir's answer is more suited if you're doing intense transfers between your processes. I don't, so I didn't need sharedmem.

How I did it

It turns out that the time spent pickling my NumPy arrays was negligible, and I was worrying too much. Essentially, what I'm doing is a MapReduce operation, so this is what I do (a rough sketch follows the list below):

  • First, on Unix systems, any object you instantiate before spawning a process will be present (and copied) in the context of the process if needed. This is called copy-on-write (COW), and is handled automagically by the kernel, so it's pretty fast (and definitely fast enough for my purposes). The docs contained a lot of warnings about objects needing pickling, but here I didn't need that at all for my inputs.

  • Then, I ended up loading my images from the disk, from within each process. Each image is individually processed (mapped) by its own worker, so I neither lock nor send large batches of data, and I don't have any performance loss.

  • Each worker does its own reduction for the mapped images it handles, then sends the result to the main process with a Queue. The usual outputs I get from the reduction function are 32-bit float images with 4 or 5 channels, with sizes close to 5000 x 5000 pixels (~300 or 400MB of memory each).

  • Finally, I retrieve the intermediate reduction outputs from each process, then do a final reduction in the main process.
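In rough outline, the structure is something like the sketch below (load_image, map_one and reduce_two are placeholders for my real functions, and the image size is shrunk so the sketch runs quickly):

import multiprocessing
import numpy as np

N_WORKERS = 6

def load_image(path):
    # placeholder: each worker reads its own images from disk
    return np.random.rand(500, 500, 4).astype(np.float32)

def map_one(img):
    # placeholder per-image processing
    return img * 0.5

def reduce_two(a, b):
    # placeholder associative reduction
    return a + b

def worker(paths, queue):
    partial = None
    for path in paths:
        mapped = map_one(load_image(path))
        partial = mapped if partial is None else reduce_two(partial, mapped)
    queue.put(partial)   # one intermediate result per worker, sent once

if __name__ == "__main__":
    all_paths = ["img_%04d.tif" % i for i in range(60)]   # placeholder list
    chunks = [all_paths[i::N_WORKERS] for i in range(N_WORKERS)]
    queue = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=worker, args=(chunk, queue))
             for chunk in chunks]
    for p in procs:
        p.start()
    result = None
    for _ in procs:                 # drain the queue before joining
        partial = queue.get()
        result = partial if result is None else reduce_two(result, partial)
    for p in procs:
        p.join()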

I'm not seeing any performance loss when transferring my images with a queue, even when they're eating up a few hundred megabytes. I ran that on a 6 core workstation (with HyperThreading, so the OS sees 12 logical cores), and using multiprocessing with 6 cores was 6 times faster than without using multiprocessing.

(Strangely, running it on the full 12 cores wasn't any faster than 6, but I suspect it has to do with the limitations of HyperThreading.)

Profiling

Another of my concerns was profiling and quantifying how much overhead multiprocessing was generating. Here are a few useful techniques I learned:

  • Compared to the built-in time command (at least in my shell), the time executable (/usr/bin/time on Ubuntu) gives much more information, including things such as average RSS, context switches, average %CPU, ... I run it like this to get everything I can:

     $ /usr/bin/time -v python test.py
    
  • Profiling (with %run -p or %prun in IPython) only profiles the main process. You can hook cProfile to every process you spawn and save the individual profiles to the disk, like in this answer.

    I suggest adding a DEBUG_PROFILE flag of some kind that toggles this on/off; you never know when you might need it (see the sketch after this list).

  • Last but not least, you can get some more or less useful information from a syscall profile (mostly to see whether the OS is taking ages transferring heaps of data between the processes), by attaching to one of your running Python processes like:

     $ sudo strace -c -p <python-process-id>
    
Answered 2013-05-21T19:03:37.987