When you invoke multiprocessing.Pool, the multiprocessing module creates several new processes (using os.fork or similar).

By default, during a fork, new processes inherit all open file descriptors.

When you invoke subprocess.Popen with a subprocess.PIPE argument, the subprocess module creates some new pipe file descriptors to send data to/from the new process. In this particular case, the pipe is used to send data from the parent process (python) to the child (gzip), and gzip will exit, and thus make the proc.wait() finish, when all write access to the pipe goes away. (This is what generates "EOF on a pipe": no more writable file descriptors exist to that pipe.)
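For concreteness, here is a minimal sketch of the kind of pipeline described above (the file name and the data being sent are made up for illustration):

import subprocess

# Pipe data from this python process into a gzip child; gzip's compressed
# output goes to an ordinary file.
with open('out.gz', 'wb') as outfile:
    proc = subprocess.Popen(['gzip', '-c'], stdin=subprocess.PIPE,
                            stdout=outfile)
    proc.stdin.write(b'some data\n')
    proc.stdin.close()   # the last write end closes, so gzip sees EOF
    proc.wait()          # returns once gzip exits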
Thus, in this case, if you (all in the "original" python process) do this in this sequence:

- create a pipe
- create some multiprocessing.Pool processes
- send data to gzip
- close the pipe to gzip

then, due to the behavior of fork, each of the Pool processes has an os.dup of the write-to-gzip pipe, so gzip continues waiting for more data, which those Pool processes can (but never do) send. The gzip process will exit as soon as the Pool processes close their pipe descriptors.
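To see that failure mode concretely, here is a sketch of the sequence above (assuming a Unix-like system where Pool starts its workers with fork):

import multiprocessing
import subprocess

outfile = open('out.gz', 'wb')
proc = subprocess.Popen(['gzip', '-c'], stdin=subprocess.PIPE,
                        stdout=outfile)      # 1. create a pipe
pool = multiprocessing.Pool()                # 2. fork workers; each one inherits
                                             #    a duplicate of proc.stdin's descriptor
proc.stdin.write(b'some data\n')             # 3. send data to gzip
proc.stdin.close()                           # 4. close *our* write end
proc.wait()   # hangs: the workers' copies keep the pipe open, so gzip never sees EOF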
Fixing this in real (more complicated) code can be nontrivial. Ideally, what you would like is for multiprocessing.Pool to know (magically, somehow) which file descriptors should be retained, and which should not, but this is not as simple as "just close a bunch of descriptors in the created child processes":
import multiprocessing

output = open('somefile', 'a')

def somefunc(arg):
    # ... do some computation, etc ...
    output.write(result)

pool = multiprocessing.Pool()
pool.map(somefunc, iterable)
Clearly output.fileno() must be shared by the worker processes here.
You could try to use the Pool's initializer to invoke proc.stdin.close (or os.close on a list of fd's), but then you need to arrange to keep track of descriptors-to-close. It's probably simplest to restructure your code to avoid creating a pool "at the wrong time".
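Here is a sketch of the initializer approach, for illustration only (the helper name is made up, and it assumes the fork start method, so the inherited descriptor keeps the same number in each worker):

import multiprocessing
import os
import subprocess

def close_inherited_fds(fds):
    # Runs once in every worker, right after the fork.
    for fd in fds:
        os.close(fd)

outfile = open('out.gz', 'wb')
proc = subprocess.Popen(['gzip', '-c'], stdin=subprocess.PIPE,
                        stdout=outfile)

# Tell each worker to close its inherited copy of the pipe's write end.
pool = multiprocessing.Pool(initializer=close_inherited_fds,
                            initargs=([proc.stdin.fileno()],))

proc.stdin.write(b'some data\n')
proc.stdin.close()
proc.wait()          # gzip exits: the workers already closed their copies

pool.close()
pool.join()

The restructuring alternative is simply the reverse ordering: create the Pool first and the pipe afterwards, so the workers never inherit its write end and there is nothing to keep track of.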