python - pathos pools: renew worker processes after N tasks

Question

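I'm building a parallel Python application that essentially calls a C wrapper around an external library. Parallelism is needed to run computations on all CPU cores simultaneously.

I ended up using pathos.multiprocessing.ProcessPool, but these pools lack the maxtasksperchild argument of the standard multiprocessing.Pool class constructor (see the reference here). I need this feature because the C library relies on the process clock to define some execution time limits, and those limits are eventually reached as tasks pile up in the same process.

Is there a way to make the ProcessPool manager renew worker processes after a given number of tasks?

Example code to clarify my intent: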

from pathos.pools import ProcessPool
from os import getpid
import collections

def print_pid(task_id):
    pid = getpid()
    return pid

if __name__ == "__main__":
    NUM_TASKS = 50
    MAX_PER_CHILD = 2


    # limit each process to maximum MAX_PER_CHILD tasks
    # we would like the pool to exit the process and spawn a new one
    # when a task counter reaches the limit
    # below argument 'maxtasksperchild' would work with standard 'multiprocessing'
    pool = ProcessPool(ncpu=2, maxtasksperchild=MAX_PER_CHILD)
    results = pool.map(print_pid, range(NUM_TASKS), chunksize=1)

    tasks_per_pid = dict(collections.Counter(results))
    print(tasks_per_pid)

# printed result
# {918: 8, 919: 6, 920: 6, 921: 6, 922: 6, 923: 6, 924: 6, 925: 6}
# observe that all processes did more than MAX_PER_CHILD tasks

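What I tried

Setting maxtasksperchild in the ProcessPool constructor (see the naive example above) does not seem to do anything
Calling sys.exit() in the worker function makes the program hang
I found some hints while digging into the source code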

score 2 · Accepted Answer

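There are two pools in pathos.multiprocessing: ProcessPool and _ProcessPool. The former is designed to have an enhanced pool lifecycle that minimizes startup time, with persistence and restart capabilities; however, it lacks some of the multiprocessing keywords. The latter (_ProcessPool) sits one level down in the API design and provides the same interface as the multiprocessing Pool (but uses dill). So, take a look at _ProcessPool.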
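
As a rough illustration of the suggestion above, here is a minimal sketch of the question's example rewritten against _ProcessPool. It assumes that _ProcessPool accepts the same processes and maxtasksperchild keywords as multiprocessing.Pool, which is what the answer implies. The fallback import and the pids_over_limit helper are not part of the original post; they are added here only so the sketch runs without pathos and so the result is easy to check.

```python
import collections
from os import getpid

try:
    # Per the answer: the lower-level pool exposing multiprocessing's interface.
    from pathos.multiprocessing import _ProcessPool as Pool
except ImportError:
    # Assumption for this sketch only: fall back to the standard library pool,
    # which takes the same keywords, so the example still runs without pathos.
    from multiprocessing import Pool


def print_pid(task_id):
    # Return the PID of the worker process that handled this task.
    return getpid()


def pids_over_limit(tasks_per_pid, limit):
    # Hypothetical helper: PIDs that handled more than `limit` tasks.
    return sorted(pid for pid, n in tasks_per_pid.items() if n > limit)


if __name__ == "__main__":
    NUM_TASKS = 50
    MAX_PER_CHILD = 2

    # Each worker should be replaced after MAX_PER_CHILD tasks; chunksize=1
    # makes every input item count as one task.
    with Pool(processes=2, maxtasksperchild=MAX_PER_CHILD) as pool:
        results = pool.map(print_pid, range(NUM_TASKS), chunksize=1)

    tasks_per_pid = dict(collections.Counter(results))
    print(tasks_per_pid)
    print("PIDs over the limit:", pids_over_limit(tasks_per_pid, MAX_PER_CHILD))
```

If worker renewal behaves as expected, many more distinct PIDs should appear than in the question's output, each with a count of at most MAX_PER_CHILD, and the second line should report an empty list.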
