python - 在 python 中使用 map.pool 的原因是什么？

Question

我有一个正在运行的命令行程序，并将文本作为参数输入：

somecommand.exe < someparameters_tin.txt

它运行一段时间（通常是一个小时到几个小时的一小部分），然后将结果写入许多文本文件。我正在尝试编写一个脚本来同时启动其中的几个，使用多核机器上的所有内核。在其他操作系统上我会分叉，但这并没有在 Windows 的许多脚本语言中实现。Python 的多处理看起来可以解决问题，所以我想我会尝试一下，尽管我根本不知道 python。我希望有人能告诉我我做错了什么。

我编写了一个脚本（如下），我指向一个目录，如果找到可执行文件和输入文件，然后使用 pool.map 和一个 n 池启动它们，以及一个使用调用的函数。我看到的是，最初（启动第一组 n 个进程）看起来不错，100% 使用 n 个内核。但随后我看到进程处于空闲状态，不使用或仅使用百分之几的 CPU。那里总是有 n 个进程，但它们做的并不多。当他们去写入输出数据文件时似乎会发生这种情况，一旦开始，一切都会陷入困境，整体核心利用率范围从百分之几到偶尔达到 50-60% 的峰值，但从未接近 100%。

如果我可以附上它（编辑：我不能，至少现在不能）这里是进程的运行时间图。较低的曲线是当我打开 n 个命令提示符并手动保持 n 个进程一次运行时，轻松地将计算机保持在 100% 附近。（这条线是规则的，在 32 个不同的进程中从接近 0 小时慢慢增加到 0.7 小时，改变一个参数。）上面的线是这个脚本的某个版本的结果——运行时间平均增加了大约 0.2 小时，并且是更不可预测，就像我采取了底线并添加了 0.2 + 一个随机数。

这是该图的链接：运行时图

编辑：现在我想我可以添加情节了。在此处输入图像描述

我究竟做错了什么？

from multiprocessing import Pool, cpu_count, Lock
from subprocess import call
import glob, time, os, shlex, sys
import random

def launchCmd(s):
    mypid = os.getpid()
    try:
        retcode = call(s, shell=True)
        if retcode < 0:
            print >>sys.stderr, "Child was terminated by signal", -retcode
        else:
            print >>sys.stderr, "Child returned", retcode
    except OSError, e:
        print >>sys.stderr, "Execution failed:", e

if __name__ == '__main__':

    # ******************************************************************
    # change this to the path you have the executable and input files in
    mypath = 'E:\\foo\\test\\'
    # ******************************************************************

    startpath = os.getcwd()
    os.chdir(mypath)
    # find list of input files
    flist = glob.glob('*_tin.txt')
    elist = glob.glob('*.exe')
    # this will not act as expected if there's more than one .exe file in that directory!
    ex = elist[0] + ' < '

    print
    print 'START'
    print 'Path: ', mypath
    print 'Using the executable: ', ex
    nin = len(flist)
    print 'Found ',nin,' input files.'
    print '-----'
    clist = [ex + s for s in flist]
    cores = cpu_count()
    print 'CPU count ', cores
    print '-----'

    # ******************************************************
    # change this to the number of processes you want to run
    nproc = cores -1
    # ******************************************************

    pool = Pool(processes=nproc, maxtasksperchild=1)    # start nproc worker processes
    # mychunk = int(nin/nproc)      # this didn't help
    # list.reverse(clist)           # neither did this, or randomizing the list
    pool.map(launchCmd, clist)      # launch processes
    os.chdir(startpath)             # return to original working directory
    print 'Done'

score 0 · Accepted Answer

I think I know this. When you call map, it breaks the list of tasks into 'chunks' for each process. By default, it uses chunks large enough that it can send one to each process. This works on the assumption that all the tasks take about the same length of time to complete.

In your situation, presumably the tasks can take very different amounts of time to complete. So some workers finish before others, and those CPUs sit idle. If that's the case, then this should work as expected:

pool.map(launchCmd, clist, chunksize=1)

Less efficient, but it should mean that each worker gets more tasks as it finishes until they're all complete.

score 0 · Accepted Answer

进程是否有机会尝试写入公共文件？在 Linux 下，它可能会正常工作，破坏数据但不会减慢速度；但在 Windows 下，一个进程可能会获取文件，而所有其他进程可能会挂起，等待文件可用。

如果你用一些使用 CPU 但不写入磁盘的愚蠢任务替换你的实际任务列表，问题会重现吗？例如，您可能有计算某个大文件的 md5sum 的任务；一旦文件被缓存，其他任务将是纯 CPU，然后单行输出到标准输出。或计算一些昂贵的功能或其他东西。

python - 在 python 中使用 map.pool 的原因是什么？

2 回答 2

Related

Reference