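[[ Problem ]]
I use the psutil library to set the CPU affinity of subprocesses, and I run the script with mpirun under a job scheduler.
After deleting the job from the job scheduler, I ssh into the node.
When I check with ps aux, the Python main process and the 11 Python subprocesses are still running and still updating the log.
Only the mpirun process was killed.
Setting the CPU affinity of a Python subprocess with psutil does not change its pid.
When I run the script without mpirun, the job scheduler terminates the Python subprocesses without any problem.
[[ Question ]]
How can I make sure the processes are actually killed when the job scheduler deletes the job?
Thanks.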
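One workaround that may be worth trying, offered as a sketch rather than a confirmed fix: since the Python main process outlives mpirun, it could watch for the death of its launcher by polling os.getppid() and shut itself down when the parent pid changes. This assumes the launcher (mpirun or an Open MPI daemon) is the direct parent of the Python main process, which the ps aux output does not show. The names parent_is_gone and start_parent_watchdog are hypothetical helpers, not part of psutil, multiprocessing, or Open MPI.

```python
import os
import threading
import time


def parent_is_gone(original_ppid):
    """Return True once the current parent pid differs from original_ppid."""
    return os.getppid() != original_ppid


def start_parent_watchdog(on_orphaned, poll_seconds=1.0, original_ppid=None):
    """Poll the parent pid in a daemon thread and call on_orphaned() once it changes.

    Hypothetical helper for illustration only. If original_ppid is not given,
    the parent pid at call time is used, so call this early in main().
    """
    if original_ppid is None:
        original_ppid = os.getppid()

    def watch():
        # When the launcher dies, the kernel re-parents this process,
        # so os.getppid() stops matching the recorded value.
        while not parent_is_gone(original_ppid):
            time.sleep(poll_seconds)
        on_orphaned()

    thread = threading.Thread(target=watch)
    thread.daemon = True  # do not keep the interpreter alive for the watchdog
    thread.start()
    return thread
```

In the test code, main() could call start_parent_watchdog right after creating mp_pool, with a callback that calls mp_pool.terminate() and then os._exit(1). Whether this is enough depends on how the scheduler and mpirun tear the job down, which I have not verified.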
[[ Test code ]]
# ps_test.py (Python 2.7; note the use of xrange)
import logging
import psutil
import time
import multiprocessing as mp


def main():
    logging.basicConfig(format='%(asctime)s %(message)s', level=logging.INFO)
    # set_process_affinity(0)
    num_cpu = mp.cpu_count()
    logging.info('call worker')
    # Starts num_cpu - 1 subprocesses
    # For a node with 12 cpu, this makes 11 processes.
    mp_pool = mp.Pool(processes=num_cpu - 1)
    result_list = [mp_pool.apply_async(worker, (i,)) for i in range(1, num_cpu)]
    mp_pool.close()
    mp_pool.join()


def set_process_affinity(cpu_id):
    # Log the process info before and after pinning to check whether the pid changes.
    psutil_proc = psutil.Process()
    logging.info("cpu_id #%d :: proc_info_before %s" % (cpu_id, psutil_proc))
    psutil_proc.cpu_affinity([cpu_id])
    psutil_proc = psutil.Process()
    logging.info("cpu_id #%d :: proc_info_after %s" % (cpu_id, psutil_proc))


def worker(worker_id):
    # Pin each worker to the CPU with the same index as its worker id.
    cpu_id = worker_id
    set_process_affinity(cpu_id)
    logging.basicConfig(format='%(asctime)s %(message)s', level=logging.INFO)
    for cycle in range(10000):
        logging.info("worker #%d :: cycle %d" % (worker_id, cycle))
        # Busy loop to keep the CPU at roughly 100 %.
        waste_time = []
        for i in xrange(1000000):
            waste_time += [i]
        # time.sleep(10)


if __name__ == '__main__':
    main()
[[ Commands in the job script ]]
With mpirun (openmpi-1.8.1):
mpirun -np 1 --map-by node python2.7 -u ps_test.py &> ps_test.log
Without mpirun:
python2.7 -u ps_test.py &> ps_test.log
[[ Output of ps aux | grep "rxu" after ssh-ing into the node ]]
With mpirun:
= Before killing the job with the job scheduler (with mpirun) =
rxu 3686 0.3 0.0 105856 4396 ? Sl 11:15 0:00 mpirun -np 1 --map-by node python2.7 -u ps_test.py
rxu 3688 0.6 0.1 172408 11660 ? Sl 11:15 0:00 python2.7 -u ps_test.py
rxu 3689 96.0 0.4 206692 40656 ? R 11:15 0:15 python2.7 -u ps_test.py
rxu 3690 96.0 0.4 206692 40660 ? R 11:15 0:15 python2.7 -u ps_test.py
rxu 3691 96.0 0.4 206692 40676 ? R 11:15 0:15 python2.7 -u ps_test.py
rxu 3692 96.0 0.4 206692 40680 ? R 11:15 0:15 python2.7 -u ps_test.py
rxu 3693 96.0 0.4 206696 40688 ? R 11:15 0:15 python2.7 -u ps_test.py
rxu 3694 102 0.4 206696 40684 ? R 11:15 0:15 python2.7 -u ps_test.py
rxu 3695 102 0.4 203328 40684 ? R 11:15 0:15 python2.7 -u ps_test.py
rxu 3696 102 0.4 203328 40668 ? R 11:15 0:15 python2.7 -u ps_test.py
rxu 3697 102 0.4 203328 40668 ? R 11:15 0:15 python2.7 -u ps_test.py
rxu 3698 101 0.4 203328 40668 ? R 11:15 0:15 python2.7 -u ps_test.py
rxu 3699 102 0.4 203332 40672 ? R 11:15 0:15 python2.7 -u ps_test.py
... some processes from pts/1 including ssh into the node
= After killing the job with the job scheduler (with mpirun) =
The mpirun process gets killed.
The Python main process (the one using 0.0 %CPU) survives.
All 11 Python subprocesses survive (none got killed).
The machine has 12 CPUs.
rxu 3688 0.0 0.1 172408 11660 ? Sl 11:15 0:00 python2.7 -u ps_test.py
rxu 3689 99.6 0.4 206692 40708 ? R 11:15 3:34 python2.7 -u ps_test.py
rxu 3690 99.6 0.4 206692 40732 ? R 11:15 3:34 python2.7 -u ps_test.py
rxu 3691 99.6 0.4 206692 40724 ? R 11:15 3:34 python2.7 -u ps_test.py
rxu 3692 99.6 0.4 206692 40728 ? R 11:15 3:34 python2.7 -u ps_test.py
rxu 3693 99.6 0.4 206696 40736 ? R 11:15 3:34 python2.7 -u ps_test.py
rxu 3694 100 0.4 206696 40732 ? R 11:15 3:34 python2.7 -u ps_test.py
rxu 3695 100 0.4 203328 40732 ? R 11:15 3:34 python2.7 -u ps_test.py
rxu 3696 99.9 0.4 203328 40720 ? R 11:15 3:33 python2.7 -u ps_test.py
rxu 3697 100 0.4 203328 40720 ? R 11:15 3:34 python2.7 -u ps_test.py
rxu 3698 99.9 0.4 203328 40720 ? R 11:15 3:33 python2.7 -u ps_test.py
rxu 3699 100 0.4 203332 40724 ? R 11:15 3:34 python2.7 -u ps_test.py
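The ps aux listing above does not show parent pids, process groups, or sessions, so it cannot tell whether the surviving Python processes still belong to the same group as mpirun. A minimal sketch for logging that information from inside the script follows; process_ids is a hypothetical helper name, and whether the job scheduler targets a process group or a session when it kills a job is an assumption I have not checked.

```python
import os


def process_ids():
    """Return the pid, parent pid, process-group id, and session id of the caller."""
    return {
        "pid": os.getpid(),
        "ppid": os.getppid(),
        "pgid": os.getpgid(0),  # 0 means "the calling process"
        "sid": os.getsid(0),
    }
```

Calling logging.info("ids %s", process_ids()) at the start of main() and worker() would record these values for both the mpirun run and the plain run, so the two cases can be compared.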
= Log output (with mpirun) =
(The pid did not change when the CPU affinity of the subprocess was set.)
2016-06-09 11:15:45,280 call worker
2016-06-09 11:15:45,333 cpu_id #1 :: proc_info_before psutil.Process(pid=3689, name='python2.7')
2016-06-09 11:15:45,334 cpu_id #1 :: proc_info_after psutil.Process(pid=3689, name='python2.7')
2016-06-09 11:15:45,335 worker #1 :: cycle 0
2016-06-09 11:15:45,335 cpu_id #2 :: proc_info_before psutil.Process(pid=3690, name='python2.7')
2016-06-09 11:15:45,336 cpu_id #2 :: proc_info_after psutil.Process(pid=3690, name='python2.7')
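The unchanged pids in the log are what one would expect, since CPU affinity is a scheduling attribute of an existing process and setting it does not create a new one. The sketch below checks this with the standard library's os.sched_setaffinity instead of psutil, which means it assumes Python 3.3 or later on Linux rather than the Python 2.7 used in the test code. pin_and_report is a hypothetical helper name.

```python
import os


def pin_and_report(cpu_id):
    """Pin the calling process to one CPU and return (pid_before, pid_after, affinity)."""
    pid_before = os.getpid()
    os.sched_setaffinity(0, {cpu_id})  # 0 means "the calling process"
    pid_after = os.getpid()
    return pid_before, pid_after, os.sched_getaffinity(0)
```

The cpu_id passed in should be one of the CPUs the process is currently allowed to run on, for example an element of os.sched_getaffinity(0).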
Without mpirun:
= Before killing the job with the job scheduler (without mpirun) =
The main process is the one using 0.3 %CPU.
The 11 workers are the ones using 101 %CPU.
There are 12 CPUs on the machine.
rxu 3399 0.3 0.1 237940 11660 ? Sl 11:06 0:00 python2.7 -u ps_test.py
rxu 3400 101 0.4 206688 40648 ? R 11:06 0:35 python2.7 -u ps_test.py
rxu 3401 101 0.4 206688 40652 ? R 11:06 0:35 python2.7 -u ps_test.py
rxu 3402 101 0.4 206688 40672 ? R 11:06 0:35 python2.7 -u ps_test.py
rxu 3403 101 0.4 206688 40676 ? R 11:06 0:35 python2.7 -u ps_test.py
rxu 3404 101 0.4 206692 40684 ? R 11:06 0:35 python2.7 -u ps_test.py
rxu 3405 101 0.4 206692 40680 ? R 11:06 0:35 python2.7 -u ps_test.py
rxu 3406 101 0.4 203324 40680 ? R 11:06 0:35 python2.7 -u ps_test.py
rxu 3407 101 0.4 203324 40664 ? R 11:06 0:35 python2.7 -u ps_test.py
rxu 3408 101 0.4 203324 40664 ? R 11:06 0:35 python2.7 -u ps_test.py
rxu 3409 101 0.4 203324 40664 ? R 11:06 0:35 python2.7 -u ps_test.py
rxu 3410 101 0.4 203328 40668 ? R 11:06 0:35 python2.7 -u ps_test.py
... some processes from pts/1 including ssh into the node
= After killing the job with the job scheduler (without mpirun) =
Nothing. No Python processes remain
... except some processes from pts/1, including the ssh session into the node