cluster-computing - Orphaned processes when Pacemaker kills the main monitor script (LSB) on timeout

Question

In our Pacemaker + corosync cluster

    Last updated: Thu Oct 22 21:16:33 2015
    Last change: Thu Oct 22 17:25:13 2015 via cibadmin on aws015
    Stack: corosync
    Current DC: aws015 (2887647247) - partition with quorum
    Version: 1.1.10-42f2063
    4 Nodes configured
    16 Resources configured

we have the following situation. We wrote a Python LSB script that checks the state of an application, and we use it as a resource:

    primitive pm2_app_gardenscapesDynamo_lsb lsb:pm2_app_gardenscapesDynamo \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s" \
        op monitor interval="30s" timeout="60s" on-fail="restart" \
        meta failure-timeout="10s" migration-threshold="1"

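The check is performed by a utility that can hang (the LSB script launches the utility and waits for its reply). So when the operation times out, Pacemaker kills our Python script, but the hung utility stays in memory and never dies.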

Is it possible to prevent this?

score 1 · Accepted Answer

You need to upgrade to Pacemaker 1.1.12 or later.

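This happens because Pacemaker starts each resource agent (RA) in its own process group. When an operation times out, Pacemaker 1.1.10 kills only the RA itself, leaving any child processes it may have started behind as "orphans".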

Version 1.1.12 kills the entire process group instead.
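The difference between the two behaviours can be illustrated with a small Python sketch. This is not Pacemaker's code (which is C); it only shows the underlying POSIX mechanism. The function names are hypothetical, and `start_new_session=True` is used here as one convenient way to give the child its own process group.

```python
import os
import signal
import subprocess


def start_agent(cmd):
    """Start cmd as the leader of a new process group (pgid == pid)."""
    # start_new_session=True makes the child call setsid(), which also
    # puts it in a fresh process group that its own children inherit.
    return subprocess.Popen(cmd, start_new_session=True)


def kill_on_timeout(proc, whole_group):
    """Kill a timed-out agent, either alone or together with its group."""
    if whole_group:
        # Signals every process whose pgid equals the agent's pid,
        # including helpers the agent spawned.
        os.killpg(proc.pid, signal.SIGKILL)
    else:
        # Signals only the agent; its children keep running as orphans.
        os.kill(proc.pid, signal.SIGKILL)
    proc.wait()  # reap the agent itself
```

With `whole_group=False` a helper started by the agent survives the kill, which matches the orphaning described for 1.1.10; with `whole_group=True` the helper receives the signal as well.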

The relevant code is in lib/common/mainloop.c, in the function child_kill_helper.
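Where an upgrade is not immediately possible, one possible workaround is for the monitor script to bound the utility call itself, so the utility is killed before Pacemaker's 60s operation timeout fires. The following is a minimal sketch under assumptions: the function name and the 45-second value are hypothetical, the real utility command is not shown in the question, and the utility is assumed to be a direct child of the script that does not spawn further processes.

```python
import subprocess

# Assumed value: anything comfortably below the 60s monitor timeout.
HELPER_TIMEOUT = 45


def helper_ok(cmd, timeout=HELPER_TIMEOUT):
    """Return True if cmd exits with status 0 within timeout seconds."""
    try:
        # On timeout, subprocess.run kills the direct child and reaps it
        # before raising TimeoutExpired, so no hung process is left over.
        result = subprocess.run(cmd, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

The script would then map the boolean result to whatever exit status its `status` action is expected to return.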
