我最近陷入了Double Fork的陷阱,并在终于找到答案之前登陆了这个页面。即使问题不同,症状也是相同的:
下面显示了基于 SNMP 守护程序的示例的最小测试代码
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
int main(int argc, char* argv[])
{
//We omit the -f option (do not Fork) to reproduce the problem
char * options[]={"/usr/local/sbin/snmpd",/*"-f","*/-d","--master=agentx", "-Dagentx","--agentXSocket=tcp:localhost:1706", "udp:10161", (char*) NULL};
pid_t pid = fork();
if ( 0 > pid ) return -1;
switch(pid)
{
case 0:
{ //Child launches SNMP daemon
execv(options[0],options);
exit(-2);
break;
}
default:
{
sleep(10); //Simulate "long" activity
kill(pid,SIGTERM);//kill what should be child,
//i.e the SNMP daemon I assume
printf("Signal sent to %d\n",pid);
sleep(10); //Simulate "long" operation before closing
waitpid(pid);
printf("SNMP should be now down\n");
getchar();//Blocking (for observation only)
break;
}
}
printf("Bye!\n");
}
在第一阶段,主进程 (7699) 启动 SNMP 守护进程 (7700),但我们可以看到这个进程现在是Defunct/Zombie。在旁边,我们可以看到另一个带有我们指定选项的进程(7702)
[nils@localhost ~]$ ps -ef | tail
root 7439 2 0 23:00 ? 00:00:00 [kworker/1:0]
root 7494 2 0 23:03 ? 00:00:00 [kworker/0:1]
root 7544 2 0 23:08 ? 00:00:00 [kworker/0:2]
root 7605 2 0 23:10 ? 00:00:00 [kworker/1:2]
root 7698 729 0 23:11 ? 00:00:00 sleep 60
nils 7699 2832 0 23:11 pts/0 00:00:00 ./main
nils 7700 7699 0 23:11 pts/0 00:00:00 [snmpd] <defunct>
nils 7702 1 0 23:11 ? 00:00:00 /usr/local/sbin/snmpd -Lo -d --master=agentx -Dagentx --agentXSocket=tcp:localhost:1706 udp:10161
nils 7727 3706 0 23:11 pts/1 00:00:00 ps -ef
nils 7728 3706 0 23:11 pts/1 00:00:00 tail
在模拟 10 秒后,我们将尝试杀死我们知道的唯一进程(7700)。我们最终通过waitpid()取得了成功。但是进程 7702 仍然在这里
[nils@localhost ~]$ ps -ef | tail
root 7431 2 0 23:00 ? 00:00:00 [kworker/u256:1]
root 7439 2 0 23:00 ? 00:00:00 [kworker/1:0]
root 7494 2 0 23:03 ? 00:00:00 [kworker/0:1]
root 7544 2 0 23:08 ? 00:00:00 [kworker/0:2]
root 7605 2 0 23:10 ? 00:00:00 [kworker/1:2]
root 7698 729 0 23:11 ? 00:00:00 sleep 60
nils 7699 2832 0 23:11 pts/0 00:00:00 ./main
nils 7702 1 0 23:11 ? 00:00:00 /usr/local/sbin/snmpd -Lo -d --master=agentx -Dagentx --agentXSocket=tcp:localhost:1706 udp:10161
nils 7751 3706 0 23:12 pts/1 00:00:00 ps -ef
nils 7752 3706 0 23:12 pts/1 00:00:00 tail
在给 getchar() 函数一个字符后,我们的主进程终止,但 pid 为 7002 的 SNMP 守护进程仍然存在
[nils@localhost ~]$ ps -ef | tail
postfix 7399 1511 0 22:58 ? 00:00:00 pickup -l -t unix -u
root 7431 2 0 23:00 ? 00:00:00 [kworker/u256:1]
root 7439 2 0 23:00 ? 00:00:00 [kworker/1:0]
root 7494 2 0 23:03 ? 00:00:00 [kworker/0:1]
root 7544 2 0 23:08 ? 00:00:00 [kworker/0:2]
root 7605 2 0 23:10 ? 00:00:00 [kworker/1:2]
root 7698 729 0 23:11 ? 00:00:00 sleep 60
nils 7702 1 0 23:11 ? 00:00:00 /usr/local/sbin/snmpd -Lo -d --master=agentx -Dagentx --agentXSocket=tcp:localhost:1706 udp:10161
nils 7765 3706 0 23:12 pts/1 00:00:00 ps -ef
nils 7766 3706 0 23:12 pts/1 00:00:00 tail
结论
我们忽略了双叉机制的事实让我们认为kill动作没有成功。但实际上我们只是杀死了错误的进程!!
通过添加-f选项( Do Not (Double) Fork )一切都按预期进行