perl - Perl、IO:Async::Loop 和管道到 sendmail 的死锁

Question

当我们尝试从 Perl FCGI 进程发送电子邮件时，我们看到了糟糕的 sendmail 进程。这些过程花费的时间太长，从几小时到一天，因为它应该只是对在 sendmail 中配置为智能主机的服务器进行中继。来自 FCGI 进程的大部分邮件需要不到 5 秒的时间。使用 $ ps -ef | 在我们的服务器上很容易找到缓慢的 sendmail 进程。grep 发送邮件

几乎所有的电子邮件都可以从这些 Web 节点正常工作。我猜成千上万的邮件可以毫无问题地通过。从命令行发送测试邮件很顺利。sendmail 命令很少被卡住，我们没有办法重现它。

似乎大多数卡住的电子邮件迟早会通过。这些似乎是在几小时后发送邮件，有时是一天后。

我们看到的所有卡住的 sendmail 都是由 Perl 进程运行的命令，Perl 进程是 FCGI 进程的子进程。

查看智能主机的日志，我们看到大部分邮件迟早都会通过，但我们发现有些邮件似乎从未发送过。

这是在 FCGI for Catalyst 中运行的，然后添加到执行一些处理的 IO::Async::Loop 中，并在 IO::Async::Loop 中使用 Email::Sender::Transport::Sendmail 执行open($fh, '|-', @args) 并通过管道传递邮件头+正文并执行 close($fh)。

我看过这个http://perldoc.perl.org/perlipc.html#Avoiding-Pipe-Deadlocks但不知道如何在这种情况下应用它。子 sendmail 只打开了 STDIN。

当我们有这些卡住的 sendmail 之一时，sendmail 正在 STDIN 上等待：

[<ffffffff8119ce8b>] pipe_wait+0x5b/0x80
[<ffffffff8119d8ad>] pipe_read+0x34d/0x4d0
[<ffffffff8119204a>] do_sync_read+0xfa/0x140
[<ffffffff81192945>] vfs_read+0xb5/0x1a0
[<ffffffff81192c91>] sys_read+0x51/0xb0
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

并且异步 perl 进程正在等待孩子死亡：

#0  0x00007f8849e6065e in waitpid () from /lib64/libc.so.6
#1  0x000000000046dc2d in Perl_wait4pid ()
#2  0x000000000046de2d in Perl_my_pclose ()
#3  0x00000000004cec4e in Perl_io_close ()
#4  0x00000000004ceda8 in Perl_do_close ()
#5  0x00000000004c2629 in Perl_pp_close ()
#6  0x00000000004804de in Perl_runops_standard ()
#7  0x000000000042e7ad in perl_run ()
#8  0x000000000041bbc5 in main ()

一个没有通过的例子：Job #1653576（这只是我们的内部工作编号）有一个从 8 月 19 日 13:04 开始的 sendmail 流程。

webnode2上的进程：

fcgi-user 13621 13466  0 13:04 ?        00:00:00 /usr/sbin/sendmail -i -f admin@ourServer.org -- proffunnyhat@mit.edu
I don't see the record I expect to see on our smart host for this in /var/log/maillog that would indicate that it was relayed to nexus and then to MIT.

我确实在 8 月 21 日看到了 proffunnyhat@mit.edu 的成功电子邮件（来自 web2 /var/log/maillog）：

Aug 21 00:00:02 node-008 sendmail[13621]: u7JH4tbr013621: to=proffunnyhat@mit.edu, ctladdr=admin@ourServer.org (10520/10520), delay=1+10:55:07, xdelay=00:00:01, mailer=relay, pri=32292, relay=[127.0.0.1] [127.0.0.1], dsn=2.0.0, stat=Sent (u7L401Z1026237 Message accepted for delivery)
Aug 21 00:00:02 node-008 sendmail[26247]: u7L401Z1026237: to=<proffunnyhat@mit.edu>, delay=00:00:01, xdelay=00:00:00, mailer=relay, pri=122657, relay=mail.ourServer.org. [128.84.4.11], dsn=2.0.0, stat=Sent (u7L402jx001185 Message accepted for delivery)

然后在 mail.ourServer.org 上：

bdc34 @mail.ourServer.org: log$ sudo grep u7L402jx001185 maillog*
maillog-20160821:Aug 21 00:00:02 web2 sendmail[1185]: u7L402jx001185: from=<admin@ourServer.org>, size=2874, class=0, nrcpts=1, msgid=<201608191704.u7JH4tbr013621@mail.ourServer.org>, proto=ESMTP, daemon=MTA, relay=mail.ourServer.org [128.84.4.13]
maillog-20160821:Aug 21 00:00:03 mail.ourServer.org[1200]: u7L402jx001185: to=<proffunnyhat@mit.edu>, ctladdr=<e-admin@ourServer.org> (10519/10519), delay=00:00:01, xdelay=00:00:01, mailer=esmtp, pri=122874, relay=dmz-mailsec-scanner-8.mit.edu. [18.7.68.37], dsn=2.0.0, stat=Sent (OK 5E/2D-20045-34729B75)

一个被卡住但似乎已发送的示例：mail.ourServer.org:/var/log/sendmail:

Aug 19 02:19:51 mail.ourServer.org sendmail[20792]: u7J6JlP6020790: to=<jxjx@connect.ust.hk>, ctladdr=<admin@ourServer.org> (10519/10519), delay=00:00:04, xdelay=00:00:04, mailer=esmtp, pri=122504, relay=connect-ust-hk\
.mai...ction.outlook.com. [213.199.154.87], dsn=2.0.0, stat=Sent (<201608190619.u7J6Jlda000738@web2.ourServer.org> [InternalId=15526306777069,...1MB1197.apcprd01.prod.exchangelabs.com] 9137 bytes in 0.189, 47.082 KB/sec\
 Queued mail for delivery)

我们尝试过的事情

我已经修改了 Email::Sender::Transport::Sendmail 以将 '\x00' 发送到管道，但这不起作用。

我已经用 IO::Async::Loop::Select 替换了 IO::Async::Loop::Poll。那并没有改变什么。

我已经尝试向 sendmail 及其父级发送信号。这杀死了他们，但邮件被中止了。

将我们的 fcgi 用户添加到 sendmail 的受信任用户文件中。没有改变任何东西。

我编写了一个从 STDIN 读取并写入 sendmail 的包装脚本。如果 STDIN 上 5 秒内没有任何输入，则退出。这对我来说真的很hacky，但它似乎确实有效。由于邮件是我们系统的关键部分，我宁愿有一个真正的解决方案。

score 0 · Accepted Answer

ikegami 的评论让我们找到了做双叉的答案。查看在 FCGI 下设置的信号处理程序和文件句柄可以清楚地表明，正在发生非常聪明的事情。所以我开始使用双叉来切断与父进程的所有联系，就像在启动守护进程时一样。那行得通。

  # FCGI does some clever signal handeling and file handers to do
  # its work.  This causes problems with forked processes that then
  # fork and wait for other processes. So use exec to get a process
  # that is not a copy of the parent. Double fork to avoid zombies.

  # Check -e of submit script because $! has a cryptic messsage if it doens't exist
  my $script = "$SAFEBINDIR/submit.pl";
  unless( -e -r -x $script  ){
    $submission->submit_log->error($submission->submission_id . ": $script doesn't exist");
    $c->flash( message => "There was a problem");
    $c->res->redirect( $c->uri_for('/user') );
    $c->detach;
  }

  # Do the double fork + exec to have an independent process
  my $pid;
  unless( $pid = fork() ) { #this is the child
    unless( fork() ){ #this is the grandchild
      exec( $script, $submission->submission_id ) #should never return
        or $submission->submit_log->error($submission->submission_id
                                          . ": Error when trying to exec() $script '$!'");
      exit(0);
    }
  }
  waitpid($pid,0); #wait for child, grandchild will get ppid 1
}

perl - Perl、IO:Async::Loop 和管道到 sendmail 的死锁

1 回答 1

Related

Reference