monit - Monit - How to identify a program crash rather than a restart

Question

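I am using monit to monitor my program. The monitored program can crash in 2 situations:

The program crashes randomly. It just needs to be restarted.
It gets into a bad state and crashes on every subsequent start.

To handle the latter case, I have a script that stops the program, resets it to a good state by cleaning its data files, and restarts it. I tried the following configuration: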

check process program with pidfile program.pid
start program = "programStart" as uid username and gid groupname
stop program = "programStop" as uid username and gid groupname
if 3 restarts within 20 cycles then exec "cleanProgramAndRestart" as uid username and gid groupname
if 6 restarts within 20 cycles then timeout

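Say monit restarts the program 3 times within 3 cycles. After the third restart, the cleanProgramAndRestart script runs. But because cleanProgramAndRestart restarts the program again, the "3 restarts" condition is met again in the next cycle, and the whole thing becomes an infinite loop.

Can anyone suggest a way to fix this?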

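A workaround might be possible if any of the following could be done:

If there were a "crash" keyword instead of "restarts", I could run the clean script after the program crashes 3 times rather than after it is restarted 3 times.
If there were a way to reset the "restarts" counter after the exec script has run.
If there were a way to execute something only when the result of the "3 restarts" condition changes.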

score 2 · Accepted Answer

Monit polls your "tests" once every cycle. The cycle length is usually defined in /etc/monitrc, in the line set daemon cycle_length
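
For illustration, the poll interval line in the control file might look like the sketch below. The 30-second value is an assumed example, not something taken from the question, and the file location can differ between installations.

```monit
# /etc/monitrc (location varies by installation)
# Poll all checks every 30 seconds, so one "cycle" lasts 30 seconds.
# The value 30 is an assumption for illustration only.
set daemon 30
```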

So if your cleanProgramAndRestart script took less than one cycle to run, this should not happen. Since it is happening, I guess cleanProgramAndRestart takes more than one cycle to run.

You can:

Increase the cycle length in your monit configuration
Check your program only every x cycles (make sure that cycle_length*x > cleanProgramAndRestart_length)
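
A minimal sketch of the second option, assuming a 30-second cycle and a cleanup script that needs up to about 90 seconds (both numbers are made up for illustration). Checking every 4 cycles leaves 30 s × 4 = 120 s between checks, which is longer than the assumed cleanup time.

```monit
# Assumed cycle length of 30 seconds (illustrative value).
set daemon 30

check process program with pidfile program.pid
  # Evaluate this check only on every 4th cycle: 30 s * 4 = 120 s,
  # which exceeds the assumed ~90 s run time of cleanProgramAndRestart.
  every 4 cycles
  start program = "programStart" as uid username and gid groupname
  stop program = "programStop" as uid username and gid groupname
  if 3 restarts within 20 cycles then exec "cleanProgramAndRestart" as uid username and gid groupname
  if 6 restarts within 20 cycles then timeout
```

How the "within 20 cycles" window interacts with skipped cycles is worth verifying against your monit version before relying on these thresholds.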

If you cannot modify these variables, there is a possible workaround using a temporary file:

check process program 
  with pidfile program.pid
  start program = "programStart" 
    as uid username and gid groupname
  stop program = "programStop" 
    as uid username and gid groupname
  if 3 restarts within 20 cycles 
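  then exec "/usr/bin/touch /tmp/program_crash"  # must be the same path as in the "check file" below; monit generally expects an absolute path to the executable, so adjust /usr/bin/touch for your system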
  if 6 restarts within 20 cycles then timeout

check file program_crash with path /tmp/program_crash every x cycles #(make sure that cycle_length*x > cleanProgramAndRestart_length)
  if changed timestamp then exec "cleanProgramAndRestart"
    as uid username and gid groupname
