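I have a backup script that runs every 2 hours. I want to use CloudWatch to track successful executions of this script, and CloudWatch alarms to get notified when the script runs into problems.
After each successful backup, the script puts a data point on a CloudWatch metric: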
mon-put-data --namespace Backup --metric-name $metric --unit Count --value 1
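I have an alarm that goes into the ALARM state whenever the "Sum" statistic on the metric is less than 2 over a 6-hour period.
To test this setup, after a day I stopped putting data into the metric (i.e., I commented out the mon-put-data command). So far so good: the alarm eventually went into the ALARM state and I received an email notification, as expected.
The problem is that, some time later, the alarm went back to the OK state, even though no new data had been added to the metric!
Both transitions (OK => ALARM, then ALARM => OK) were logged, and I reproduce the logs in this question. Note that while both show "period": 21600 (i.e., 6 hours), the second one shows a 12-hour time span between startDate and queryDate. I can see that this might explain the transition, but I don't understand why CloudWatch would consider a 12-hour time span to compute a 6-hour statistic!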
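To check my reading of the logs, here is a rough sketch in Python. The backup times (every 2 hours on the hour, the last one at 10:00) are assumed, since the real timestamps are not in the logs, and window_sum is just a helper I made up. I don't know how CloudWatch actually picks its windows; this only shows that the numbers are consistent with the older half of the 12-hour span being the one that was summed.

```python
from datetime import datetime, timedelta

# Hypothetical backup times: one data point (value 1) per successful backup,
# every 2 hours, with the last backup at 10:00 on 2013-03-06.
backups = [datetime(2013, 3, 6, h, 0) for h in (2, 4, 6, 8, 10)]


def window_sum(start, end):
    """Sum of data points whose timestamp t satisfies start <= t < end."""
    return sum(1 for t in backups if start <= t < end)


period = timedelta(seconds=21600)  # 6 hours, as in the alarm history

# First transition (OK => ALARM): startDate 09:12, queryDate 15:12.
first_start = datetime(2013, 3, 6, 9, 12)
alarm_sum = window_sum(first_start, first_start + period)

# Second transition (ALARM => OK): startDate 05:46, queryDate 17:46.
second_start = datetime(2013, 3, 6, 5, 46)
older_sum = window_sum(second_start, second_start + period)               # 05:46 to 11:46
newer_sum = window_sum(second_start + period, second_start + 2 * period)  # 11:46 to 17:46

print(alarm_sum, older_sum, newer_sum)  # prints: 1 3 0
```

Under these assumed times, the 6-hour period starting at the second startDate sums to 3, which matches the recentDatapoints value in the ALARM => OK transition, while the most recent 6 hours contain no data at all.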
What am I missing here? How do I configure the alarm to achieve what I want (i.e., get notified if backups are not being made)?
{
    "Timestamp": "2013-03-06T15:12:01.069Z",
    "HistoryItemType": "StateUpdate",
    "AlarmName": "alarm-backup-svn",
    "HistoryData": {
        "version": "1.0",
        "oldState": {
            "stateValue": "OK",
            "stateReason": "Threshold Crossed: 1 datapoint (3.0) was not less than the threshold (3.0).",
            "stateReasonData": {
                "version": "1.0",
                "queryDate": "2013-03-05T21:12:44.081+0000",
                "startDate": "2013-03-05T15:12:00.000+0000",
                "statistic": "Sum",
                "period": 21600,
                "recentDatapoints": [
                    3
                ],
                "threshold": 3
            }
        },
        "newState": {
            "stateValue": "ALARM",
            "stateReason": "Threshold Crossed: 1 datapoint (1.0) was less than the threshold (2.0).",
            "stateReasonData": {
                "version": "1.0",
                "queryDate": "2013-03-06T15:12:01.052+0000",
                "startDate": "2013-03-06T09:12:00.000+0000",
                "statistic": "Sum",
                "period": 21600,
                "recentDatapoints": [
                    1
                ],
                "threshold": 2
            }
        }
    },
    "HistorySummary": "Alarm updated from OK to ALARM"
}
The second one, which I cannot make sense of at all:
{
    "Timestamp": "2013-03-06T17:46:01.063Z",
    "HistoryItemType": "StateUpdate",
    "AlarmName": "alarm-backup-svn",
    "HistoryData": {
        "version": "1.0",
        "oldState": {
            "stateValue": "ALARM",
            "stateReason": "Threshold Crossed: 1 datapoint (1.0) was less than the threshold (2.0).",
            "stateReasonData": {
                "version": "1.0",
                "queryDate": "2013-03-06T15:12:01.052+0000",
                "startDate": "2013-03-06T09:12:00.000+0000",
                "statistic": "Sum",
                "period": 21600,
                "recentDatapoints": [
                    1
                ],
                "threshold": 2
            }
        },
        "newState": {
            "stateValue": "OK",
            "stateReason": "Threshold Crossed: 1 datapoint (3.0) was not less than the threshold (2.0).",
            "stateReasonData": {
                "version": "1.0",
                "queryDate": "2013-03-06T17:46:01.041+0000",
                "startDate": "2013-03-06T05:46:00.000+0000",
                "statistic": "Sum",
                "period": 21600,
                "recentDatapoints": [
                    3
                ],
                "threshold": 2
            }
        }
    },
    "HistorySummary": "Alarm updated from ALARM to OK"
}