Forum Discussion
There's a fine line between sending too many alerts and not enough.
When an alert increases in severity over the state it was acknowledged in, we treat the new level as un-acknowledged.
But in this case, the alert has been acknowledged at the critical level. If it drops to error but doesn't clear entirely, then returns to critical, we regard it as the same alert session, so the same critical acknowledgement applies. If the alert clears entirely, future increases to critical are not treated as acknowledged (sketched in code after the list below).
We do this for a few use cases:
- a metric that is oscillating over a threshold (e.g. a disk volume that is at 97%, then 98%, then 97%, then 98%, and so on). You probably do not want fresh escalations each time it bursts back over 98%.
- philosophically, the system treats the acknowledgement of the alert at the critical level as someone saying "I will assume ownership of this issue, up to this severity." In this case, it's the maximum severity (critical), so it amounts to ownership of the issue until it clears.
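Here's a minimal sketch of that behaviour in Python. The `Severity` and `AlertSession` names are purely illustrative (not anything from the product); it only models the session and acknowledgement rules described above.

```python
from enum import IntEnum


class Severity(IntEnum):
    CLEAR = 0
    WARN = 1
    ERROR = 2
    CRITICAL = 3


class AlertSession:
    """Models one alert session: from the first non-clear severity until the alert clears."""

    def __init__(self):
        # Highest severity acknowledged in this session (None = not acknowledged).
        self.acked_severity = None

    def acknowledge(self, severity):
        """Acknowledge the alert at the given severity ("ownership up to this severity")."""
        self.acked_severity = severity

    def should_escalate(self, new_severity):
        """Return True if a change to new_severity should trigger fresh escalation."""
        if new_severity == Severity.CLEAR:
            # The alert cleared entirely: the session ends and the ack is discarded,
            # so a later return to critical starts a new, unacknowledged session.
            self.acked_severity = None
            return False
        if self.acked_severity is None:
            return True  # nothing acknowledged yet in this session
        # Within the same session the ack covers everything up to the acked level,
        # so oscillating below and back up to that level does not re-escalate.
        return new_severity > self.acked_severity


# The oscillating-disk case: acknowledged at critical, bouncing between error and critical.
session = AlertSession()
print(session.should_escalate(Severity.CRITICAL))  # True  (first time, not acked)
session.acknowledge(Severity.CRITICAL)
print(session.should_escalate(Severity.ERROR))     # False (covered by the ack)
print(session.should_escalate(Severity.CRITICAL))  # False (same session, same ack)
print(session.should_escalate(Severity.CLEAR))     # False (session ends, ack discarded)
print(session.should_escalate(Severity.CRITICAL))  # True  (new session, escalates again)
```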
If you want to prevent escalation of the criticals but are unable to clear the issue, I'd suggest not acknowledging the alert immediately. Instead, put the instance in scheduled downtime for 1 hour (which will stop escalation), work on the issue, and bring it down to error (or warn). If you are unable to improve beyond that, acknowledge that state (or adjust thresholds). Then a future increase to critical will be escalated.
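Continuing the sketch above, that workflow plays out like this: because the acknowledgement only covers severities up to the level it was made at, acknowledging at error leaves a later jump back to critical free to escalate.

```python
# Continuing the AlertSession sketch above. The downtime step isn't modelled here;
# it simply suppresses escalation while you work the issue down to error.
session = AlertSession()
session.should_escalate(Severity.CRITICAL)         # goes critical; use downtime instead of acking
session.should_escalate(Severity.ERROR)            # you improve it to error...
session.acknowledge(Severity.ERROR)                # ...and acknowledge that state
print(session.should_escalate(Severity.CRITICAL))  # True: a future rise to critical escalates again
```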