Handling Alert Storms?
Good morning,
Just wanted to put this out there to see how everyone else is handling these types of situations.
Scenario: Switch goes offline.
Alerts that generate:
BGP Alerts x5
OSPF Neighbor Alerts x5
Interface alerts x10
Uptime alert when it comes back online
Maybe a syslog event
Problem:
How does your team know that all of these alerts are related and should be bundled under the switch-offline alert as the parent ticket?
Maybe it's a portal issue on our end, but the OSPF neighbors don't even topology-map to each other when I click on them. Same with BGP.
You can't apply dependent alert mapping to instance alerts, only to resources, which is useless here.
I could maybe configure cluster alerts to trigger when more than 1 or 2 alerts of the same type fire, but a cluster alert can't be scoped to a hand-picked set of instances.
Services don't really cut it either, as they don't suppress the individual alerts. So at this point, is my best bet to just document which alerts fire when the switch goes down and attach that to the switch's offline alert? Or should I look into better topology mapping so the relationships are clearer?
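
To make the "document what alerts" option concrete, here is a rough Python sketch of the grouping I have in mind. It is not any product's API: the `Alert` and `Ticket` classes, the `RELATED` map, the device and instance names, and the 10-minute window are all made up for illustration. The idea is that an alert gets bundled under a switch's offline alert if it fires close in time and is either on the switch itself or listed in a hand-maintained map of instances on other devices.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Alert:
    resource: str   # device the alert fired on
    instance: str   # e.g. a BGP peer, OSPF neighbor, or interface name
    kind: str       # "offline", "bgp", "ospf", "interface", "uptime", ...
    at: datetime    # when the alert fired


@dataclass
class Ticket:
    parent: Alert                                  # the switch-offline alert
    children: list = field(default_factory=list)   # alerts bundled under it


# Hypothetical, hand-maintained documentation: for each switch, the
# (resource, instance) pairs on OTHER devices expected to alert when it drops.
RELATED = {
    "sw1": {("rtr1", "bgp-peer-sw1"), ("rtr1", "ospf-nbr-sw1")},
}


def bundle(alerts, related=RELATED, window=timedelta(minutes=10)):
    """Group alerts under the offline alert they most plausibly belong to.

    Returns (tickets, unmatched). The window size is an assumption, not a
    recommendation.
    """
    tickets = [Ticket(a) for a in alerts if a.kind == "offline"]
    unmatched = []
    for alert in sorted(alerts, key=lambda a: a.at):
        if alert.kind == "offline":
            continue  # already a parent
        for ticket in tickets:
            switch = ticket.parent.resource
            same_device = alert.resource == switch
            documented = (alert.resource, alert.instance) in related.get(switch, set())
            in_window = abs(alert.at - ticket.parent.at) <= window
            if in_window and (same_device or documented):
                ticket.children.append(alert)
                break
        else:
            # No parent matched, so this stays a standalone alert.
            unmatched.append(alert)
    return tickets, unmatched
```

The obvious weakness is that the `RELATED` map has to be kept up to date by hand, and an uptime alert that fires long after the outage falls outside the window, which is partly why I'm asking whether better topology mapping is the saner route.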