Forum Discussion

mnagel
Professor
6 years ago

dependencies, again

We continue to do battle with LM when alerts trigger due to dependent resource outages. I know the topology mapping team is working on alert suppression, but I am not convinced that will solve every problem, however well they succeed. We really need a way to set up dependencies within logic modules, and it should not need dozens of lines of API code each time (most of which should be made available as a library function, IMO).

One fresh, specific example: a site with multiple firewalls in a VPN mesh running BGP. One firewall goes down, then all the other firewalls report BGP is down. We care about BGP down, so we have alerts trigger escalation chains. It should be possible to define a dependency in the datapoint that suppresses the alert if the remote peer IP is in a down state. There is no way to express this in LM right now, and that leads to many alerts in batches, which leads to numb customers who ignore ALL alerts.
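To make the request concrete, here is a minimal Python sketch of the rule being asked for: drop a BGP-down alert when the remote peer itself is known to be down. This is not LM functionality or LM API code; `BgpAlert`, `alerts_to_notify`, and the device names and IPs are all made up for illustration, and the set of down IPs is assumed to come from some existing host-status check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BgpAlert:
    reporter: str  # hypothetical: firewall that raised the BGP-down alert
    peer_ip: str   # remote BGP peer whose session dropped


def alerts_to_notify(alerts, down_ips):
    """Keep only alerts whose remote peer is not itself known to be down."""
    return [a for a in alerts if a.peer_ip not in down_ips]


# Assumed scenario: fw1 (10.0.0.1) has failed, so every other firewall
# in the mesh loses its BGP session to it.
down_ips = {"10.0.0.1"}
alerts = [
    BgpAlert("fw2", "10.0.0.1"),  # consequence of fw1 being down
    BgpAlert("fw3", "10.0.0.1"),  # consequence of fw1 being down
    BgpAlert("fw2", "10.0.0.3"),  # fw3 is up, so this is a real BGP problem
]

for alert in alerts_to_notify(alerts, down_ips):
    print(f"notify: {alert.reporter} lost BGP peer {alert.peer_ip}")
```

With a rule like this, only the session to a peer that is still up would reach the escalation chain; the two alerts caused by the fw1 outage would be held back.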

  • We are currently in beta for an alert notification suppression feature that uses topology relationships to map out dependent relationships. We are actively working to extend topology coverage to additional devices and technologies, and to expand the triggering of dependency evaluation beyond Ping-PingLossPercent and HostStatus-idleInterval.

    For the example you provide, if BGP was supported by topology, the dependency setup would involve selecting a root or entry-point device, such as the collector's server or the remote peer, on which other resources are dependent. Since Ping and HostStatus would not be sufficient, we would add a definition of which datapoints are used to determine a down state for given devices, so that a single dependency-evaluation definition could be applied while still varying by device type. This would allow a down resource to trigger dependency evaluation for its parent and child resources in order to determine root cause and add contextual metadata to alerts, so that notifications are suppressed based on the role the alerting resource plays in the dependency incident (e.g., dependent resource alerts would be suppressed, while originating/root-cause resource alerts would be routed for notification).

  • The key here is "if BGP was supported...". What if it is not? Do you think it would be, given this specific case? I think it could be (i.e., peering topology identified), but to the extent it is not (or anything else is not), I think we need a way to reflect the dependency without serious programming effort to avoid alarm storms. I guess we have something to chat about next time we meet :)

  • While topology coverage expansion is a high priority, in the next year we also have planned initiatives intended to reduce alert noise using sources other than topology. One example is alert grouping, where we group related alerts to avoid alert storms; this grouping does not rely on topology.

    For topology coverage expansion, the LM Exchange may help as well, as a place to develop new modules that can be shared with the community.

    But yes, definitely have more to chat about on our next call!
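
The role-based suppression described in the first reply (dependent alerts suppressed, root-cause alerts routed) can be illustrated with a small Python sketch. It is not the product's actual algorithm: the `classify_down_resources` function, the `depends_on` map, and the resource names are hypothetical, and the sketch simply treats a down resource as "dependent" when anything upstream of it is also down.

```python
def classify_down_resources(down, depends_on):
    """Label each down resource as 'root_cause' or 'dependent'.

    down:       set of resource names currently considered down
    depends_on: dict mapping a resource to the resources it depends on
                (its parents, walking toward the entry-point device)
    """
    roles = {}
    for resource in down:
        seen = set()
        stack = list(depends_on.get(resource, []))
        has_down_ancestor = False
        while stack:
            parent = stack.pop()
            if parent in seen:
                continue  # guard against loops in the topology
            seen.add(parent)
            if parent in down:
                has_down_ancestor = True
                break
            stack.extend(depends_on.get(parent, []))
        roles[resource] = "dependent" if has_down_ancestor else "root_cause"
    return roles


# Assumed topology: collector -> core-router -> fw-a, fw-b
depends_on = {
    "core-router": ["collector"],
    "fw-a": ["core-router"],
    "fw-b": ["core-router"],
}
down = {"core-router", "fw-a", "fw-b"}

roles = classify_down_resources(down, depends_on)
for name in sorted(roles):
    action = "route notification" if roles[name] == "root_cause" else "suppress"
    print(f"{name}: {roles[name]} -> {action}")
```

In this assumed topology, only core-router would be routed for notification; fw-a and fw-b sit behind it and would be suppressed. A full mesh, where every device depends on every other, has no single entry point, which is part of why the choice of root device matters.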