Forum Discussion

grantae's avatar
4 years ago

Alerts for down and degraded sites

Currently we have LM setup to email Cherwell (ticketing system) to open a ticket based on ping. Our ping rule is 20 / 70 / 90. 

Downside is if the site is going up and down and the pings just drop for a short amount of time, the alert clears and comes back resulting in multiple tickets for the same issue or a ticket for an "outage" that cleared right away. Our Cherwell setup is not smart enough to not make multiple tickets with the same subject, so the issue cannot be fixed on the Cherwell side. 

I was thinking to set the down branch tickets to open based on Host Status instead. I see this alert when branches actually go down and seemed like a good metric to use.

I would also like a degraded ticket if the ping is bad over a longer period of time. Currently I'm not really sure what the ping period is for calculating the ping percent. Is it out of every 5 pings or 100 pings or 5mins worth? It would be good to have a ticket for if the ping is over 100ms for 5min and/or if the pings drop 10% of a 5min period. (I'm guessing on the specifics for these, so open to suggestion on better degrade metrics.)

I didn't originally setup the Alert Rules and Escalation Chains in LM so there seem to be a lot of redundant and oddly setup rules and chains since we are still pretty new to LM and still learning. Trying to clean it up and improve the alerts without breaking the current alerting.

I started with trying to make a Alert Rule for down branches with Host Status. Originally the down branch rule based on ping was the top rule (5) so I copied the setup of that rule and made this rule at 4. The only difference is Instance = HostStatus* (was Ping*) and Datapoint = idleInterval* (was PingLossPercent*). However, the emails in the ticket still have the ping info in them, so I'm not sure if they are actually hitting and using the rule as intended. 

  • Anonymous's avatar
    Anonymous

    What you're looking for is alert trigger interval (and alert clear interval). Those aren't settings on the alert rule, but actually on the alert itself. Today, this is also a global setting, so use with care. These intervals specify how many additional poll cycles are required to be in violation (or in a cleared state) before the alert is opened (or cleared). 

    Since they are global settings, you have a few options on how to implement (it can be done non-globally, but you should approach it with your eyes open). Take a look and reply back if that's what you're looking for.

  • The alert trigger intervals does help. I'm looking into better ways to organize what alerts to actually use for tickets and how to set them up. Might take a bit to get back to this thread but some questions about this are coming up in the thread Rate Limit Throttle Message Disable. (Just an FYI for anyone searching for this type of info, since IDK if it would come up in a search based on the title.)

    /topic/7056-rate-limit-throttle-message-disable/