Forum Discussion

Rick_Bath
7 years ago

How to validate your Alert Rule & Escalation Chain

You have probably set up your Alert Rules & Escalation Chains and hoped that they were configured correctly. But what if they were not, and an alert goes to the wrong group or, even worse, is never sent at all? The worst case is not receiving an alert when a device is down, or when a disk is filling up because a teammate left logging in verbose mode after troubleshooting.

In this article, you will be guided through setting up an effective Alert Rule & Escalation Chain. In addition, we will show you how to trigger a live alert without any impact on the system in question. Before diving into the steps, here is the difference between Alert Rules and Escalation Chains.

Alert Rules determine which Escalation Chain is used when an alert on a device reaches the defined severity level. You can also restrict an Alert Rule so that it uses an Escalation Chain only when a particular datapoint triggers the alert.

Escalation Chains set the delivery method for alerts. Alerts can be delivered via email, SMS, ticketing systems, custom HTTP integrations, etc. You can also set your Escalation Chain to route alerts to different groups of people at different times or on different days. This is useful when different sets of standby engineers cover a 24x7 operation.

 

Alert Rules & Escalation Chains are very powerful if used correctly.

To begin, we will first create an Escalation Chain. For this example, I will create one for Windows devices. We recommend enabling rate limiting, as you will not want to receive a flood of alerts. Rate limiting caps the maximum number of alerts delivered within the defined time period.
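To make the idea of a rate limit concrete, here is a minimal sketch in Python. It is not how the product implements it; the class name, the parameters and the choice of a sliding window are all assumptions made for illustration.

```python
from collections import deque


class AlertRateLimiter:
    """Hypothetical limiter: at most max_alerts deliveries per period_seconds."""

    def __init__(self, max_alerts, period_seconds):
        self.max_alerts = max_alerts
        self.period_seconds = period_seconds
        self._sent = deque()  # timestamps of alerts that were delivered

    def allow(self, now):
        # Forget deliveries that have fallen out of the sliding window.
        while self._sent and now - self._sent[0] >= self.period_seconds:
            self._sent.popleft()
        if len(self._sent) < self.max_alerts:
            self._sent.append(now)
            return True
        return False  # over the limit: this alert would be suppressed


limiter = AlertRateLimiter(max_alerts=2, period_seconds=60)
decisions = [limiter.allow(t) for t in (0, 10, 20, 60, 61)]
```

With a limit of 2 alerts per 60 seconds, the third alert (at 20 seconds) is held back, and delivery resumes once the oldest delivery leaves the window.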

If you are wondering, I created 3 stages for different delivery methods (email, HipChat & voice).

The time it takes to move from one stage to the next is defined by the Escalation Interval of the Alert Rule.
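As a rough way to reason about which stage an alert has reached, the sketch below maps the time since the alert started to a stage. The function name and the 15-minute interval are made up for the example, and staying on the last stage once the chain is exhausted is an assumption, so check how your own chain behaves.

```python
def current_stage(elapsed_minutes, escalation_interval_minutes, stages):
    """Return the stage an unacknowledged alert would be at (illustrative only)."""
    if escalation_interval_minutes <= 0:
        # Treat a zero or negative interval as "no escalation" in this sketch.
        return stages[0]
    index = int(elapsed_minutes // escalation_interval_minutes)
    # Assumption: once the last stage is reached, the alert stays there.
    return stages[min(index, len(stages) - 1)]


STAGES = ["email", "hipchat", "voice"]  # the three stages from the example chain
timeline = [current_stage(t, 15, STAGES) for t in (0, 14, 15, 30, 90)]
```

With a 15-minute Escalation Interval, the alert goes to email first, moves to HipChat at 15 minutes and to voice at 30 minutes.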

 

Screen Shot 2017-07-12 at 11.59.19 AM.png

This is an optional section where we can route alerts to different people depending on the time and day. It is quite simple: just select the days and times for the respective stages.
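If it helps to picture time-based routing, here is a small sketch of the lookup. The schedule entries, the recipient names and the fallback are hypothetical, and windows that cross midnight are not handled.

```python
from datetime import datetime, time

# Hypothetical schedule: (weekdays, start, end, recipient). Monday is 0.
SCHEDULE = [
    ({0, 1, 2, 3, 4}, time(9, 0), time(18, 0), "day-shift"),
    ({0, 1, 2, 3, 4}, time(18, 0), time(23, 59), "evening-shift"),
]


def recipient_for(when, schedule, default):
    """Pick the first schedule entry whose days and time window cover `when`."""
    for days, start, end, recipient in schedule:
        if when.weekday() in days and start <= when.time() < end:
            return recipient
    return default  # nothing matched, fall back to a catch-all recipient


weekday_morning = recipient_for(datetime(2017, 7, 12, 11, 59), SCHEDULE, "on-call")
weekend = recipient_for(datetime(2017, 7, 15, 11, 0), SCHEDULE, "on-call")
```

A Wednesday late-morning alert lands with the day shift, while a Saturday alert falls through to the default recipient.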


 

Screen Shot 2017-07-12 at 11.59.36 AM.png

 

The section below, on creating Alert Rules, requires good planning. Alert Rules are evaluated in order of priority, starting from the lowest number and moving to the highest. Order them from the most specific rule to the rule with the most wildcards.

A common use case is:

  1. Create an Alert Rule to send interface-related alerts to the network team

  2. Create an Alert Rule to send hardware or performance alerts to the sysadmin team

  3. Create an Alert Rule to send Exchange alerts to the messaging team

  4. Create an Alert Rule to send all other alerts to the sysadmin team
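The ordering can be sketched as a first-match lookup. This is only a model of the behaviour described above: the rule fields, the priority numbers, the datasource names and the chain names are invented for the example, and Python's fnmatch stands in for the product's own wildcard matching.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass
class AlertRule:
    priority: int      # lower numbers are checked first
    datasource: str    # wildcard pattern (hypothetical field)
    chain: str         # Escalation Chain to use (hypothetical names)


def pick_chain(rules, datasource):
    """Return the chain of the first rule, by ascending priority, that matches."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if fnmatchcase(datasource, rule.datasource):
            return rule.chain
    return None  # no rule matched, so nothing would be routed


RULES = [
    AlertRule(30, "*", "sysadmin-chain"),            # catch-all goes last
    AlertRule(10, "Interface*", "network-chain"),
    AlertRule(20, "Exchange*", "messaging-chain"),
]

interface_chain = pick_chain(RULES, "InterfaceStatus")
exchange_chain = pick_chain(RULES, "ExchangeMailflow")
other_chain = pick_chain(RULES, "WinCPU")
```

If the catch-all rule were given priority 1 instead of 30, it would match first and every alert would go to the sysadmin chain, which is why the rule with the most wildcards belongs at the end.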


 

Screen Shot 2017-07-12 at 12.07.37 PM.png

 

Another essential part to focus on is the Group that the Alert Rule is applied to. We get asked about this countless times. It's an easy fix, but you need to know what to fix. If you set it to *, the rule applies to all groups, which is great. However, we usually can't apply the same Alert Rule to all devices. We may need different Alert Rules for different types of devices (e.g. servers, switches, routers, WAN links, etc.).

Let's say you have a router “wan01” which resides in the group “Infrastructure -> Critical -> Networking -> Routers -> WAN”. If you apply the Alert Rule to “Infrastructure/Critical/”, your device will not pick up this Alert Rule, because it resides in a subgroup further down the tree. The fix is simple: apply the Alert Rule to “Infrastructure/Critical/*”. This will apply it to all subgroups under Critical.
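The difference between the two patterns is easy to see with any wildcard matcher. The sketch below uses Python's fnmatch as an approximation; the product's own matching may differ in edge cases, so treat this as an illustration of the trailing * only.

```python
from fnmatch import fnmatchcase

# Full group path of the router "wan01" from the example above.
DEVICE_GROUP = "Infrastructure/Critical/Networking/Routers/WAN"


def rule_applies(group_pattern, device_group=DEVICE_GROUP):
    """True if the Alert Rule's group pattern covers the device's group path."""
    return fnmatchcase(device_group, group_pattern)


without_wildcard = rule_applies("Infrastructure/Critical/")
with_wildcard = rule_applies("Infrastructure/Critical/*")
everything = rule_applies("*")
```

Without the trailing *, the pattern has to equal the full group path, so the device deep in the subtree is missed. With the *, everything under Critical is covered.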

Now, once you have set that up, I'm sure you would like to verify that the Alert Rule is picked up by the datasource or instance in question.

 

To do so, navigate to the datasource or instance in question. Click on the cog button and it will show you the Alert Rule, Escalation Chain and delivery method for each stage. This is how you can determine whether your Alert Rule and Escalation Chain have been picked up.

 

Screen Shot 2017-07-13 at 3.14.20 PM.png

The next thing is to validate the delivery of an alert. Yes, we could click on “Send Test Alert”, but I’m sure we would prefer an actual alert to see how it works end to end. My favourite datasource for this is the Ping datasource with the PingLossPercent datapoint. To trigger an alert, we can change its threshold to “>=0”. This sends an alert whenever the ping loss is greater than or equal to 0, which is always the case, so the alert fires even though the device is healthy.

Doing so is quite easy too. Click on the pencil icon on the PingLossPercent line. Click on the + sign, as this will create an instance-level threshold. Then set the value to 0 for critical. You should receive the alerts quite soon after.
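To see why this threshold fires on a perfectly healthy device, here is a small sketch of a threshold check. The string format and the function are illustrative only and are not the product's threshold parser.

```python
import operator
import re

_OPERATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    "=": operator.eq,
}
# An operator followed by one number, e.g. ">=0" or "> 10".
_PATTERN = re.compile(r"\s*(>=|<=|!=|>|<|=)\s*(-?\d+(?:\.\d+)?)\s*")


def breaches(value, threshold):
    """Return True if `value` satisfies a simple '<operator><number>' threshold."""
    match = _PATTERN.fullmatch(threshold)
    if match is None:
        raise ValueError(f"unsupported threshold: {threshold!r}")
    op_text, limit = match.groups()
    return _OPERATORS[op_text](value, float(limit))


healthy_with_test_threshold = breaches(0.0, ">=0")   # 0% loss still matches
healthy_with_strict_threshold = breaches(0.0, "> 0")  # a stricter threshold stays quiet
```

A ping loss of 0% satisfies “>=0”, so the test threshold alerts immediately, whereas a threshold of “> 0” would only alert once packets are actually lost.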

 

Screen Shot 2017-07-13 at 4.22.46 PM.png

Once you have received the alerts and verified that it's all working, remember to remove the instance-level threshold, as you don't want to get flooded with alerts. I hope this article has given you enough information on how to set up, test and trigger alerts.