Forum Discussion

grantae
4 years ago

Rate Limit Throttle Message Disable

"If the number of alerts delivered to the chain’s initial stage exceeds the rate limit, then a throttle message is sent to the individuals assigned to that stage. The message states that the number of alerts has exceeded the throttling level. From this point forward, alerts will be escalated to subsequent stages in accordance with your chain’s configuration. Throttle messages, however, will not be escalated and will continue to be sent to the first stage."

https://www.logicmonitor.com/academy/alert-rate-limiting

We have it set up so that an email is sent to our ticketing system to open a ticket for Critical alerts. It turns out that the throttle message opens a ticket too. Is there a way to disable the throttle message?

  • Anonymous

    Ah, that's it. Thanks @Michael Rodrigues. So, the host gets marked as dead when the heartbeat contains an epoch that is more than 6 minutes old. When it's marked dead, that's when other alerts from the device are suppressed.

    From the idleInterval datapoint description:

    "The interval in seconds we do not get data from the host. NOTE: there is server side logic that declares a host DOWN after 6 minutes, suppressing other alerts. We do not recommend you change this alert."

    Does that get you a path moving forward?
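
To make the 6-minute rule described above concrete, here is a minimal Python sketch of the check. It is only an illustration of the stated behavior, not LogicMonitor's backend code; the function name `is_host_dead` and the constant `HOST_DEAD_AFTER_SECONDS` are hypothetical.

```python
import time

# Assumption: the 6-minute threshold quoted in the idleInterval description.
HOST_DEAD_AFTER_SECONDS = 6 * 60


def is_host_dead(last_heartbeat_epoch, now_epoch=None):
    """Return True when the last heartbeat is more than 6 minutes old.

    Both arguments are Unix epoch timestamps in seconds. This is a
    hypothetical helper, not part of any LogicMonitor API.
    """
    if now_epoch is None:
        now_epoch = time.time()
    idle_seconds = now_epoch - last_heartbeat_epoch
    return idle_seconds > HOST_DEAD_AFTER_SECONDS
```

For example, `is_host_dead(0, now_epoch=361)` is true, while `is_host_dead(0, now_epoch=360)` is not, because the heartbeat has to be more than 6 minutes old.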

  • I was actually just reading that warning in LM. It sounds like I should leave the Host Status alert at 6 minutes, which means I should focus on editing the other alerts to wait at least 6 minutes before triggering a ticket.

    20 minutes ago, Michael Rodrigues said:

    You can't change the Host Dead time, it's hardcoded backend logic. You can change the Host Status alert but I don't recommend it, as it doesn't change the back end logic for host death and suppression.

    The only magic is the heartbeat interval on the HostStatus DS. It doesn't get reset by every collection type, for example SCRIPT modules will not reset it. Ping will. The backend looks at this value to determine host death and suppression.

    Good to know that changing the idleInterval threshold in Host Status won't actually mark the device dead sooner, since other backend logic is at work. I will be sure to leave it alone and will look into editing the other alerts with this in mind.

    Is Host Status the best metric to base down tickets on, combined with ping alerts that wait about 4 intervals (ping is set to 2 minutes by default), so that about 8 minutes pass before declaring a degrade and opening a degrade ticket? Is there a better metric to use for down and degraded site tickets?

  • Anonymous
    16 minutes ago, grantae said:

    Is Host Status the best metric to base down tickets on, combined with ping alerts that wait about 4 intervals (ping is set to 2 minutes by default), so that about 8 minutes pass before declaring a degrade and opening a degrade ticket? Is there a better metric to use for down and degraded site tickets?

    That's a subjective question; it depends on how you define "down". I personally like to measure "site down" using IP SLA, but there are many ways to skin that cat.

  • Wait... how did 61 of my 109 tickets end up being Throttle alerts if my rate limit is set to 5 alerts per 10 minutes? There shouldn't be a way to get more Throttle alerts than "real" tickets.

  • Anonymous

    When throttling is happening, the alerts aren't just queued up; they are not sent at all. So, if you got 48 normal tickets and 61 throttle tickets, the 48 tickets were sent while you were not throttled, and many more alerts fired while you were throttled and were never delivered.
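
The "dropped, not queued" behavior described above can be sketched with a small Python simulation. This is a simplified model under stated assumptions: a fixed window that opens at the first alert, with `max_alerts=5` and `window_seconds=600` mirroring the 5-alerts-per-10-minutes setting from this thread. The function name `apply_rate_limit` is hypothetical, and the sketch does not model how often LogicMonitor emits throttle messages, which the thread does not specify.

```python
def apply_rate_limit(alert_times, max_alerts=5, window_seconds=600):
    """Split alert timestamps (in seconds) into delivered and dropped lists.

    Assumption: a simple fixed window that starts at the first alert seen
    after the previous window expired. Alerts over the limit are dropped
    for good; they are never delivered later.
    """
    delivered, dropped = [], []
    window_start = None
    sent_in_window = 0
    for t in sorted(alert_times):
        # Open a new window when none is active or the current one expired.
        if window_start is None or t - window_start >= window_seconds:
            window_start = t
            sent_in_window = 0
        if sent_in_window < max_alerts:
            delivered.append(t)
            sent_in_window += 1
        else:
            dropped.append(t)
    return delivered, dropped


# A burst of 60 alerts, one every 10 seconds (for example during a scan).
burst = list(range(0, 600, 10))
delivered, dropped = apply_rate_limit(burst)
print(len(delivered), len(dropped))  # prints: 5 55
```

In this model only the first 5 alerts of the burst become tickets; the other 55 never reach the ticketing system, which is why the count of "real" tickets can be far lower than the number of alerts that actually fired.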

  • Oh ok, that makes more sense. It "cancels" the alerts instead of delaying them.

  • Hmmm... I guess that is true. For our branches we use Palo Altos as the first reachable device at a branch site; however, I believe our major circuits in the core are between Cisco devices. It might be good to make branch down tickets for the sites with Host Status (and use alert intervals for things like ping and over-utilization for these sites) but maybe a different metric for core/distro devices. Hmmm....

    I'll play with ideas on how to organize/configure the alerts better so I don't have a crazy flood of alerts that end up needing throttling. (Got like 109 tickets from yesterday, I think from a security scan. 61 of the tickets were the Throttle alerts!)