Forum Discussion

Vitor_Santos
5 years ago

Cisco EIGRP Peer alarm(s) not being suppressed?

Hello,

We've noticed that the Cisco EIGRP PeerDown alarm(s) aren't suppressed when the device itself goes down in LM.
When we lost SNMP connectivity to one of our routers, it started raising PeerDown alarms (SNMP wasn't responding, which caused a 'NoData' condition on the 'upTime' datapoint).
This is a problem because the datapoint that checks the peer status is based on the data retrieved by the 'upTime' datapoint (which, at that point, is 'NoData').

Basically, if 'upTime' doesn't return data (which happens when the device itself goes down), it triggers an alarm for the PeerDown instances (since the expression always returns False).
LogicMonitor only considers the device 'down' after 5 minutes without data. This DS alarms first (PeerDown raises an alarm after 2 consecutive polls, which means 3 minutes).

As per the documentation, all alarms emanating from the host will be suppressed. My question (just to make sure): this only applies to alarms raised AFTER the host-down condition, correct?
If so, how can we work around this without increasing the time 'PeerDown' alarms take to appear in the console?

Is there any expression we could use in that ComplexDatapoint instead of the current one?
Currently, this one device being down caused 100 alarms in the console (since it's a central point for our EIGRP routing).

Thank you!

Regards,

  • 2 hours ago, Mike Moniz said:

    It will rely on ICMP and other things too...

    "...the idleinterval datapoint within the HostStatus DataSource measures the amount of time in seconds since the LogicMonitor Collector was able to collect data from your host via built-in collection methods (SNMP, Ping, WMI, ESX, etc.)... Note that data collected by script DataSources does not affect the value of the idleinterval datapoint."

    https://www.logicmonitor.com/support/logicmodules/datasources/creating-managing-datasources/host-status-host-behavior/

     

    Got it. I'll raise this internally, because it could be an issue for us.
    We have clients who deliberately don't give us ICMP access (though we do have SNMP access).

    Thank you for the info!

  • It will rely on ICMP and other things too...

    "...the idleinterval datapoint within the HostStatus DataSource measures the amount of time in seconds since the LogicMonitor Collector was able to collect data from your host via built-in collection methods (SNMP, Ping, WMI, ESX, etc.)... Note that data collected by script DataSources does not affect the value of the idleinterval datapoint."

    https://www.logicmonitor.com/support/logicmodules/datasources/creating-managing-datasources/host-status-host-behavior/

  • Ok, so I ended up doing it like this:

    - if(eq(snmpDown,1),2,if(un(upTime),0,1))

    It does the trick, thank you! 

    I've disabled SNMP on the device (to force the condition); however, LM doesn't see the device as dead.
    What exactly is needed for LM to consider a device dead? Does it rely on ICMP as well?
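The expression in the post above can be sanity-checked with a small Python model. This is only a sketch: `un`, `eq`, and `if_` are hypothetical Python stand-ins for the expression functions used in the thread, NaN stands in for 'NoData', and the reading of the results (0 raises the PeerDown alarm, 1 means the peer is up, 2 means SNMP is down and no peer alarm should fire) is an assumption drawn from the discussion, not from LogicMonitor's documentation.

```python
import math

NAN = float("nan")  # stand-in for a 'NoData' datapoint value


def un(value):
    """Model of un(): 1 when the value is NaN (no data), otherwise 0."""
    return 1 if math.isnan(value) else 0


def eq(a, b):
    """Model of eq(): 1 when both values are equal, otherwise 0."""
    return 1 if a == b else 0


def if_(condition, when_true, when_false):
    """Model of if(): a non-zero condition selects the first branch."""
    return when_true if condition != 0 else when_false


def peer_down(snmp_down, up_time):
    # Mirrors if(eq(snmpDown,1),2,if(un(upTime),0,1)) from the thread.
    return if_(eq(snmp_down, 1), 2, if_(un(up_time), 0, 1))


# SNMP fine and peer up, SNMP fine but peer gone, SNMP not answering at all.
for snmp_down, up_time in [(0, 12345.0), (0, NAN), (1, NAN)]:
    print(snmp_down, up_time, "->", peer_down(snmp_down, up_time))
```

In this model only the middle case (SNMP answering, 'upTime' missing) yields 0, which is the case the PeerDown alarm is meant to catch.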

  • You can nest if's together in the same kinda way you do in Excel. This is just off the top of my head and untested, but you would do something like:

    if(snmpDown,1,if(un(upTime),0,1))

     

  • Basically I want to do what the PeerDown expression currently does, but only if snmpDown == 0; otherwise, return 2 (or anything other than 0).

  • Ok, so I've added that try/catch to the actual script.

    It returns 0 if the SNMP portion goes well and 1 if it catches the timeout exception.
    I just moved the actual SNMP walk code into the try{} and added the one below as the catch().

    So now we're able to know if SNMP isn't working. I'm kinda lost on what to do at the 'PeerDown' datapoint (in terms of expressions). Can you help?
    I've never used the complex datapoint features before.

  • That's the basic idea. You can't create a complex datapoint via Groovy, so snmpDown would be a normal datapoint that you can then refer to in PeerDown. Also, I think you can just wrap the snmp.get/walk line or section in a try/catch, and that will tell you when the SNMP request failed.
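The try/catch idea can be sketched as follows. The DataSource in the thread is a Groovy script; this Python version only illustrates the control flow and does not use the Collector's SNMP API. The `walk` callable, the two stand-in walk functions, the example host address, and the assumption that a failed request surfaces as a `TimeoutError` are all hypothetical.

```python
def snmp_down_flag(walk, host, oid):
    """Return 0 when the SNMP walk succeeds, 1 when it times out."""
    try:
        walk(host, oid)
    except TimeoutError:
        # The request failed, so report snmpDown = 1.
        return 1
    return 0


# Hypothetical stand-ins for a real SNMP walk, used only for illustration.
def working_walk(host, oid):
    return {oid + ".0": "12345"}


def timing_out_walk(host, oid):
    raise TimeoutError("no SNMP response from " + host)


SYS_UPTIME_OID = "1.3.6.1.2.1.1.3"  # sysUpTime, a basic OID to probe

print(snmp_down_flag(working_walk, "192.0.2.1", SYS_UPTIME_OID))
print(snmp_down_flag(timing_out_walk, "192.0.2.1", SYS_UPTIME_OID))
```

The returned flag would be exposed as a normal datapoint (snmpDown), so that the PeerDown expression can refer to it.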

  • After checking the OIDs, I don't believe upTime can tell that difference.
    I'll try to leverage that 'general' change and see if it works for us. That's a great idea!

    Basically, we could just add a new complex datapoint (via Groovy) and try to poll a basic OID. If it doesn't return data, then assume SNMP isn't replying (snmpDown == 1).
    From there, just tweak the actual PeerDown to take that value into account before returning 0.

    Am I on the right track? Or did you have something simpler in mind?

    Thank you anyway for the input on this!

  • That is my understanding too. LM has server-side logic to declare a device dead after 6 minutes (but Host Status will alert after 5 minutes), so any alerts that occur before those 6 minutes will cause notifications.

    PeerDown uses the un() function, so it's specifically checking whether the value is NaN or not. I don't know how this particular DataSource or Cisco EIGRP works, so I'm not clear on whether upTime can tell the difference between peer down and switch down; there might be a trick to do so. But as a more generic solution, and since this is a script-based DataSource, I would likely add a new datapoint and code for something like snmpDown that reports 1 if SNMP isn't working (aka the device will be dead soon), and then modify PeerDown to also check that SNMP is working before alerting.
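The timing gap described in this thread can be put into numbers with a short sketch. The figures are the ones quoted by the posters (PeerDown alarm after 3 minutes, Host Status alert after 5 minutes, device declared dead after 6 minutes); the function and constant names are hypothetical.

```python
PEER_DOWN_ALERT_MIN = 3    # per the original post
HOST_STATUS_ALERT_MIN = 5  # per the replies
HOST_DEAD_MIN = 6          # per the replies


def fires_unsuppressed(alert_min, dead_min=HOST_DEAD_MIN):
    """True when an alert fires before the device is declared dead."""
    return alert_min < dead_min


gap = HOST_DEAD_MIN - PEER_DOWN_ALERT_MIN
print("PeerDown fires unsuppressed:", fires_unsuppressed(PEER_DOWN_ALERT_MIN))
print("Minutes between PeerDown alarm and dead status:", gap)
```

With these numbers, PeerDown alarms arrive well before the device is declared dead, which is why an snmpDown check inside the expression helps where suppression alone does not.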