Forum Discussion

Anonymous
3 months ago

User defined "host dead" status

There are two ideas that I need help maturing before talking to LM about them. Both have to do with how LM uses server-side logic to declare a device dead.

We need the ability to designate what metric declares a device as dead/undead and when

What

We have several customers who have devices, usually UPSes, at remote sites all connected to a Meraki switch. The collector is not at the remote site, but connects over a VPN tunnel, which may be torn down due to inactivity or could be flaky for any other reason. When the VPN tunnel goes down, the devices alert that they have gone down. We have added monitoring to the tunnel and also get alerted when it goes down. However, we'd like to prevent the host down alerts when the only problem is that the VPN tunnel is down. RCA (or recently renamed DAM?) would likely solve this, but defining that mapping manually or through a topologysource is not scalable (plus visibility into the RCA logic has never been good).

Luckily, Meraki has an API where we can query the status of devices connected to the switch. During a tunnel outage, this API data shows that the device is still connected to the switch and online. Since it's a UPS, that's sufficient. We've built the datasource required to monitor the devices via the Meraki API. However, since it's a scripted datasource, it doesn't reset the idleInterval. (Insert link here to a really good training or support doc explaining how idleInterval works.) Since none of the qualifying datasources are working on the UPS during the VPN outage, the idleInterval eventually climbs high enough to trigger a host down alert. When the host is declared down, other alerts, like the alerts from this new Meraki Client Device Status DS, are suppressed. 
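
In rough terms, the collection looks something like the sketch below (simplified for illustration; the real module is more involved, and the endpoint, property names, and client-matching logic here are assumptions rather than the actual code):

import groovy.json.JsonSlurper

// Hypothetical property names; the real datasource defines its own.
def apiKey    = hostProps.get("meraki.api.key")
def serial    = hostProps.get("meraki.switch.serial")
def clientMac = hostProps.get("system.macaddress")

// Ask the Meraki Dashboard API which clients the switch currently sees.
def conn = new URL("https://api.meraki.com/api/v1/devices/${serial}/clients").openConnection()
conn.setRequestProperty("X-Cisco-Meraki-API-Key", apiKey)

def clients = new JsonSlurper().parse(conn.inputStream)
def ups = clients.find { it.mac?.equalsIgnoreCase(clientMac) }

// 1 = the switch still reports the UPS as a connected client, 0 = it does not.
println "ClientConnected=${ups ? 1 : 0}"
return 0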

How can this be remedied?

So, we need the supported and documented ability to use the successful execution of a collection script to reset the idleInterval. I know this is possible today as I've seen it in several of LM's modules. However, I've never seen official documentation on how to do it. LM's probably worried someone will add it to all their scripts, which wouldn't be the right thing to do. 

When

I know I'm not the only one. I need control over the server-side logic that determines when the idleInterval declares a device dead. In the example above, we get a slew of host down alerts when the VPN tunnel goes down. However, usually within a few minutes, the VPN gets reestablished, the collector regains connectivity to the device, and the idleInterval resets, thus clearing the alerts. With a normal datapoint, I'd just lengthen the alert trigger interval for the idleInterval datapoint. This would mean that the device would have to be down for 15 minutes, 20 minutes, however long I want, before generating the alert. What's great is that we can now do that at the group level, so I can target these devices specifically and not alert on them unless they've been down for a truly unacceptable amount of time (i.e. not just a VPN going down and coming right back up).

However, the idleInterval datapoint is an odd one: two things happen with it. The first is when you surpass the threshold defined on the datapoint; I can't remember what the default is, but in my portal that's > 300, or 5 minutes. The second is at 6 minutes, when server-side logic, which has been inspecting the idleInterval, decides that the device is down, which has implications for suppressing other alerts on the device.

As far as I can tell, lengthening the alert trigger interval on the idleInterval datapoint has no effect if the window would exceed the 6 minutes that the server-side logic uses to declare the device down.

What do we need?

We need the ability to set the amount of time that the server-side logic uses to declare the device down. We need to be able to set that for some devices and not others, so we need to be able to set it globally, at the group level, and at the device level. Preferably this could be set via the alert trigger interval on the idleInterval datapoint, since this mechanism already exists globally and at the group and device levels. Granted, that could be a confusing way of defining it, since the trigger interval is measured in poll cycles rather than minutes/seconds (e.g. a trigger interval of 10 on a datapoint polled every 60 seconds works out to roughly 10 minutes), so it could alternatively be done as a special property on the group(s)/device(s), along the lines of the sketch below.
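
Purely as an illustration (this property does not exist today; the name and units are made up), such a property might look like:

# Hypothetical property, settable globally, on a group, or on a device:
# seconds the idleInterval may grow before the server-side logic declares the device dead.
hoststatus.deadinterval.seconds=1800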

I'm interested in hearing your thoughts, even if you are an LMer.

  • I agree this would be a boon. I regularly process alerts which have time-to-ticket SLAs of 20 minutes. Because the dead timer kicks in after 6 minutes, our HostStatus/idleInterval alerts generate at 5 minutes and are then sent to an escalation chain with 15 minutes of empty stages.
    This results in tickets generating on time, but also means our alert history is full of erroneous, 5-10 minute "outages".
    Besides the clutter and general degradation of the alert views, this means we also have to perform all auditing of outage counts and frequency from our PSA. Not necessarily the end of the world, but certainly a degradation of LM's reporting functionality for us.


    Separately, we have issues with devices monitored only via API, in particular the latest VMware SDWAN datasources. 
    To facilitate effective NetScan discovery, these resources have unique identifiers for hostnames which aren't pingable. So we don't have Ping or DNS available to update the idleInterval, and we rely entirely on the API feedback to determine whether an edge is up or down. HostStatus is non-functional and irrelevant for these resources.

    Due to our scale and rate-limiting issues, the most frequently updated datasource only runs every 5 minutes. Since this is the fastest collection, processing delays often result in resources being labeled as Dead when they aren't. This makes for a messy interface and raises concerns with clients who keep their eyes on the portal.
    I'm actively working on a solution, but suffice it to say that we can't generate additional API queries without risking rate limiting, and changing the resources' hostnames to an IP, either manually or automatically, will create endless conflicts with the NetScan discovery script.

    Being able to adjust the dead timer would resolve this outright. 

    Alternatively, as you mentioned, having an understanding of the scripting methods used by LM to update the idleInterval would be extremely helpful in cases like this. Then we could potentially implement datasources to keep certain resource types, like those edges, alive.

    • Mike_Moniz
      Professor

      As a workaround (and it's not documented; agree it should be), here is code that will tell LM the device is live. It can be added to an existing Groovy DataSource.

      import com.santaba.agent.live.LiveHostSet

      // Update the liveHostSet to tell the collector we are happy (updates Host Status)
      hostId = hostProps.get("system.deviceId").toInteger()
      liveHostSet = LiveHostSet.getInstance()
      // Flag this host as live so the server-side idleInterval resets
      liveHostSet.flag(hostId)
      

      For others: just be careful with this. Since this is designed to bypass LM's built-in dead-device detection, you want to determine that the device is really still up before calling this code. But it may help you code around limits with the existing dead-device methods.
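
      For example, a guarded version might look like this (deviceConfirmedUp is just a placeholder for whatever your script actually verifies, e.g. the Meraki API reporting the client as online):

      import com.santaba.agent.live.LiveHostSet

      // Placeholder: set this to true only when this poll positively confirmed the
      // device is up (e.g. the Meraki API reported the client as connected).
      def deviceConfirmedUp = false

      if (deviceConfirmedUp) {
          // Safe to reset Host Status: we have real evidence the device is alive.
          def hostId = hostProps.get("system.deviceId").toInteger()
          LiveHostSet.getInstance().flag(hostId)
      }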

      This is likely also unsupported by LM, although I got the code from some of their own DataSources.

      • Patches
        Neophyte

        Much appreciated, Mike. This cleaned up my "dead" API-only devices. It'll be a solid stopgap until we get some form of official resolution.

  • Just replying to express my support. You've covered my thoughts already. 

  • This is something I've been thinking about for a while, but no related work has been prioritized.

    Interestingly (or maybe just pedantically), we can't actually determine when an entire physical host is down; the best we can (and do) determine is that it is not reachable. We assume it's down after a server-side, hard-coded period of the idleInterval not resetting (as you've seen).

    By the way, the idleInterval generally only resets for methods where we can assume that the host returning the data is the current LM Resource of the current execution context. If you have a host "host.example.com" and you apply a script module to it that gets data from "otherdeviceapi.com", that doesn't confirm to LM that "host.example.com" is still returning data. This might be an overly rigid assumption. At least offering the ability to toggle that on a given module would probably go a long way.

    Back to Host Status: changing the alert does nothing but delay your alert (or send it before our servers declare the device down, depending on which direction you go).

    And yes, RCA is now DAM, Dependent Alert Mapping.

    I would love to be able to do something like, take a SQL server, and configure "down" to be any of the following:
    - SQL server is not responding
    - SQL queries taking longer than N seconds to return
    - Standard Host Down

    We can also be much more certain about SQL being down vs inaccessible than we can about the host itself being down vs inaccessible, because we can see that SQL stopped but Ping is still working (hypothetically).
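
    To make the first two bullets concrete, checks like these can already be scripted in a Groovy DataSource today; here is a rough sketch (driver availability on the collector, property names, and datapoint names are assumptions, not an existing module):

    import groovy.sql.Sql

    // Hypothetical property names; a real module would define its own.
    def host = hostProps.get("system.hostname")
    def user = hostProps.get("jdbc.mssql.user")
    def pass = hostProps.get("jdbc.mssql.pass")

    def start = System.currentTimeMillis()
    try {
        // Assumes the SQL Server JDBC driver is present on the collector.
        def sql = Sql.newInstance("jdbc:sqlserver://${host}:1433", user, pass,
                                  "com.microsoft.sqlserver.jdbc.SQLServerDriver")
        sql.firstRow("SELECT 1")   // trivial query just to prove SQL is answering
        sql.close()
        println "SqlResponding=1"
        println "SqlResponseMs=${System.currentTimeMillis() - start}"
    } catch (Exception e) {
        // SQL did not answer, even though Ping/HostStatus may still be fine;
        // that is exactly the "service down vs. host down" distinction above.
        println "SqlResponding=0"
    }
    return 0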

    I've kicked around a few ways of doing this in my head, but my favorite is just being able to "declare" an alert as "this indicates a host/service down condition".

  • "We need the ability to set the amount of time that the server-side logic uses to declare the device down."

    So what is the thought on this part of the request others are making?

    • Mike_Rodrigues
      Product Manager

      Seems like a reasonable first step toward opening up the configurability of Host Down.

      I'd love to remove the server-side logic altogether and just let users decide what constitutes "down" for a given host. We'd of course have some reasonable defaults.