User defined "host dead" status
There are two ideas that I need help maturing before talking to LM about them. Both have to do with how LM uses server side logic to declare a device dead.
We need the ability to designate what metric declares a device as dead/undead and when.
What
We have several customers who have devices, usually UPS, at remote sites all connected to a Meraki switch. The collector is not at the remote site, but connects over a VPN tunnel, which may be torn down due to inactivity or could be flaky for any other reason. When the VPN tunnel goes down, the devices alert that they have gone down. We have added monitoring to the tunnel and also get alerted when it goes down. However, we'd like to prevent the host down alerts when the only problem is that the VPN tunnel is down. RCA (or recently renamed DAM?) would likely solve this, but defining that mapping manually or through a topologysource is not scalable (plus visibility into the RCA logic is never been good).
Luckily, Meraki has an API where we can query the status of devices connected to the switch. During a tunnel outage, this API data shows that the device is still connected to the switch and online. Since it's a UPS, that's sufficient. We've built the datasource required to monitor the devices via the Meraki API. However, since it's a scripted datasource, it doesn't reset the idleInterval. (Insert link here to a really good training or support doc explaining how idleInterval works.) Since none of the qualifying datasources are working on the UPS during the VPN outage, the idleInterval eventually climbs high enough to trigger a host down alert. When the host is declared down, other alerts, like the alerts from this new Meraki Client Device Status DS, are suppressed.
How can this be remedied?
So, we need the supported and documented ability to use the successful execution of a collection script to reset the idleInterval. I know this is possible today as I've seen it in several of LM's modules. However, I've never seen official documentation on how to do it. LM's probably worried someone will add it to all their scripts, which wouldn't be the right thing to do.
When
I know I'm not the only one. I need control over the server side logic that determines when the idleInterval declares a device dead. In the example above, we get a slew of host down alerts when the VPN tunnel goes down. However, usually within a few minutes, the VPN gets reestablished and the collector reestablishes connectivity to the device and the idleInterval resets, thus clearing the alerts. With a normal datapoint, I'd just lengthen the alert trigger interval for the idleInterval datapoint. This would mean that the device would have to be down for 15 minutes, 20 minutes, however long I want before generating the alert. What's great is that now we can do that on the group level, so I can target these devices specifically and not alert on them unless they've been down for a truly unacceptable amount of time (i.e. not just a VPN going down and coming right back up).
However, the idleInterval datapoint is an odd one. Two things happen. One happens when you surpass the threshold defined on the datapoint. I can't remember what the default is, but in my portal, that's > 300 or 5 minutes. At 6 minutes server side logic, which has been inspecting the idleInterval, decides that the device is down which has implications on suppressing other alerts on the device.
As far as I can tell, lengthening the alert trigger interval on the idleInterval datapoint has no effect if the window would exceed the 6 minutes that the server side logic uses to declare the device down.
What do we need?
We need the ability to set the amount of time that the server side logic uses to declare the device down. We need to be able to set that for some devices and not others. So we need to be able to set it globally, on the group level, and on the device level. Preferably this could be set in the alert trigger interval on the idleInterval datapoint since this mechanism already exists globally and on the group and device levels. Knowing that this could be a confusing way of defining it (since it's measured in poll cycles not minutes/seconds) so, it could alternatively be done as a special property on the group(s)/device(s).
I'm interested in hearing your thoughts, even if you are an LMer.