This is something I've been thinking about for a while, but no related work has been prioritized.
Interestingly (or maybe just pedantically), we can't actually determine when an entire physical host is down; the best we can (and do) determine is that it is not reachable. We assume it's down once the idleInterval has gone a server-side, hard-coded period without resetting (as you've seen).
By the way, the idleInterval generally only resets for methods where we can assume that the host returning the data is the current LM Resource of the current execution context. If you have a host "host.example.com" and you apply a script module to it that gets data from "otherdeviceapi.com", that doesn't confirm to LM that "host.example.com" is still returning data. This might be an overly rigid assumption. At least offering the ability to toggle that on a given module would probably go a long way.
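To make the reset rule concrete, here is a rough sketch of the behavior as I've described it. This is not LM's actual implementation: the class, the method names, the 360-second threshold, and the `counts_as_host_data` flag (standing in for the per-module toggle I'm wishing for) are all made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical value; the real server-side period is hard-coded and not stated here.
DOWN_AFTER_SECONDS = 360


@dataclass
class HostState:
    hostname: str
    idle_interval: int = 0  # seconds since data was last confirmed from this host

    def tick(self, seconds: int) -> None:
        """Advance the idle counter as time passes without confirmed data."""
        self.idle_interval += seconds

    def record_collection(self, data_source_host: str,
                          counts_as_host_data: bool = False) -> bool:
        """Reset the counter only when the data can be attributed to this host.

        counts_as_host_data is the hypothetical per-module toggle: it lets a
        module that queries another endpoint still count as proof of life.
        Returns True if the counter was reset.
        """
        if counts_as_host_data or data_source_host == self.hostname:
            self.idle_interval = 0
            return True
        return False

    def presumed_down(self) -> bool:
        """Unreachable for long enough that we assume the host is down."""
        return self.idle_interval >= DOWN_AFTER_SECONDS


host = HostState("host.example.com")
host.tick(300)
host.record_collection("otherdeviceapi.com")  # data from elsewhere: no reset
host.tick(120)
print(host.presumed_down())
```

In this sketch the script module keeps returning data from "otherdeviceapi.com" the whole time, yet the host still ends up presumed down, because none of that data counts toward the reset.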
Back to host status: changing the alert threshold does nothing but delay your alert (or send it before our servers declare the host down, depending on which direction you go).
And yes, RCA is now DAM, Dependent Alert Mapping.
I would love to be able to take something like a SQL server and configure "down" to mean any of the following:
- SQL server is not responding
- SQL queries taking longer than N seconds to return
- Standard Host Down
We can also be much more certain about SQL being down vs. inaccessible than we can about the host itself being down vs. inaccessible, because we can see that SQL stopped while Ping is still working (hypothetically).
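Roughly, the reasoning looks like this. It's only a sketch: the function name and the status strings are invented for illustration, and it assumes we have a ping check and a SQL check against the same resource.

```python
def classify_sql_status(ping_ok: bool, sql_ok: bool) -> str:
    """Combine two checks to separate "service down" from "can't tell"."""
    if sql_ok:
        # SQL answered, so it is up regardless of what ping says.
        return "sql up"
    if ping_ok:
        # The host is reachable but SQL is not answering: the service itself is down.
        return "sql down"
    # Nothing responds: the host may be down or merely unreachable.
    return "unreachable"


print(classify_sql_status(ping_ok=True, sql_ok=False))
```

The point is that only the last branch is ambiguous. When ping works and SQL doesn't, there's a second signal that rules out a plain network problem.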
I've kicked around a few ways of doing this in my head, but my favorite is just being able to "declare" an alert as "this indicates a host/service down condition".
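As a sketch of what that "declare" idea could look like (purely hypothetical, since nothing like this exists in the product today as far as this post goes): each alert rule carries an `indicates_down` flag, and the resource counts as down as soon as any flagged alert is active. All names here are made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AlertRule:
    name: str
    indicates_down: bool = False  # hypothetical "this means down" declaration


@dataclass(frozen=True)
class ActiveAlert:
    rule: AlertRule


def down_reason(active_alerts: List[ActiveAlert]) -> Optional[str]:
    """Return the name of the first active alert declared as a down condition."""
    for alert in active_alerts:
        if alert.rule.indicates_down:
            return alert.rule.name
    return None  # no declared down condition is firing


rules = [
    AlertRule("SQL server is not responding", indicates_down=True),
    AlertRule("SQL queries taking longer than N seconds", indicates_down=True),
    AlertRule("Standard Host Down", indicates_down=True),
    AlertRule("High CPU"),  # an ordinary alert, not a down condition
]

active = [ActiveAlert(rules[3]), ActiveAlert(rules[0])]
print(down_reason(active))
```

The nice part of doing it with a flag is that the definition of "down" lives with the alert itself, so a SQL server and a plain Linux box can each have their own set of down conditions without any special-casing.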