MaddyM
2 years ago

Adjusting Host Status Idle Interval Alert

Has anyone adjusted the host status idle interval alert? I am constantly getting critical alerts for devices being down, only for the alert to clear after a minute or two. The datasource notes that LM does not recommend changing the alert, but it’s creating a lot of false positives. If you haven’t changed the alert, does anyone have suggestions for reducing these false positives?

  • mray
    LM Conqueror

    A lot of really good advice and feedback in this thread, and I agree with all of it. It’s best to first try to address, or at least understand, the underlying cause of these alerts. As others have mentioned, this could be indicative of poor Collector health: overload, tasks not executing properly, etc. I’m adding some links below for additional context.

    Info on monitoring your Collectors: https://www.logicmonitor.com/support/collectors/collector-management/monitoring-your-collector/

    More info on the Host Status DataSource here: https://www.logicmonitor.com/support/logicmodules/datasources/creating-managing-datasources/host-status-host-behavior

    Please feel encouraged to submit a support ticket if you’d like some assistance with troubleshooting.
    https://www.logicmonitor.com/support/about-logicmonitor/customer-support/get-support-resources 

  • We have seen this behavior when the collector is falling behind. If you haven’t already, I suggest adding the collector as a monitored resource and checking the Collector graphs to see whether it is falling behind. By default the collector will raise Warnings for TasksCountInQueue and a few other datapoints.

  • Anonymous

    One thing you can’t change is the poll interval: for “Internal” type DSs, the selected poll rate has no impact.

    Also, you shouldn’t change the threshold to anything other than 300, since that’s the internal value at which LM marks the host down, which suppresses all alerts opened after that point. So, if you make it more than 300, the device will already be marked as down by the time the threshold is crossed, and all other alert notifications (including the one that would tell you that the device is down) will be suppressed.

    I’m not sure what the impact of changing the trigger and clear intervals would be, though.

    Are you sure they’re false positives? Is it possible there’s something actually going on, but it resolves/heals quickly enough that you’re not seeing it? One thing to consider is that all of this is based on the collector’s ability to report data back to the LM platform on a continuous basis. Host status really does mean that the platform hasn’t gotten any data for that device in 5 minutes. What could cause all the datasources on a device to return no data all at the same time for 5 whole minutes? 

    Do your ping alerts open before the host status alert? Check the ping DS. Does it have data the whole time? If it does, there’s something broken with LM. If Ping has data, the host status alert shouldn’t be opening. 

    FWIW: we have not modified anything on the host status DS.

  • If you just don’t want to receive alerts (versus having them logged in the UI), one trick we’ve used is a delayed escalation chain with an empty stage 1. If you want the alerts not to exist at all, you’ll have to change the thresholds.
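
To make the point above about the 300-second threshold concrete, here is a minimal toy model in Python. It is only a sketch of the behavior as described in this thread, not LogicMonitor code: the function name `simulate_outage`, the 60-second step, and the `>=` comparisons are all assumptions made for illustration. The only value taken from the thread is the 300-second point at which the host is marked down and later alerts are suppressed.

```python
# Toy model of the behavior described in this thread. Nothing here calls
# LogicMonitor; every name below is hypothetical.

HOST_DOWN_AFTER_SECONDS = 300  # internal "host down" value cited in the thread


def simulate_outage(outage_seconds, alert_threshold, step=60):
    """Walk through an outage and record what happens at each step.

    outage_seconds:  how long the platform receives no data for the device
    alert_threshold: idle-interval value at which the host status alert opens
    step:            assumed evaluation interval in seconds (illustrative only)
    """
    events = []
    host_down = False
    alert_open = False

    for idle in range(step, outage_seconds + 1, step):
        # Suppression applies to alerts opened AFTER the host was marked down,
        # so compare against the state from before this step.
        was_down = host_down

        if not alert_open and idle >= alert_threshold:
            alert_open = True
            if was_down:
                events.append((idle, "alert opened, notification suppressed"))
            else:
                events.append((idle, "alert opened, notification sent"))

        if not host_down and idle >= HOST_DOWN_AFTER_SECONDS:
            host_down = True
            events.append((idle, "host marked down"))

    return events


if __name__ == "__main__":
    for threshold in (300, 600):
        print(f"threshold={threshold}")
        for idle, event in simulate_outage(900, threshold):
            print(f"  idle={idle}s: {event}")
```

Under these assumptions, a threshold of 300 opens the alert at the same moment the host is marked down, so the notification goes out. A threshold of 600 opens the alert only after the host is already down, so the notification is suppressed, which matches the warning above about not raising the threshold.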