Forum Discussion

Kelemvor
2 months ago

Does anyone have a Module just to check if SNMP is working at all?

Hi,

We often have servers that stop responding to SNMP checks for whatever reason. That generally turns CPU, File System, Uptime, etc. all into No Data responses. We generally have No Data set to not generate alerts so we don't get spammed.

However, if SNMP really does stop working, we'd want to know so we can go fix whatever the problem is. We currently have Host Uptime set to generate an alert on No Data, but it usually gets ignored because people assume something was recently rebooted.

So, I want to set up a check that does the most basic SNMP query it can, something that every device should respond to, and alerts if it doesn't get a response.

I looked in the Repository and didn't see anything official from LM for this, but did see a community-created Module that might work. It's called SNMP_Troubleshooter, and there are two of them for some reason. The older one has a newer version number, which is a bit odd. Anyway, all it does is:

import com.santaba.agent.groovyapi.snmp.Snmp;

def host = hostProps.get('system.hostname');

try{
    oid_value = Snmp.get(host, ".1.3.6.1.2.1.1.2.0");
    println "snmpStatus=0";
}
catch (e){
    println "snmpStatus=1"
}

return 0

Does anyone else use this, or something similar, to find machines where SNMP is dead? I see this person also has a WMI version, which would be great for when WMI dies.

Thanks.

  • Yep, I've also implemented SNMP and WMI checks like that in the past. While some standard DataSources do alert on No Data, many do not, so it's easy for SNMP to break without anyone noticing. I think I implemented something like what you have above, except as an instance that only appears when there is a problem. That way it doesn't show up in the resource tree for everything, and I don't need to track or graph that data. It also means there is a 15-minute delay in alerting, but that was acceptable.

  • The "official" way is to alert on No Data from the Uptime module. Also, depending on the environments you are monitoring, not everything will respond with a system hostname. I would suggest combining several datapoints in the check: uptime, hostname, and maybe interfaces. If all three fail, then you know SNMP isn't working.
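
    A rough sketch of that combined check, in the same Groovy style as the script in the original post. The three OIDs are the standard MIB-II sysUpTime, sysName, and ifNumber objects. The countResponding helper, the stub probe with its made-up device name, and the output datapoint names are all hypothetical and only there for illustration. In a collector script the probe would presumably wrap the same Snmp.get(host, oid) call the community module uses.

    ```groovy
    // Standard MIB-II objects: sysUpTime.0, sysName.0, ifNumber.0
    def oids = [
        ".1.3.6.1.2.1.1.3.0",
        ".1.3.6.1.2.1.1.5.0",
        ".1.3.6.1.2.1.2.1.0"
    ]

    // Hypothetical helper: counts how many OIDs the probe answers with a non-empty value.
    // An exception or a null/blank result counts as "no response".
    def countResponding(List<String> oidList, Closure probe) {
        return oidList.count { oid ->
            try {
                def value = probe(oid)
                return value != null && !value.toString().trim().isEmpty()
            } catch (Exception ignored) {
                return false
            }
        }
    }

    // In a collector script the probe would be something like (assumption):
    //   import com.santaba.agent.groovyapi.snmp.Snmp
    //   def host = hostProps.get('system.hostname')
    //   def probe = { oid -> Snmp.get(host, oid) }
    // Stub probe so the sketch runs anywhere: only sysName "answers".
    def probe = { oid ->
        if (oid == ".1.3.6.1.2.1.1.5.0") return "example-device"
        throw new IOException("timeout")
    }

    def responding = countResponding(oids, probe)
    println "snmpResponding=${responding}"
    // 1 means every probe failed, so SNMP is treated as down
    println "snmpStatus=${responding == 0 ? 1 : 0}"
    ```

    Alerting only when all probes fail avoids false alarms from devices that don't implement one of the objects, while the snmpResponding count still shows partial failures.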

  • And how about for WMI?

    There's a module called WMI Troubleshooter. All it runs is this:

    /*******************************************************************************
     *  © 2007-2024 - LogicMonitor, Inc. All rights reserved.
     ******************************************************************************/
    import com.santaba.agent.groovyapi.win32.WMI
    
    def hostName = hostProps.get("system.hostname")
    
    try {
        WMI.queryFirst(hostName, "select version from Win32_OperatingSystem", 60)
    }
    catch (IOException e) {
        return 1
    }
    catch (Exception e)
    {
        // Not an IOException. Print the exception to stderr and exit non-zero
        e.printStackTrace()
        return 3
    }
    
    return 0

    If it returns anything other than 0, it alerts. However, I have a couple of machines returning 0 while CPU and Uptime are both No Data, which means WMI is definitely not working properly.

    We just don't like the Uptime alert because everyone has been trained to ignore those because they alert any time a machine was rebooted in the last hour.  We want to have a check with a name that actually tells you something is wrong.

    Maybe I'll try a PowerShell command or something.

    • Joe_Williams

      The issue with WMI is that there is a difference between a straight WMI call, like your select version from Win32_OperatingSystem, and a CIM2 call.

      The UAC troubleshooter module might get you in the right direction to combine datapoints.

      I would recommend just tracking what comes back and adding that into a larger check. Does uptime return something? Does OS return something? Does X, Y, or Z return something? Then alert based on the combination.

  • It seems to be a problem with only some WMI classes. I found a server currently showing No Data for the System Uptime check. When I go on it and run some WMI queries, I get different results.

    As you can see below, one WMI call works fine and another does not.  When I do a Poll Now on the System Uptime check, it apparently uses the second one which is why it's failing.

    Once I run through the LM WMI fixer steps, both commands start working. Now I'm wondering why some WMI calls would work and others would not. I guess I need to change my checker script to use the one that System Uptime uses, since the CPU checks also use that one.
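
    One way to sketch that change is to run both the basic query and a performance-counter query and report each result separately. Win32_PerfRawData_PerfOS_System is only an assumption about which class the uptime check reads, so substitute whichever class actually fails. The checkQueries helper and the stub query are hypothetical. In a collector, the query closure would presumably wrap the WMI.queryFirst(hostName, wql, 60) call from the Troubleshooter script.

    ```groovy
    // Queries to test. The second class is an assumption; use whichever class fails for you.
    def queries = [
        os    : "select Version from Win32_OperatingSystem",
        uptime: "select SystemUpTime from Win32_PerfRawData_PerfOS_System"
    ]

    // Hypothetical helper: runs each query and maps its name to
    // 0 (a row came back) or 1 (exception, null, or empty result).
    def checkQueries(Map<String, String> wqlByName, Closure query) {
        return wqlByName.collectEntries { name, wql ->
            def status
            try {
                status = query(wql) ? 0 : 1
            } catch (Exception ignored) {
                status = 1
            }
            return [(name): status]
        }
    }

    // In a collector script the query would be something like (assumption):
    //   import com.santaba.agent.groovyapi.win32.WMI
    //   def hostName = hostProps.get("system.hostname")
    //   def query = { wql -> WMI.queryFirst(hostName, wql, 60) }
    // Stub so the sketch runs anywhere: the perf-counter class is "broken".
    def query = { wql ->
        if (wql.contains("PerfRawData")) throw new IOException("Invalid class")
        return [VERSION: "stub"]
    }

    def results = checkQueries(queries, query)
    results.each { name, status -> println "${name}Status=${status}" }
    ```

    Reporting one status per class makes it visible when the basic query succeeds while the counter class fails, which is the situation described above.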

    • Joe_Williams

      If the Fixer guide helped, it is most likely broken counters, so it couldn't return data, which would lead to an invalid class error. This happens on Windows from time to time because, well, it is Windows.

  • Did anyone find a solution for a simple datasource to alert if SNMP is not working?

    I know we can use Uptime to alert on 'no data' polls but agree with Kelemvor that more often than not these alerts get ignored.

    It would be nice to have something similar to (say) the VMware troubleshooter that appears if SNMP is not working, with the ability (perhaps via a property) to enable or disable it, since we are often doing ping-only monitoring of a device that doesn't support SNMP.
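
    A minimal sketch of such a property toggle, assuming a hypothetical resource property named snmp.check.enabled (not an existing LogicMonitor property). In a collector script the lookup would go through hostProps.get(...), as in the scripts earlier in the thread. A plain map stands in for it here so the logic can run anywhere.

    ```groovy
    // Hypothetical helper: the check stays on unless the property is
    // explicitly set to "false", "no", or "0" (case-insensitive).
    def snmpCheckEnabled(Map props) {
        def raw = props.get("snmp.check.enabled")
        if (raw == null) return true
        return !(raw.toString().trim().toLowerCase() in ["false", "no", "0"])
    }

    // Stand-ins for hostProps on two devices
    def snmpDevice = [:]
    def pingOnlyDevice = ["snmp.check.enabled": "false"]

    [snmpDevice, pingOnlyDevice].each { props ->
        if (snmpCheckEnabled(props)) {
            // e.g. run the Snmp.get probe from the original post here
            println "run the SNMP probe"
        } else {
            println "skipped: SNMP check disabled for this device"
        }
    }
    ```

    Defaulting to enabled means only the ping-only devices need the property set.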