
David_Lee
8 years ago

497 days and counting........

You might have received an alert saying your Linux-based device has just rebooted, even though you know it has been up for a long time.

A switch might have just sent an alert for every interface flapping when they have all been up solidly.

The important question to ask here is: how long has the device been up?

If it's been up for 497 days, 994 days, 1491 days, or any other multiple of 497, then you are seeing the 497-day bug, which hits almost every Linux-based device that stays up for a good length of time.

Anything using a kernel older than 2.6 computes the system uptime from the internal jiffies counter, which counts the time since boot in units of 10 milliseconds, or jiffies. This is a 32-bit counter, so it can take 2^32 (4,294,967,296) distinct values, with a maximum value of 4,294,967,295.

When the counter reaches this value (after 497 days, 2 hours, 27 minutes, and 53 seconds, or approximately 16 months), it wraps back around to zero and continues to increment.
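
As a rough check of the figures above, here is a minimal Python sketch. It assumes a 32-bit counter ticking 100 times per second, as described in the post; the function names are made up for illustration.

```python
TICKS_PER_SECOND = 100   # one jiffy = 10 ms, as described above
WRAP = 2**32             # number of distinct values of a 32-bit counter

def reported_ticks(true_ticks):
    """Ticks a 32-bit counter would show after true_ticks real ticks."""
    return true_ticks % WRAP

def wrap_period():
    """Time until the counter wraps, as (days, hours, minutes, seconds)."""
    total_seconds = round(WRAP / TICKS_PER_SECOND)  # 42,949,672.96 s, rounded
    days, rem = divmod(total_seconds, 86_400)
    hours, rem = divmod(rem, 3_600)
    minutes, seconds = divmod(rem, 60)
    return days, hours, minutes, seconds

print(wrap_period())               # (497, 2, 27, 53)
print(reported_ticks(WRAP + 500))  # 500: looks like 5 seconds of uptime
```

Just after the wrap the counter reads a few hundred ticks, which is exactly what a freshly booted device would report.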

This can result in alerts about reboots that didn’t happen and cause switches to report a flap on all interfaces.

Systems that use a 2.6 kernel and properly supply a 64-bit counter will still alert incorrectly when the 64-bit counter wraps.

A 32-bit counter can hold 4,294,967,295 (4,294,967,295 / 8,640,000 ticks per day = 497.1 days).

A 64-bit counter can hold 18,446,744,073,709,551,615 (18,446,744,073,709,551,615 / 8,640,000 = 2,135,039,823,346 days, or 5,849,424,173 years).
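
The same division can be reproduced for any counter width. This is only a sketch: it assumes 100 ticks per second (8,640,000 ticks per day) and 365-day years, as in the figures above, and the helper name is hypothetical.

```python
TICKS_PER_DAY = 24 * 60 * 60 * 100   # 8,640,000 ticks at 100 ticks/second

def whole_days_until_wrap(bits):
    """Whole days a counter of the given width can count before wrapping."""
    return (2**bits - 1) // TICKS_PER_DAY

for bits in (32, 64):
    days = whole_days_until_wrap(bits)
    print(bits, "bit:", days, "days, about", days // 365, "years")
```

Integer division is used so the 64-bit result stays exact rather than being rounded by floating point.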

Though I expect in 6,000 million years we will all have other things to worry over.

  • We handle this properly in LogicMonitor now, if you import these modules:

    DataSource: SNMP_HostUptime_Singleton - 42439N
    PropertySource: addCategory_snmpUptime - T9WFM4

    These replace SNMPUptime-.
     

  • Is it possible to add more detail on this topic?  I tried to implement the fix last night for a couple Cisco fabric interconnect switches but it didn't seem to work.  I have several "Uptime" datasources now and I don't know which one includes the fix or what device types they apply to.  I also have a question about how this fix works.  Is the fix purely on the LM datasource side or is a software update of some kind required on the Cisco side?

    I now have at least 5 different datasources that appear to monitor uptime.  What's the best way to identify which ones include the fix?

    Host Uptime-  SNMP OID field says .1.3.6.1.2.1.25.1.1

    HostUptime-  SNMP OID field says .1.3.6.1.2.1.25.1.1

    SNMP_Engine_Uptime-  SNMP OID field says 1.3.6.1.6.3.10.2.1.3

    SNMP_HostUptime_Singleton  SNMP OID field says .1.3.6.1.2.1.25.1.1.0

    SNMPUptime-  SNMP OID field says .1.3.6.1.2.1.1.3

     

  • @Kwoodhouse the one that includes the fix is SNMP_HostUptime_Singleton. It requires the addCategory_snmpUptime PropertySource to work without manual intervention.

    "HostUptime-" (no space) is deprecated and no longer in core. Unfortunately there's no way for you to get that information in your account currently.

    SNMPUptime- and SNMP_Engine_Uptime- are more or less duplicates. They both get the uptime for the agent, not the host. This seems to be an oversight.

    Originally, we just looked at the uptime counter with a gauge datapoint. If the value indicated uptime of less than 60 seconds, we'd alert. Of course, this happens during a counter wrap. To fix it, we started tracking the uptime counter with a counter. Given that the rate of time is constant, we should always see the rate of 100 ticks/second coming back from the counter datapoint if the host hasn't been rebooted.

    The logic in the UptimeAlert CDP looks at both that tick rate, and the raw uptime to determine if the host has rebooted, or the counter has just wrapped. If it's just a counter wrap (no reboot), we'll see 100 ticks/second, even if we see less than 60 seconds of uptime with the gauge. If it's rebooted, the UptimeCounter datapoint could return either No Data (counters need 2 consecutive polls), or, it will return a huge value because no polls were missed, and LM assumed the counter wrapped when it was really reset due to reboot.

    This is explained in the datapoint descriptions, but is admittedly a bit difficult to grok without an intimate understanding of how LM's counter/derive work. I do still think it's a rather ingenious solution.

    We use "102" instead of "100" ticks/second in the CDP to avoid false positives, as the collection interval isn't always exactly a minute.

    I recommend this blog if you're interested in learning more about counter/derive: https://www.logicmonitor.com/blog/the-difference-between-derive-and-counter-datapoints/

    I will talk to the Monitoring team about removing some of those duplicates, and getting a public document up explaining it all.
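
The reboot-versus-wrap reasoning in the reply above can be sketched roughly as follows. This is not the actual UptimeAlert expression from the LogicMonitor module. It is an illustrative Python approximation that assumes a 32-bit tick counter, a counter-style rate that treats any decrease as a wrap, and the 60-second and 102 ticks/second thresholds quoted in the reply. All function names are hypothetical.

```python
COUNTER_SIZE = 2**32     # assumed 32-bit tick counter
TICK_RATE_LIMIT = 102    # ticks/second threshold quoted in the thread
MIN_UPTIME_S = 60        # uptime below this looks like a reboot on a gauge

def counter_rate(prev_ticks, cur_ticks, interval_s):
    """Per-second rate of a counter, treating any decrease as a wrap."""
    if prev_ticks is None:
        return None                  # first poll: "No Data"
    delta = cur_ticks - prev_ticks
    if delta < 0:
        delta += COUNTER_SIZE        # assume the counter wrapped
    return delta / interval_s

def looks_rebooted(uptime_s, rate):
    """True if low uptime cannot be explained by a counter wrap."""
    if uptime_s >= MIN_UPTIME_S:
        return False                 # plenty of uptime, nothing to decide
    if rate is None:
        return True                  # low uptime and no rate yet
    return rate > TICK_RATE_LIMIT    # wrap gives ~100/s, reboot gives a huge rate

# Genuine wrap: 30 s before the wrap, then 30 s after it.
wrap_rate = counter_rate(COUNTER_SIZE - 3000, 3000, 60)
print(wrap_rate, looks_rebooted(30, wrap_rate))      # 100.0 False

# Reboot after roughly 115 days: the counter resets but is read as a wrap.
reboot_rate = counter_rate(1_000_000_000, 3000, 60)
print(looks_rebooted(30, reboot_rate))               # True
```

In both cases the gauge alone shows 30 seconds of uptime; only the rate tells them apart.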

  • Hi Michael. 

    After spending two weeks troubleshooting this with LogicMonitor support, it turns out that the updated datasources will not fix the problem for my devices. The initial support rep said these datasources would solve the problem, and this blog post makes it seem as though updated datasources are all you need.

    I think a more detailed write-up here would be great, letting people know that while LogicMonitor has updated datasources that are capable of resolving the problem, the requirement is that your hosts respond to .1.3.6.1.2.1.25.1.1.0. If you update your software to the latest version and still don't get an snmpwalk response from OID .1.3.6.1.2.1.25.1.1.0, then these updated datasources will NOT resolve the 497-day uptime issue. At that point you need to work on an ssh/telnet datasource to check the uptime from the CLI, or train your internal team to realize that if the uptime was 497 days it's likely a false alarm. Maybe having a better alert message containing info about a known issue regarding 497 days of uptime would also be good.

    Either way, having this info included here would likely have saved me two weeks of troubleshooting. Hopefully this helps someone else. The devices involved for me were Cisco fabric interconnect switches, UCS-FI-6248UP with the latest 3.2.3i software.

    Best Regards. 

  • Hey @Kwoodhouse, sorry for the confusion. The fix does rely on your host reporting system uptime as defined in the Host Resources MIB (specifically, hrSystemUptime at .1.3.6.1.2.1.25.1.1.0).

    If that OID doesn't return anything, we fall back to using snmpEngineTime. This isn't necessarily the uptime of the system, but rather the uptime of the SNMP agent, and it will reset with the agent even if the system does not reboot. The fix was never ported to the module that retrieves Engine Uptime, but it should be easy enough to do. I've put a fix in with the ME team to get this done.

    I did go ahead and update the alert message in the meantime. Thanks for bringing this to our attention!
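
To check by hand whether a device will benefit from the fix, one option is to query hrSystemUptime and see whether a Timeticks value comes back. The Python sketch below only parses text in the style printed by Net-SNMP's snmpget; it does not query a device itself. The sample strings are invented for illustration, and the helper names are hypothetical.

```python
import re

HR_SYSTEM_UPTIME = ".1.3.6.1.2.1.25.1.1.0"   # hrSystemUptime OID from the thread
TIMETICKS = re.compile(r"Timeticks: \((\d+)\)")

def parse_timeticks(snmpget_output):
    """Return ticks (hundredths of a second), or None if no Timeticks value."""
    match = TIMETICKS.search(snmpget_output)
    return int(match.group(1)) if match else None

def uptime_source(hr_output):
    """Which uptime value would be used, given the hrSystemUptime response."""
    if parse_timeticks(hr_output) is not None:
        return "hrSystemUptime"
    return "snmpEngineTime fallback"

# Invented sample responses, for illustration only.
ok = "HOST-RESOURCES-MIB::hrSystemUptime.0 = Timeticks: (8912330) 1 day, 0:45:23.30"
missing = "HOST-RESOURCES-MIB::hrSystemUptime.0 = No Such Object available on this agent at this OID"

print(uptime_source(ok))        # hrSystemUptime
print(uptime_source(missing))   # snmpEngineTime fallback
```

If the device lands in the fallback case, the wrap handling described earlier in the thread does not apply to it, which matches what was seen on the fabric interconnects.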