Forum Discussion

Michael_Dieter
4 years ago

Ping Datasource periodic failure after Collector upgrade: any ideas?

Hi All

I have an open case on this, but since we haven't been able to solve it yet, I'm throwing this out there and hoping someone might take the time to read it and be willing to share an idea.

Details: the collectors run on dedicated Windows Server 2012 R2 VMs, organized into three 2-member ABCGs (Auto-Balanced Collector Groups), on top of VMware 6.7.0, on top of Nutanix AOS (TBD) on SuperMicro hardware.

Last fall I upgraded both members of an ABCG from 28.005 to 29.001. Though it initially seemed successful, about 1-2 weeks later all of the (Juniper Networks) switches monitored by one of the collectors lit up my Alerts tab with total failure of the Ping datasource. By the time I had determined that all of the SNMP datasources were still reporting data for each of these switches and that there were no obvious network issues, about an hour had gone by and the switches' Ping began to return to success. The same event occurred again within a few days; after ruling out network and VMware issues, I simply rebooted the server supporting the collector and when it came back the issue was resolved. I decided to roll back to collector 28.005 for the time being: the issue has not occurred a single time since rolling back to 28.005, and it has never occurred on our 3rd ABCG, which is also at 28.005 and also has a significant number of Juniper switches under monitoring.

About 6-8 weeks later, I tried to upgrade a different 2-member ABCG from 28.005, this time to 29.003. Initially it was successful, and then this same issue started occurring again, only for the Juniper Networks switches the collector was monitoring. It happened many times, and after being unable to identify a cause on my own I opened a support case, which has now been open since January. Even with their help we haven't been able to identify the issue, and after an escalation we came up with a list of potential actions on my part, all of which are tedious and/or time-consuming and/or generally a big PITA. So I decided to take a sledgehammer approach: I went to our server team and requested all new Win 2016 servers (there seem to be some lingering memory issues with Win 2019 and collectors, so the recommendation is to avoid Win 2019 for now), on which I installed brand new 30.000 collectors. It worked great for nearly a week and then guess what.....yup, the issue occurred twice over this past weekend, on 2 collectors in different ABCGs, and this time the duration was 3-4 hours instead of 60-90 minutes.

Here's what I can tell you:

  • all collectors and monitored switches are on common campus infrastructure, every device in the data plane is under our control on our fiber and/or copper cabling (no WAN links are crossed)
  • there is no firewall enforcement between collectors and the switches involved
  • there is no ACL enforcement on the switches involved that would produce this result
  • there is no Windows firewall enforcement on the servers hosting the collectors
  • snmp between the collector and the switches continues to work properly while the collector can't successfully ping the switches
  • independent monitoring sourced in the same subnet as the collectors and destined for the same switches does not fail when the collectors can't ping the switches
  • some of the switches have a continuous ping running toward the collector, which does not fail when the collector can't ping that switch
  • the collectors successfully ping the remote router interface that serves as the gateway address for a switch even when the collector is not able to ping the switch itself
  • during the time the issue is in progress, launching a ping to an affected switch from the command prompt on the relevant collector also times out (a rough snapshot script to timestamp this is sketched after this list)
  • none of the switches (or intermediate routers) log any events of interest at the start of the time the collector can't ping a switch, during the occurrence of the event, or at the resumption of successful ping from collector to switch
  • the server team reports no log events of interest (Nutanix, VMWare, or server OS) at the start of the time the collector can't ping a switch, during the occurrence of the event, or at the resumption of successful ping from collector to switch
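
For what it's worth, the "launch a ping / check the gateway" routine I go through during an event boils down to something like the rough Python sketch below, run on the affected collector (the switch names and gateway IPs are placeholders, not our real ones):

```python
# Rough sketch only: snapshot pings + tracert from the affected collector
# during an event. Switch names and gateway IPs are placeholders.
import datetime
import subprocess

# (switch, its gateway) pairs -- placeholders, not our real devices
TARGETS = [
    ("switch-a.example.net", "10.0.1.1"),
    ("switch-b.example.net", "10.0.2.1"),
]

def run(cmd):
    """Run a command and return its stdout as text."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout

with open("event_snapshot.log", "a") as log:
    stamp = datetime.datetime.now().isoformat()
    for switch, gateway in TARGETS:
        log.write(f"\n===== {stamp} {switch} =====\n")
        # Windows ping: 4 echoes, 1000 ms timeout each
        log.write(run(["ping", "-n", "4", "-w", "1000", switch]))
        log.write(f"----- gateway {gateway} -----\n")
        log.write(run(["ping", "-n", "4", "-w", "1000", gateway]))
        # tracert without DNS lookups, 1000 ms per-hop timeout
        log.write(run(["tracert", "-d", "-w", "1000", switch]))
```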

Anybody have any ideas....no matter how basic or simple?  

 

  • Anonymous

    Problem domain isolation: when the problem happens, if you are in a terminal window on the collector server (not the collector debug console), you can't ping the target device? If that's true, then you have eliminated one problem domain: the LogicMonitor Collector component. If the server is having the issue regardless of what is installed on it, it's not what's being installed on it. I know it'd be a PITA, but could you stand up a collector server and not install the collector, to see if you still experience the issue? What about a tracert while the problem is happening?

  • Hey Stuart

    Thanks. Yes, I know. When the issue is occurring per the Ping datasource and I open a command prompt on the relevant collector, manually launching a ping and a traceroute both fail. That is a pretty clear indicator of where the problem lies.

    But it is also a way to try to eliminate problem domains by looking at what variable has changed....and in this situation the compute and storage hardware didn't change, the hypervisor didn't change, the host OS didn't change, and the network connectivity (physical and logical) didn't change. The only thing in the stack that changed is the collector version, and when I undo that change by rolling the collector version back, the issue goes away.

    Seems like yes, the issue is in our environment, but the triggering event is a stack interaction that occurs differently (or not at all) before 29.001.

    Hugely frustrating, and moving forward means stepping into a giant time-sucking rabbit hole trying to isolate variables to identify that interaction. On the other hand, I have considered a second sledgehammer: simply eliminating use of the Ping datasource, but everybody who may be reading this could probably come up with several reasons why that might not be a good approach. So hopefully someone will post an idea that helps focus the search really narrowly and is achievable at a relatively low cost. I'm giving it a few days, but that sledgehammer is leaning against the wall right over there.....

  • Anonymous
    15 hours ago, Stuart Weenig said:

    I know it'd be a PITA, but could you stand up a collector server and not install the collector, to see if you still experience the issue?

     

  • I don't have the authority or access to stand up servers on my own, but what I can/will do is use one of the W2012 servers supporting the original collectors:

    • uninstall collector
    • remove collector from monitoring
    • reboot server
    • launch multiple continuous pings to the relevant switches (a rough logging sketch is below, after this list)
    • wait, watch, and note ping status when those switches' Ping datasource (monitored by the new collector) fails

    This should be in place by the end of the day as soon as I finish migrating all resources into the new ABCGs composed of new W2016 servers/collectors.
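
    The "multiple continuous pings" piece is nothing fancy; roughly something like this Python sketch left running on the old W2012 box (the target list is a placeholder):

```python
# Rough sketch only: continuous ping logger for the old W2012 box
# (collector uninstalled). Target names are placeholders.
import csv
import datetime
import subprocess
import time

TARGETS = ["switch-a.example.net", "switch-b.example.net", "switch-c.example.net"]

def ping_once(host):
    """One Windows ping (1 echo, 1000 ms timeout); a reply line contains 'TTL='."""
    result = subprocess.run(
        ["ping", "-n", "1", "-w", "1000", host],
        capture_output=True, text=True,
    )
    return "TTL=" in result.stdout

with open("continuous_ping.csv", "a", newline="") as f:
    writer = csv.writer(f)
    while True:
        stamp = datetime.datetime.now().isoformat()
        for host in TARGETS:
            writer.writerow([stamp, host, "ok" if ping_once(host) else "TIMEOUT"])
        f.flush()       # make sure rows hit disk even if this runs for days
        time.sleep(5)   # poll every 5 seconds
```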

     

  • OK kids.  Good news and bad news to report.

    GOOD: All devices are migrated to new W2016 servers running freshly installed 30.000 collectors. In addition, one of the old collectors has been deleted from our portal and confirmed uninstalled from the server hosting it. That server is now running multiple continuous pings to a range of switches that have been (and continue to be) involved in this issue.....and these pings continue successfully, uninterrupted, even when the datasource running on the new collector is reporting 100% loss.

    BAD: the issue continues to occur, without a pattern I can determine and without any thus-far identifiable causal or associated events that trigger or clear the issue.

    GOOD: since the issue continues, I'm getting ample opportunity to collect situational, contextual information.

    BAD: I've already burned several hours on this this morning, and I haven't yet been able to turn any of this information into a narrowed list of prospective solutions that seem worth further exploration.

    ???

  • Hi friends, it's been a few weeks, so here's an update:

    • the issue continues to occur, though not within the last week or so, and the most recent occurrences have involved only a handful of switches instead of the usual 20+. There's nothing else to corroborate this with, so I am just chalking it up to the randomness of whatever is driving this
    • the issue is replicated across all 3 of our 2-member ABCGs
    • again, when the issue occurs there is not a corresponding loss of ping between one of the old servers (collector uninstalled) and a switch being impacted
    • I have discovered, though, that when the issue occurs the collector responsible for that/those switch(es) cannot ping nearly* all of our switches, even those switches whose responsibility is owned by a different ABCG. I say "nearly*" b/c the handful of switches that reside in our data center(s) (which are the closest in terms of layer 1, layer 2 switching, and layer 3 hops) seem to remain ping-responsive even when the issue is otherwise occurring. This point seems likely to be a clue to keep in mind.
    • I've scrolled through successful and failing packet captures taken right on the collectors without being able to identify a difference (a quick way to tally them is sketched after this list)
    • when the issue is in progress, the TCP UDP Stats datasource on affected switches does not miss a beat and all SNMP datasources continue uninterrupted. This seems likely to be an important point too, tending to support the idea that we don't have a "routing/switching end-to-end" connectivity problem. Though it doesn't say anything about the treatment individual routers or switches may or may not be applying, or under what circumstances that treatment could differ.
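
    On the packet captures: rather than keep eyeballing them, a quick tally along the lines of the sketch below (Python + scapy; the capture filenames are placeholders) at least answers whether the echo requests are leaving the collector at all and whether any replies come back:

```python
# Rough sketch only: tally ICMP echo requests/replies per switch in a capture
# taken on the collector. Filenames are placeholders for good/failing captures.
from collections import Counter
from scapy.all import ICMP, IP, rdpcap

def tally(pcap_file):
    """Count echo requests (by destination) and echo replies (by source)."""
    requests, replies = Counter(), Counter()
    for pkt in rdpcap(pcap_file):
        if pkt.haslayer(IP) and pkt.haslayer(ICMP):
            if pkt[ICMP].type == 8:      # echo request leaving the collector
                requests[pkt[IP].dst] += 1
            elif pkt[ICMP].type == 0:    # echo reply coming back
                replies[pkt[IP].src] += 1
    return requests, replies

for name in ("capture_good.pcap", "capture_failing.pcap"):
    requests, replies = tally(name)
    print(f"--- {name} ---")
    for switch in sorted(set(requests) | set(replies)):
        print(f"{switch}: {requests[switch]} requests, {replies[switch]} replies")
```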

    A next step would probably be capturing packets wire-by-wire/device-by-device/hop-by-hop between the collector(s) and switch(es) to try to identify where the pings actually disappear. That's a significant investment of time and effort, made more difficult by the fact that it's not a configuration that can be left in place indefinitely while waiting for the issue to occur. Even then, success on that front might not expose why this never happened before collector 29.001.

    So the Ping datasource in our RG:SWITCHES/* will remain in SDT for the rest of this week, after which I think I'll apply the NoPing property to the group and move on. Maybe one day all those socks that have gone missing in the laundry will turn up, scientists will pinpoint the origin of SARS-CoV-2, and a light bulb will go off in my head with a solution to this problem.

  • Anonymous

    Yeah, my gut instinct is that some dynamic filtering is happening somewhere on your network. If one protocol works and others don't, that leans toward a firewall. IPS maybe?

  • I'm with you on the dynamic filtering wavelength, but it's not being enforced by a FW or IPS.....there is neither between the collectors and the switches at L2 or L3. Another possibility would be a failure of LACP load-balancing hashing between 2 devices using link aggregation between them, resulting in some frames disappearing into a black hole. Juniper's default mechanism relies on L2 source & destination. Where this idea runs into trouble is that L2 src & dst are static (unless they are not, but that requires more effort to run down and there are no indicators of the concurrent issues that would be expected with fluctuating L2 src or dst); frames disappearing into a black hole would be doing so independent of tcp/udp/ip/icmp payload, and it seems too unlikely that icmp is the only thing that ever vanishes (the toy sketch below illustrates the point). Still, this thread could be investigated further, but of course at a high cost in terms of time & effort.
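
    To show why I'm lukewarm on the LACP theory: with a hash keyed only on L2 source/destination, every frame between the same pair of MACs lands on the same member link regardless of payload. The toy sketch below is a deliberately simplified stand-in, not Juniper's actual hash, and the MACs are made up:

```python
# Toy illustration only -- NOT Juniper's real algorithm. An L2 src/dst-keyed
# LACP hash picks the same member link for every frame between the same two
# MACs, no matter what the payload is. MAC addresses below are made up.

def pick_member_link(src_mac: str, dst_mac: str, num_links: int = 2) -> int:
    """Simplified hash: XOR the last octet of each MAC, modulo link count."""
    src_last = int(src_mac.split(":")[-1], 16)
    dst_last = int(dst_mac.split(":")[-1], 16)
    return (src_last ^ dst_last) % num_links

collector_mac = "00:50:56:aa:bb:01"
switch_mac = "f8:c0:01:cc:dd:10"

# Same MAC pair -> same link, whatever rides inside the frame.
for payload in ("icmp echo", "snmp get (udp/161)", "ssh (tcp/22)"):
    print(payload, "-> member link", pick_member_link(collector_mac, switch_mac))
```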

    The next thought was that CoS queues are starving out icmp at some point between collector and switch. We rely on Juniper CoS default behavior; however, the end-to-end data path does traverse multiple different switch and router families, and the default treatment is not identical in each. So another interesting thread to pursue, but one with an even worse time/effort cost profile. On the other hand, possibly the biggest of many problems with this idea is that the end-to-end CoS treatment is static too (we haven't ripped and replaced hardware, and the respective Juniper OS versions haven't changed), so believing this is the cause requires me to accept that it is purely coincidental that the issue never occurred in the 5+ years and trillions of pings that took place up to and including collector v28.005.

    Which brings me full circle: a number of competing possibilities, each of which is plausible, but all of which come with a high price tag that I have to pay by myself. My Venmo balance and checkbook balance (don't laugh, I actually do still have a paper checkbook) are pretty good, but I'm not blowing them on this. This is not a rabbit hole, it's a wormhole; I'm going to have to tolerate that I have to go around it right now.

  • Anonymous

    Yeah, forget this problem and focus on why you are not alerting on CoS queue drops through LM...