Ping Datasource periodic failure after Collector upgrade: any ideas?

Question

Hi All

I have an open case on this, but since we haven't been able to solve it yet I am throwing this out there and hoping someone might take the time to read this and be willing to share an idea.

Details: collectors run on dedicated Windows VMs (2012 R2) in three 2-member Auto-Balanced Collector Groups (ABCGs), on top of VMware 6.7.0, on top of Nutanix AOS (TBD) on SuperMicro HW.

Last fall I upgraded both members of an ABCG from 28.005 to 29.001. Though the upgrade seemed successful at first, about 1-2 weeks later all of the (Juniper Networks) switches monitored by one of the collectors lit up my Alerts tab with total failure of the Ping datasource. By the time I had determined that all of the SNMP datasources continued to report data for each of these switches and that there were no obvious network issues, about an hour had gone by and the switches' Ping began to return to success. The same event occurred again within a few days, and after ruling out network or VMware issues I simply rebooted the server hosting the collector; when it came back the issue was resolved. I decided to roll back to collector 28.005 for the time being. The issue has not occurred a single time since rolling these back to 28.005, and it has never occurred with our third ABCG, which is also at 28.005 and also has a significant number of Juniper switches under monitoring.

About 6-8 weeks later, I tried to upgrade a different 2-member ABCG, this time from 28.005 to 29.003. Initially it was successful, and then the same issue started occurring again, only for the Juniper Networks switches the collector was monitoring. It happened many times, and after being unable to identify a cause on my own I opened a support case, which has now been open since January. Even with their help the cause hasn't been identified, and after an escalation we came up with a list of potential actions on my part, all of which are tedious and/or time-consuming and/or generally a big PITA. So I decided to take a sledgehammer approach: I went to our server team and requested all new Win 2016 servers (there seem to be some lingering memory issues with Win 2019 and Collectors, so the recommendation is to avoid Win 2019 for now), on which I installed brand new 30.000 collectors. It worked great for nearly a week and then guess what... yup, twice over this past weekend the issue occurred: 2 collectors in different ABCGs, and this time the duration was 3-4 hours instead of 60-90 minutes.

Here's what I can tell you:

all collectors and monitored switches are on common campus infrastructure, every device in the data plane is under our control on our fiber and/or copper cabling (no WAN links are crossed)

there is no firewall enforcement between collectors and the switches involved

there is no ACL enforcement on the switches involved that would produce this result

there is no Windows firewall enforcement on the servers hosting the collectors

SNMP between the collector and the switches continues to work properly while the collector can't successfully ping the switches

independent monitoring sourced in the same subnet as the collectors and destined for the same switches does not fail when the collectors can't ping the switches

some of the switches have continuous ping destined for the collector which does not fail when the collector can't ping that switch

the collectors successfully ping the remote router interface that serves as the gateway address for a switch even when the collector is not able to ping the switch itself

during the time the issue is in progress, launching a ping to an affected switch from the command prompt on the relevant collector also times out

none of the switches (or intermediate routers) log any events of interest at the start of the time the collector can't ping a switch, during the occurrence of the event, or at the resumption of successful ping from collector to switch

the server team reports no log events of interest (Nutanix, VMware, or server OS) at the start of the time the collector can't ping a switch, during the occurrence of the event, or at the resumption of successful ping from collector to switch

Anybody have any ideas, no matter how basic or simple?

Answer

Problem domain isolation: when the problem happens, if you are in a terminal window on the collector server (not the collector debug console), you can't ping the target device? If that's true, then you have eliminated one problem domain: the LogicMonitor Collector component. If the server has the issue regardless of what is installed on it, the cause isn't what's installed on it. I know it'd be a PITA, but could you stand up a collector server and not install the collector to see if you still experience the issue? What about a tracert while the problem is happening?
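One way to get a tracert at the exact moment the problem starts is to script the check rather than wait for an alert. The following is only a rough sketch, assuming Python 3 is available on the collector server; `check_host` and its helpers are hypothetical names, and the `TTL=` test is a heuristic for the usual English-language ping output rather than anything LogicMonitor provides.

```python
import platform
import subprocess
from datetime import datetime


def ping_cmd(host, windows):
    # One echo request with a 2-second timeout.
    if windows:
        return ["ping", "-n", "1", "-w", "2000", host]
    return ["ping", "-c", "1", "-W", "2", host]


def trace_cmd(host, windows):
    # -d (Windows) and -n (Linux) skip reverse DNS lookups.
    if windows:
        return ["tracert", "-d", host]
    return ["traceroute", "-n", host]


def check_host(host, run=subprocess.run, windows=None):
    """Ping once; if the ping fails, capture a traceroute right away."""
    if windows is None:
        windows = platform.system() == "Windows"
    stamp = datetime.now().isoformat(timespec="seconds")
    ping = run(ping_cmd(host, windows), capture_output=True, text=True)
    # Windows ping can exit 0 on "Destination host unreachable",
    # so also require the "TTL=" field that echo replies include.
    ping_ok = ping.returncode == 0 and "ttl=" in ping.stdout.lower()
    trace = None
    if not ping_ok:
        result = run(trace_cmd(host, windows), capture_output=True, text=True)
        trace = result.stdout
    return {"time": stamp, "host": host, "ping_ok": ping_ok, "trace": trace}
```

Calling `check_host("192.0.2.10")` on a schedule (the address is a documentation placeholder for an affected switch) and saving the returned dictionary whenever `ping_ok` is false would show how far the trace gets while the ping is failing.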

michael_dieter · Answer

Hey Stuart

Thanks. Yes, I know. When the Ping datasource is reporting the issue and I open a command prompt on the relevant collector, a manually launched ping and traceroute both fail. That is a pretty clear indicator of where the problem lies.

But another way to eliminate problem domains is to look at which variable has changed, and in this situation the compute and storage HW didn't change, the hypervisor didn't change, the host OS didn't change, and the network connectivity (physical and logical) didn't change. The only thing in the stack that changed is the collector version, and when I undo that change by rolling the collector version back, the issue goes away.

So yes, it seems the issue is in our environment, but the triggering event is a stack interaction that occurs differently (or not at all) before 29.001.

It's hugely frustrating, and moving forward means stepping into a giant time-sucking rabbit hole, trying to isolate variables to ID that interaction. On the other hand, I have considered a second sledgehammer of simply eliminating use of the Ping datasource, but everybody reading this could probably come up with several reasons why that might not be a good approach. So hopefully someone will post an idea that narrows the search to something achievable at a relatively low cost. I'm giving it a few days, but that sledgehammer is leaning against the wall right over there...

Answer

15 hours ago, Stuart Weenig said:

I know it'd be a PITA, but could you standup a collector server and not install the collector to see if you still experience the issue?

michael_dieter · Answer

I don't have the authority or access to stand up servers on my own, but what I can/will do is use one of the W2012 servers supporting the original collectors:

uninstall collector

remove collector from monitoring

reboot server

launch multiple continuous pings to the relevant switches

wait, watch, and note the ping status when the Ping datasource for those switches, monitored by the new collector, fails

This should be in place by the end of the day as soon as I finish migrating all resources into the new ABCGs composed of new W2016 servers/collectors.
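The continuous-ping step in the plan above could also be scripted so that every up/down change gets a timestamp to line up against the Ping datasource alerts. This is a minimal sketch under assumptions, not a tested tool: it assumes Python 3 on the server, the function names (`ping_once`, `update_state`, `watch`) are made up for illustration, and the `TTL=` check is a heuristic for typical ping output.

```python
import subprocess
import time
from datetime import datetime


def ping_once(host, windows=True):
    # One echo request with a 2-second timeout.
    if windows:
        cmd = ["ping", "-n", "1", "-w", "2000", host]
    else:
        cmd = ["ping", "-c", "1", "-W", "2", host]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # Require "TTL=" because Windows ping can exit 0 on "unreachable" replies.
    return result.returncode == 0 and "ttl=" in result.stdout.lower()


def update_state(state, host, ok, now):
    """Return a log line when a host's status is new or has changed, else None."""
    previous = state.get(host)
    state[host] = ok
    if previous == ok:
        return None
    status = "UP" if ok else "DOWN"
    return f"{now.isoformat(timespec='seconds')} {host} {status}"


def watch(hosts, interval=5.0, ping=ping_once, rounds=None, log=print):
    """Ping every host each round and log only the status transitions."""
    state = {}
    done = 0
    while rounds is None or done < rounds:
        for host in hosts:
            line = update_state(state, host, ping(host), datetime.now())
            if line:
                log(line)
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

Running `watch(["192.0.2.10", "192.0.2.11"])` (placeholder addresses) would record each host's initial status and then print one line per transition, which is easier to compare with alert times than scrolling several continuous ping windows.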

michael_dieter · Answer

OK kids. Good news and bad news to report.

GOOD: All devices are migrated to new W2016 servers running freshly installed 30.000 collectors. In addition, one of the old collectors has been deleted from our portal and confirmed uninstalled from the server hosting it. That server is now running multiple continuous pings to a range of switches that have been (and continue to be) involved in this issue, and these pings continue uninterrupted even when the datasource running on the new collector is reporting 100% loss.

BAD: the issue continues to occur, without a pattern I can determine and without any thus-far identifiable causal or associated events that trigger or clear the issue.

GOOD: since the issue continues, I'm getting ample opportunity to collect situational, contextual information.

BAD: I've burned several hours on this already this morning and I haven't yet been able to turn any of this information into a narrowed list of prospective solutions that seem worth further exploration.

???
