Ping Datasource periodic failure after Collector upgrade: any ideas?
Hi All
I have an open case on this, but since we haven't been able to solve it yet I am throwing this out there and hoping someone might take the time to read this and be willing to share an idea.
Details: collectors run on dedicated Windows VMs (2012R2) in three 2-member ABCGs, on top of VMware 6.7.0, on top of Nutanix AOS (TBD) on Supermicro HW.
Last fall I upgraded both members of an ABCG from 28.005 to 29.001. The upgrade seemed successful at first, but about 1-2 weeks later all of the (Juniper Networks) switches monitored by one of the collectors lit up my Alerts tab with total failure of the Ping datasource. By the time I had determined that all of the SNMP datasources were still reporting data for each of these switches and that there were no obvious network issues, about an hour had gone by and the switches' Ping began to return to success. The same event occurred again within a few days. After ruling out network or VMware issues, I simply rebooted the server hosting the collector, and when it came back the issue was resolved. I decided to roll back to collector 28.005 for the time being. The issue has not occurred a single time since rolling these back to 28.005, and it has never occurred with our third ABCG, which is also at 28.005 and also has a significant number of Juniper switches under monitoring.
About 6-8 weeks later, I tried to upgrade a different 2-member ABCG, this time from 28.005 to 29.003. Initially it was successful, and then the same issue started occurring again, only for the Juniper Networks switches the collector was monitoring. It happened many times, and after being unable to identify a cause on my own I opened a support case, which has now been open since January. Even with their help the cause has not been identified, and after an escalation we came up with a list of potential actions on my part, all of which are tedious and/or time-consuming and/or generally a big PITA. So I decided to take a sledgehammer approach: I went to our server team and requested all new Win 2016 servers (there seem to be some lingering memory issues with Win 2019 and Collectors, so the recommendation is to avoid Win 2019 for now), on which I installed brand new 30.000 collectors. That worked great for nearly a week and then guess what... yup, twice over this past weekend the issue occurred, on 2 collectors in different ABCGs, and this time the duration was 3-4 hours instead of 60-90 minutes.
Here's what I can tell you:
- all collectors and monitored switches are on common campus infrastructure, every device in the data plane is under our control on our fiber and/or copper cabling (no WAN links are crossed)
- there is no firewall enforcement between collectors and the switches involved
- there is no ACL enforcement on the switches involved that would produce this result
- there is no Windows firewall enforcement on the servers hosting the collectors
- SNMP between the collector and the switches continues to work properly while the collector can't successfully ping the switches
- independent monitoring sourced in the same subnet as the collectors and destined for the same switches does not fail when the collectors can't ping the switches
- some of the switches have continuous ping destined for the collector which does not fail when the collector can't ping that switch
- the collectors successfully ping the remote router interface that serves as the gateway address for a switch even when the collector is not able to ping the switch itself
- during the time the issue is in progress, launching a ping to an affected switch from the command prompt on the relevant collector also times out
- none of the switches (or intermediate routers) log any events of interest at the start of the time the collector can't ping a switch, during the occurrence of the event, or at the resumption of successful ping from collector to switch
- the server team reports no log events of interest (Nutanix, VMware, or server OS) at the start of the time the collector can't ping a switch, during the occurrence of the event, or at the resumption of successful ping from collector to switch
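To get exact timestamps to line up against the switch, router, and server logs, something like the following could be left running on a collector alongside the Collector service. This is only a rough sketch and not anything provided by the Collector itself: the function names, the log file name, the 30-second interval, and the optional snapshot command are all my own assumptions. It pings each target once per interval and writes a line only when a target changes between UP and DOWN.

```python
import platform
import subprocess
import time
from datetime import datetime


def build_ping_command(host, windows):
    """Return a one-shot ping command with a 1-second timeout."""
    if windows:
        # Windows ping: -n is the count, -w is the timeout in milliseconds.
        return ["ping", "-n", "1", "-w", "1000", host]
    # Linux ping: -c is the count, -W is the timeout in seconds.
    return ["ping", "-c", "1", "-W", "1", host]


def ping_succeeded(returncode, output, windows):
    """Decide whether a single ping received an echo reply.

    Windows ping can exit with 0 on "Destination host unreachable",
    so on Windows we look for "TTL=" in the output instead.
    """
    if windows:
        return "TTL=" in output
    return returncode == 0


def probe(host, windows):
    """Ping a host once and return True if it answered."""
    result = subprocess.run(
        build_ping_command(host, windows), capture_output=True, text=True
    )
    return ping_succeeded(result.returncode, result.stdout, windows)


def watch(hosts, interval_seconds=30, log_path="ping_watch.log",
          snapshot_command=None):
    """Loop forever, logging each UP/DOWN transition with a timestamp.

    snapshot_command (hypothetical, optional) is any local command whose
    output should be saved at the moment a host goes DOWN.
    """
    windows = platform.system() == "Windows"
    last_state = {}
    while True:
        for host in hosts:
            ok = probe(host, windows)
            if last_state.get(host) != ok:
                stamp = datetime.now().isoformat(timespec="seconds")
                with open(log_path, "a") as log:
                    log.write(f"{stamp} {host} {'UP' if ok else 'DOWN'}\n")
                    if not ok and snapshot_command:
                        snap = subprocess.run(
                            snapshot_command, capture_output=True, text=True
                        )
                        log.write(snap.stdout)
                last_state[host] = ok
        time.sleep(interval_seconds)
```

For example, calling watch(["192.0.2.10", "192.0.2.1"], snapshot_command=["route", "print"]) with one affected switch and its gateway as targets (the addresses here are placeholders) would record when each one flips state and save the collector's routing table at the moment a target goes DOWN. Which command is worth snapshotting is a guess on my part.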
Does anybody have any ideas, no matter how basic or simple?