Forum Discussion
Hi friends, it's been a few weeks, so here's an update:
- the issue continues to occur, though not within the last week or so, and the most recent occurrences have involved only a handful of switches instead of the usual 20+. I have nothing else to corroborate this with, so I am just chalking it up to the randomness of whatever is driving this
- the issue is replicated across all 3 of our 2-member ABCGs
- again, when the issue occurs there is no corresponding loss of ping between one of the old servers (collector uninstalled) and an impacted switch
- I have discovered, though, that when the issue occurs the collector responsible for the affected switch/es cannot ping nearly* all of our switches, even those switches that are owned by a different ABCG. I say "nearly*" b/c the handful of switches that reside in our data center/s (which are the closest in terms of layer 1, layer 2 switching and layer 3 hops) seem to remain ping-responsive even while the issue is occurring everywhere else. This point seems likely to be a clue to keep in mind.
- I've scrolled through successful and failing packet captures taken right on the collectors without being able to identify a difference
- when the issue is in progress, the TCP UDP Stats datasource on affected switches does not miss a beat and all SNMP datasources continue uninterrupted. This seems likely to be an important point too, tending to support the idea that we don't have a "routing/switching end-to-end" connectivity problem. It doesn't say anything, though, about how individual routers or switches along the path may be treating ICMP specifically, or under what circumstances that treatment could differ.
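To put numbers on the "data center switches still answer, everything else doesn't" observation, something like the following could be run on a collector while the issue is in progress. It is only a rough sketch: the addresses and the site labels are made up, `ping_once` assumes the Linux iputils flags (`-c`, `-W` in seconds) or the Windows flags (`-n`, `-w` in milliseconds), and it has nothing to do with how the collector's own Ping datasource works internally.

```python
import platform
import subprocess
from collections import defaultdict
from datetime import datetime, timezone


def ping_once(host, timeout_s=2):
    """Send one ICMP echo with the OS ping command; True if it exits with 0.

    Assumption: Linux iputils or Windows ping flags. Other platforms differ.
    """
    if platform.system() == "Windows":
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), host]
    else:
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def summarize_by_site(results, site_of):
    """results: {host: bool}, site_of: {host: site label} -> {site: (ok, total)}."""
    counts = defaultdict(lambda: [0, 0])
    for host, ok in results.items():
        site = site_of.get(host, "unknown")
        counts[site][1] += 1
        if ok:
            counts[site][0] += 1
    return {site: tuple(c) for site, c in counts.items()}


def sweep(site_of, ping=ping_once):
    """Ping every host once and return (UTC timestamp, per-site summary)."""
    results = {host: ping(host) for host in site_of}
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return stamp, summarize_by_site(results, site_of)


if __name__ == "__main__":
    # Hypothetical inventory: replace with real switch addresses and locations.
    inventory = {"10.0.0.1": "datacenter", "10.1.0.1": "remote", "10.1.0.2": "remote"}
    print(sweep(inventory))
```

Logging one line per sweep from each collector (and from one of the old servers) would give timestamps that can be lined up against the alert history.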
A next step would probably be capturing packets wire-by-wire/device-by-device/hop-by-hop in between collector/s and switch/es to try to identify where the pings actually disappear. That's a significant investment of time and effort, made more difficult by the fact that it's not a configuration that can be left in place indefinitely while waiting for the issue to occur. Even then, success on that front might not expose why this never happened before collector 29.001.
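If those hop-by-hop captures ever do get taken, the comparison itself is simple. Here is a minimal sketch, assuming each capture has already been reduced to the set of (ICMP identifier, sequence number) pairs of the echo requests seen at that point; the hop names are hypothetical, and the extraction from the capture files is left out.

```python
def first_hop_missing(hops, seen):
    """Find where each echo request sent by the collector first goes missing.

    hops: capture point names ordered from collector toward the switch.
    seen: {hop name: set of (icmp_id, icmp_seq)} observed at that point.
    Returns {(icmp_id, icmp_seq): first hop where the request was absent}.
    Requests seen at every hop are not included in the result.
    """
    if not hops:
        return {}
    sent = seen.get(hops[0], set())
    lost_at = {}
    for key in sorted(sent):
        for hop in hops[1:]:
            if key not in seen.get(hop, set()):
                lost_at[key] = hop
                break
    return lost_at


if __name__ == "__main__":
    # Hypothetical capture points and data, for illustration only.
    hops = ["collector", "core", "access"]
    seen = {
        "collector": {(1, 1), (1, 2), (1, 3)},
        "core": {(1, 1), (1, 2)},
        "access": {(1, 1)},
    }
    print(first_hop_missing(hops, seen))
```

This only follows the requests outbound. The same function run over the echo replies, with the hop list reversed, would cover the return path.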
So the Ping datasource in our RG:SWITCHES/* will remain in SDT for the rest of this week, after which I think I'll apply the NoPing property to the group and move on. Maybe one day all those socks that have gone missing in the laundry will turn up, scientists will pinpoint the origin of SARS-CoV-2, and a light bulb will go off in my head with a solution to this problem.