Hello I am getting errors:“VIPEIEEMP01 is suffering ping loss. 100.0% of pings are not returning, placing the host into critical state.”“The host VIPEIEEMP01 is down. No data has been received” If I do a ping from the command prompt of the Client to the Collector it worksThe firewall is wide opened!!!What did I miss?Thanks,Dom

@mnagel For us it did resolve it. We had consistent collectors and devices that this happened with. After adjusting the threadpools, those collectors and devices didn’t show the issue.

Yes I think the collector are overloaded as all of them for the datacenter implied here are over the Load Balancing settings which would mean there is no more balancing anymore !!!Auto balanced collector group (ABCG) does not mean load balancing. It is overload redistribution. Example: Two collectors with rebalance threshold of 10,000. One collector has 9,999, the other has 5. No rebalancing happens in this case.When/if the first collector gets more than 10,000, the device with the highest number of instances will be reassigned to a different collector. If that brings the count back below 10,000, rebalancing stops. Example: Two collectors with rebalance threshold of 10,000. One collector has 10,001, the other has 5. Largest device on the first collector has 200 instances. Rebalancing happens. The first collector ends up with 9,801, the other has 205.ABGC != Load Balancing.

We have observered similar issues with Ping failing via the collector application but not from the OS it self. A restart would resolve the issue for awhile. What we found was adjusting the threadpool for Ping to fix our problem. It seems the collector would max out of threads for ping and then just start failing. After adjusting the threadpool count we didn’t encounter this issue again. Are you certain that fixed the issue? Because that would require a collector restart and that is what fixes the problem since it forces generation of new ID values for ICMP and SNMP “sessions”.This problem has been going on for years and LM seems to have no plan to fix it. We routinely lose hours of data due to intermediate firewall session invalidation and I’ve seen only a glimmer of interest from folks at LM. The collector code needs to be updated to generate new ICMP and SNMP “sessions” for each check, or at least do so periodically (e.g., every 5-10 minutes) so this stops happening.

I am pleased to announce that LM has (after nearly 5 years of back-and-forth -- my first attempt to get this addressed was in June 2018) finally has fixed both the SNMP and ping issues impacted by intermediate firewall session invalidation -- update from support last week:Our development team has acknowledged the issues you outlined with Ping. Currently the behavior is to have cached sessions for ICMP ping and then reuse them, only refreshing the cache on sbproxy restart. An alternative has been in development and will be fixed in the next EA release. Similar issues with SNMP have been addressed already in EA 34.100.Hopefully this is actually the case, but if so it will be very nice to tell our clients this longtime bug has finally been quashed.

On 12/12/2022 at 12:23 PM, mnagel said:I would not hold my breath. I pushed my CSM at the time on this issue back in 2018 and they refused to consider it a bug, but a feature request. I also brought that up recently with my current CSM and I got crickets. I dutifully followed through, but since the feature request "system" is nearly worthless, nothing has been done. I cannot begin to enumerate the number of embarrassing conversations with clients that start like "Why is LogicMonitor alarming about SNMP being down or a host being down when we can get to those devices just fine?" The workaround is time-intensive (manual collector restarts) and the repeated data loss is unforgivable. I don't know what possessed the developers to generate a fixed SNMP session ID or ICMP ID once when the collector starts rather than at each new get/walk or ping. It is the ultimate in false optimization and causes unending problems for any but the smallest simplest network. LM should be ashamed of letting this continuing to happen.I've pushed our CSM a good deal on infrequent SNMP failures, and I was able to hop on a Zoom call with their Devs and walk them through the exact behavior we were seeing (I keyed on this by virtue of alerts for SNMP host down, despite 'Poll Now' clearly showing a response), and I recently received some tacit confirmation that this 'bug' was, indeed, acknowledged. I wasn't aware of the exact cause of the problem, but was able to clearly demonstrate to them that this presents as a bug, clearly not a 'feature' request.Obviously, no indication of timeline for fix, but this seemed like something that finally landed -- Hopefully this will lead to a fundamental fix. I'll be sure to highlight this thread to our CSM as well, in case that might help at all.

Ping Failing from collector to device and back?

40 Replies

Joe_Williams
Professor
3 years ago
@mnagel For us it did resolve it. We had consistent collectors and devices that this happened with. After adjusting the threadpools, those collectors and devices didn’t show the issue.
Anonymous
3 years ago
Yes I think the collector are overloaded as all of them for the datacenter implied here are over the Load Balancing settings which would mean there is no more balancing anymore !!!
Auto balanced collector group (ABCG) does not mean load balancing. It is overload redistribution.
Example: Two collectors with rebalance threshold of 10,000. One collector has 9,999, the other has 5. No rebalancing happens in this case.
When/if the first collector gets more than 10,000, the device with the highest number of instances will be reassigned to a different collector. If that brings the count back below 10,000, rebalancing stops.
Example: Two collectors with rebalance threshold of 10,000. One collector has 10,001, the other has 5. Largest device on the first collector has 200 instances. Rebalancing happens. The first collector ends up with 9,801, the other has 205.
ABGC != Load Balancing.
mnagel
Professor
3 years ago

We have observered similar issues with Ping failing via the collector application but not from the OS it self. A restart would resolve the issue for awhile. What we found was adjusting the threadpool for Ping to fix our problem. It seems the collector would max out of threads for ping and then just start failing. After adjusting the threadpool count we didn’t encounter this issue again.

Are you certain that fixed the issue? Because that would require a collector restart and that is what fixes the problem since it forces generation of new ID values for ICMP and SNMP “sessions”.
This problem has been going on for years and LM seems to have no plan to fix it. We routinely lose hours of data due to intermediate firewall session invalidation and I’ve seen only a glimmer of interest from folks at LM. The collector code needs to be updated to generate new ICMP and SNMP “sessions” for each check, or at least do so periodically (e.g., every 5-10 minutes) so this stops happening.
mnagel
Professor
3 years ago
I am pleased to announce that LM has (after nearly 5 years of back-and-forth -- my first attempt to get this addressed was in June 2018) finally has fixed both the SNMP and ping issues impacted by intermediate firewall session invalidation -- update from support last week:
Our development team has acknowledged the issues you outlined with Ping. Currently the behavior is to have cached sessions for ICMP ping and then reuse them, only refreshing the cache on sbproxy restart. An alternative has been in development and will be fixed in the next EA release. Similar issues with SNMP have been addressed already in EA 34.100.
Hopefully this is actually the case, but if so it will be very nice to tell our clients this longtime bug has finally been quashed.
AustinC
Neophyte
3 years ago
On 12/12/2022 at 12:23 PM, mnagel said:
I would not hold my breath. I pushed my CSM at the time on this issue back in 2018 and they refused to consider it a bug, but a feature request. I also brought that up recently with my current CSM and I got crickets. I dutifully followed through, but since the feature request "system" is nearly worthless, nothing has been done. I cannot begin to enumerate the number of embarrassing conversations with clients that start like "Why is LogicMonitor alarming about SNMP being down or a host being down when we can get to those devices just fine?" The workaround is time-intensive (manual collector restarts) and the repeated data loss is unforgivable. I don't know what possessed the developers to generate a fixed SNMP session ID or ICMP ID once when the collector starts rather than at each new get/walk or ping. It is the ultimate in false optimization and causes unending problems for any but the smallest simplest network. LM should be ashamed of letting this continuing to happen.
I've pushed our CSM a good deal on infrequent SNMP failures, and I was able to hop on a Zoom call with their Devs and walk them through the exact behavior we were seeing (I keyed on this by virtue of alerts for SNMP host down, despite 'Poll Now' clearly showing a response), and I recently received some tacit confirmation that this 'bug' was, indeed, acknowledged. I wasn't aware of the exact cause of the problem, but was able to clearly demonstrate to them that this presents as a bug, clearly not a 'feature' request.
Obviously, no indication of timeline for fix, but this seemed like something that finally landed -- Hopefully this will lead to a fundamental fix. I'll be sure to highlight this thread to our CSM as well, in case that might help at all.
Michael_Dieter
Neophyte
3 years ago
This behavior seems like it may be related to, or overlap with, an issue I first observed in January 2020 (I think) and was never able to resolve.
Randomly, subsets of our Juniper switches (and only switches, no other devices) would trip alerts indicating 100% ping loss. It would usually auto-resolve after 60-90 minutes and never left evidence behind--that I could find--why the condition started or cleared up.
During the time the alerts were in effect, I had other non-collector sources of pings to the same switches that were not disrupted and I could ping back to the collectors involved from the switch command line. SSH, SNMP, other communication between collectors and switches showed no problem.
Of note, none of the traffic between collectors and switches traversed a firewall.
The real kicker to me was that I had never seen this behavior until I upgraded collectors to 29.003. If I rolled collectors back to 28.x, the issue did not occur. As soon as I pushed forward again to 29.x it started happening again. I opened a case with support and I spent a lot of tedious time trying to figure out where traffic was getting dropped to no avail; after several months I was not able to convince them to move from their “its something in your environment” stance. As much as I wanted an answer I simply could not afford to devote the time needed to sustain an investigation.
Ultimately I applied system.category “NoPing” to switches and moved on.
MaddyM
3 years ago
@Joe Williams
what size were your collectors and what did you end up adjusting the threadpool too?
I’m new to LM and have been working on implementing it within my company the past few months. I’ve been facing this issue since December and have been going back and forth with LM Support. First, they had me do a fresh install, then they said there was a bug in v 33.x for the collectors and I needed to downgrade and the last step was to completely delete the collectors from the portal and add them again. At this point LM support claims they have done everything they can and have escalated to an internal ticket but I haven’t been able to receive any updates on the status of that ticket. I’m willing to make adjusts me to the threadpools/timeouts if needed to see if this resolved the issue for me.
mnagel
Professor
4 years ago
6 minutes ago, Dominique said:
Hello I am getting errors:“VIPEIEEMP01 is suffering ping loss. 100.0% of pings are not returning, placing the host into critical state.”
“The host VIPEIEEMP01 is down. No data has been received”

If I do a ping from the command prompt of the Client to the Collector it works
The firewall is wide opened!!!
What did I miss?
Thanks,
Dom
This is an old bug. I tried (and keep trying) to get it fixed for years, but no luck so far. The problem is that any firewall keeps a session table. ping is not session-based like TCP, but firewalls still keep track of ICMP ID and use timers to invalidate sessions. Same for UDP (SNMP). The LM collector code is "lazy" and does not generate new session-equivalents for successive checks, so eventually the traffic is dropped by the firewall because it is matched to an invalid session. You can workaround this by restarting the collector. For SNMP they have recently added some knobs in the collector config to help, but for ICMP it is still messed up.
Mike_Moniz
Professor
4 years ago
Interesting, I haven't run into that yet myself. I typically have collectors located on the same network segment as the devices being monitored so not hitting firewalls, but some situations do go thru some. Is this specific to Windows or Linux Collectors?
mnagel
Professor
4 years ago

2 minutes ago, Mike Moniz said:

Interesting, I haven't run into that yet myself. I typically have collectors located on the same network segment as the devices being monitored so not hitting firewalls, but some situations do go thru some. Is this specific to Windows or Linux Collectors?

The world is passing LM by there -- most organizations these days are moving to internal compartmentalization, which means firewalls of some sort. We have generally seen it for smaller remote sites that have no desire or facilities for a local collector. I don't think it matters what the collector platform is since the checks are all the same (Java/Groovy). I tried a lot with our CSM back in 2018 and while they agreed in principal, it was considered a "feature request" and well, we know where those generally end up. I was also treated to circular logic (determined to be a problem with Palo Alto firewalls only, which is completely untrue, and evidence was an LM support page saying that Palo Altos have that problem).