Hello,

I am getting these errors:

“VIPEIEEMP01 is suffering ping loss. 100.0% of pings are not returning, placing the host into critical state.”

“The host VIPEIEEMP01 is down. No data has been received”

If I ping from the command prompt of the client to the collector, it works. The firewall is wide open!

What did I miss?

Thanks,
Dom

@mnagel For us it did resolve it. We had consistent collectors and devices that this happened with. After adjusting the threadpools, those collectors and devices didn’t show the issue.

Yes, I think the collectors are overloaded, as all of them for the datacenter involved here are over the load balancing settings, which would mean there is no balancing anymore!

Auto balanced collector group (ABCG) does not mean load balancing. It is overload redistribution.

Example: Two collectors with a rebalance threshold of 10,000. One collector has 9,999 instances, the other has 5. No rebalancing happens in this case.

When/if the first collector gets more than 10,000, the device with the highest number of instances will be reassigned to a different collector. If that brings the count back below 10,000, rebalancing stops.

Example: Two collectors with a rebalance threshold of 10,000. One collector has 10,001 instances, the other has 5. The largest device on the first collector has 200 instances. Rebalancing happens. The first collector ends up with 9,801, the other has 205.

ABCG != Load Balancing.
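The two examples above can be checked with a small sketch. This is only an illustration of the arithmetic as described in this post, not LogicMonitor's actual algorithm: the function name `rebalance_once`, the dictionaries, and the choice of the least-loaded collector as the target are all assumptions made for the example.

```python
def rebalance_once(counts, largest_device, threshold=10_000):
    """Return instance counts after moving the largest device off each
    collector that is over the threshold (hypothetical model)."""
    counts = dict(counts)  # work on a copy so the caller's data is untouched
    for name in list(counts):
        if counts[name] > threshold:
            # Assumption: the device goes to the least-loaded other collector.
            target = min((c for c in counts if c != name), key=counts.get)
            moved = largest_device[name]
            counts[name] -= moved
            counts[target] += moved
    return counts


# 9,999 is not over the threshold, so nothing moves despite the imbalance.
print(rebalance_once({"A": 9_999, "B": 5}, {"A": 200, "B": 5}))
# 10,001 is over the threshold, so the 200-instance device moves to B.
print(rebalance_once({"A": 10_001, "B": 5}, {"A": 200, "B": 5}))
```

The first call leaves a 9,999 versus 5 split untouched, which is the point being made: the threshold only relieves overload, it does not even out the load.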

We have observed similar issues with ping failing via the collector application but not from the OS itself. A restart would resolve the issue for a while. What we found was that adjusting the threadpool for ping fixed our problem. It seems the collector would max out its threads for ping and then just start failing. After adjusting the threadpool count we didn’t encounter this issue again.

Are you certain that fixed the issue? Because that would require a collector restart, and that is what fixes the problem since it forces generation of new ID values for ICMP and SNMP “sessions”.

This problem has been going on for years and LM seems to have no plan to fix it. We routinely lose hours of data due to intermediate firewall session invalidation and I’ve seen only a glimmer of interest from folks at LM. The collector code needs to be updated to generate new ICMP and SNMP “sessions” for each check, or at least do so periodically (e.g., every 5-10 minutes) so this stops happening.

I am pleased to announce that LM has (after nearly 5 years of back-and-forth -- my first attempt to get this addressed was in June 2018) finally fixed both the SNMP and ping issues caused by intermediate firewall session invalidation. Update from support last week:

Our development team has acknowledged the issues you outlined with Ping. Currently the behavior is to have cached sessions for ICMP ping and then reuse them, only refreshing the cache on sbproxy restart. An alternative has been in development and will be fixed in the next EA release. Similar issues with SNMP have been addressed already in EA 34.100.

Hopefully this is actually the case; if so, it will be very nice to tell our clients this longtime bug has finally been quashed.

On 12/12/2022 at 12:23 PM, mnagel said:

I would not hold my breath. I pushed my CSM at the time on this issue back in 2018 and they refused to consider it a bug, but a feature request. I also brought that up recently with my current CSM and I got crickets. I dutifully followed through, but since the feature request "system" is nearly worthless, nothing has been done. I cannot begin to enumerate the number of embarrassing conversations with clients that start like "Why is LogicMonitor alarming about SNMP being down or a host being down when we can get to those devices just fine?" The workaround is time-intensive (manual collector restarts) and the repeated data loss is unforgivable. I don't know what possessed the developers to generate a fixed SNMP session ID or ICMP ID once when the collector starts rather than at each new get/walk or ping. It is the ultimate in false optimization and causes unending problems for any but the smallest, simplest network. LM should be ashamed of letting this continue to happen.

I've pushed our CSM a good deal on infrequent SNMP failures, and I was able to hop on a Zoom call with their Devs and walk them through the exact behavior we were seeing (I keyed on this by virtue of alerts for SNMP host down, despite 'Poll Now' clearly showing a response), and I recently received some tacit confirmation that this 'bug' was, indeed, acknowledged. I wasn't aware of the exact cause of the problem, but was able to clearly demonstrate to them that this presents as a bug, clearly not a 'feature' request.

Obviously, no indication of timeline for a fix, but this seemed like something that finally landed -- hopefully this will lead to a fundamental fix. I'll be sure to highlight this thread to our CSM as well, in case that might help at all.

Ping Failing from collector to device and back?

40 Replies

Anonymous
4 years ago

1 hour ago, Mike Moniz said:

have collectors located on the same network segment as the devices being monitored so not hitting firewalls

This is the recommended architecture.

1 hour ago, mnagel said:

most organizations these days are moving to internal compartmentalization, which means firewalls of some sort

If your firewalls are blocking legitimate business traffic, they need to not do that.
mnagel
Professor
4 years ago

5 minutes ago, Stuart Weenig said:

This is the recommended architecture.

If your firewalls are blocking legitimate business traffic, they need to not do that.

Folks are not going to place collectors in every subnet and due to increased security concerns, there will be more and more situations where this will be an issue.

As far as "blocking legitimate traffic" that is not what is happening here (OP specifically said the firewall was wide open). It is allowing the traffic, but firewalls track sessions and due to bad programming, LM triggers firewalls to block traffic in some cases. For example, we had a remote location (all WAN sites transit firewalls, a very common architecture) that had suffered a power outage. Pings began failing because LM reuses the same ICMP ID forever and the original session established previously was no longer valid.

As I mentioned, I escalated this to our CSM in 2018 and got back "I get it, but you need to open a feature request". Since then, someone in LM has at least figured out this needs support for SNMP -- this is what we were provided and it works fairly well (that seems to only target SNMPv3, but we tend to use that when possible so it is OK).

By default, the collector does not change the SNMP library session until a collector restart, which is why that resolves the issue. You may be able to work around this by adjusting the following fields in the collector debug.

snmp.shareThreads.impl.v3.switchport.enable=true
snmp.shareThreads.impl.v3.initialCheckDelay.minutes=3

We get frequent annoyed tickets from clients who are told by LM that a host is not responding to ping, which is trivially proved wrong by them. Our only solution when it happens is to restart the collector. You know, rather than LM fixing broken code.
mnagel
Professor
4 years ago

19 minutes ago, Stuart Weenig said:

I get and agree that LM is causing the case, but if the firewall is eventually blocking the traffic, the firewall is blocking the legitimate traffic (that looks illegitimate). Do you know the FR number?

It is no longer legitimate traffic after the "session" is invalid -- this is common firewall behavior that would be avoided by using a fresh ICMP ID for each ping check, like everyone else does.

FR number? When did that start being a thing? I thought you just create them in the forum and cross your fingers someone sees them. If feedback, then no, those are usually one-way -- I know tickets are generated internally because one of our CSMs actually shared them with me to help prioritize, but usually they are invisible with no followup (with one exception historically for API issues when Sarah Terry was there). The ticket ID for the last time I tried to get help on this is 107847 (last updated 7/23/2018). The SNMP info above came from a more recent interaction by someone else on my team -- not sure of the ticket ID on that one.
Mike_Rodrigues
Product Manager
4 years ago
Collector PM here. Quick glance at tickets suggests this ICMP firewall issue realization never made it to the collector team. I do recall the similar issue with SNMP.

I'll bring this up with the collector team tomorrow night. While best not to ping through a FW, if all we have to do to fix it is randomize the ICMP ID, that seems like a reasonable ask. I'm finding old TS tickets that appear to be this same issue, whether it was recognized at the time or not.

Sarah Terry is still here, by the way, she is Senior Director of Product these days.
Mike_Moniz
Professor
4 years ago
Try running a ping from the collector server to "VIPEIEEMP01" and see if that both resolves to an IP address and if ping works. Was this working at some point then stopped working, or did it never work?
Anonymous
4 years ago

20 minutes ago, mnagel said:

triggers firewalls to block traffic in some cases

I get and agree that LM is causing the case, but if the firewall is eventually blocking the traffic, the firewall is blocking the legitimate traffic (that looks illegitimate). Do you know the FR number?
Anonymous
4 years ago
We're arguing semantics now, so I'll bow out.

As far as FR numbers, if you only put it here in the community, it likely didn't get entered into the system, so product didn't even know about it. If you spoke to your CSM about it, they would have put it into the system and good CSMs keep track of those entries. If you did it through the feedback system in the product, it would have made it into the system, but your CSM might not have seen it. Granted, the FR system needs a major overhaul. It's one of the big focuses of our upcoming focus on community (including a new hosting platform).
mnagel
Professor
4 years ago

9 minutes ago, mnagel said:

It is no longer legitimate traffic after the "session" is invalid -- this is common firewall behavior that would be avoided by using a fresh ICMP ID for each ping check, like everyone else does.

FR number? When did that start being a thing? I thought you just create them in the forum and cross your fingers someone sees them. If feedback, then no, those are usually one-way -- I know tickets are generated internally because one of our CSMs actually shared them with me to help prioritize, but usually they are invisible with no followup (with one exception historically for API issues when Sarah Terry was there). The ticket ID for the last time I tried to get help on this is 107847 (last updated 7/23/2018). The SNMP info above came from a more recent interaction by someone else on my team -- not sure of the ticket ID on that one.

I have been told "Other ticket numbers for the ping/SNMP issue are 286866 and the latest - 337366." Generally we are told to open a FR each time.
mnagel
Professor
4 years ago

2 minutes ago, Stuart Weenig said:

We're arguing semantics now, so I'll bow out.

As far as FR numbers, if you only put it here in the community, it likely didn't get entered into the system, so product didn't even know about it. If you spoke to your CSM about it, they would have put it into the system and good CSMs keep track of those entries. If you did it through the feedback system in the product, it would have made it into the system, but your CSM might not have seen it. Granted, the FR system needs a major overhaul. It's one of the big focuses of our upcoming focus on community (including a new hosting platform).

You can say it is semantics, but ICMP is connectionless and generally firewalls need to do inspection to identify sessions. For ICMP, it is the ICMP ID that, together with the src and dst addresses, lets a firewall allow an echo reply after seeing the outgoing echo. An echo reply that does not match will be dropped, since an unsolicited ICMP packet should not just be sent to targets. Because LM has this bug where it uses the same ICMP ID for all ping checks, it trips firewalls that do inspection. If you argue folks should not use firewalls internally, that is battling windmills -- it is very common and getting more so to limit lateral attacks. That each of our tickets has generated zero understanding and a punt to open a feature request is just sad. I am not going to get into the general abilities of our successive CSMs, but if you look at my first referenced ticket you will see the way these things tend to end up.
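A toy model may make the mechanism easier to follow. The sketch below is a deliberately simplified simulation, not the behavior of any particular firewall product: the class `ToyStatefulFirewall`, its "active"/"invalid" states, and the `invalidate_all` step (standing in for something like the power outage mentioned earlier) are all assumptions for illustration.

```python
class ToyStatefulFirewall:
    """Tracks ICMP 'sessions' keyed by (src, dst, icmp_id). Hypothetical model."""

    def __init__(self):
        self.sessions = {}  # (src, dst, icmp_id) -> "active" or "invalid"

    def send_echo(self, src, dst, icmp_id):
        # An unseen key creates a new active session; a stale key keeps its state.
        state = self.sessions.setdefault((src, dst, icmp_id), "active")
        return state == "active"

    def receive_reply(self, src, dst, icmp_id):
        # The reply is only let through if it matches an active session.
        return self.sessions.get((src, dst, icmp_id)) == "active"

    def invalidate_all(self):
        # Stand-in for whatever event makes existing sessions invalid.
        for key in self.sessions:
            self.sessions[key] = "invalid"


def ping(fw, src, dst, icmp_id):
    """A ping succeeds only if both the echo and its reply pass the firewall."""
    return fw.send_echo(src, dst, icmp_id) and fw.receive_reply(src, dst, icmp_id)


fw = ToyStatefulFirewall()
print(ping(fw, "collector", "device", icmp_id=1))  # True: new session
fw.invalidate_all()
print(ping(fw, "collector", "device", icmp_id=1))  # False: reused ID hits the stale session
print(ping(fw, "collector", "device", icmp_id=2))  # True: fresh ID starts a new session
```

In this model the device is reachable the whole time; only the reused identifier keeps the check failing, which matches the symptom of a host reported down while a manual ping works.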
Anonymous
4 years ago
How about this: you come to Elevate in June and we'll get some fluffy sumo wrestling suits on and duke it out? haha. We'll make one of the sales guys expense it. We'll collapse in the end and realize we're agreeing with each other.
