Ping Failing from collector to device and back?


Userlevel 4
Badge +7
Hello, I am getting these errors: “VIPEIEEMP01 is suffering ping loss. 100.0% of pings are not returning, placing the host into critical state.”
“The host VIPEIEEMP01 is down. No data has been received.”
 

If I run a ping from the command prompt of the client to the collector, it works.

The firewall is wide open!

What did I miss?

Thanks,

Dom


40 replies

Userlevel 6
Badge +11

Try running a ping from the collector server to "VIPEIEEMP01" and see whether the name resolves to an IP address and whether the ping works. Was this working at some point and then stopped, or has it never worked?

 

Userlevel 7
Badge +11
6 minutes ago, Dominique said:
Hello, I am getting these errors: “VIPEIEEMP01 is suffering ping loss. 100.0% of pings are not returning, placing the host into critical state.”
“The host VIPEIEEMP01 is down. No data has been received.”
 

If I run a ping from the command prompt of the client to the collector, it works.

The firewall is wide open!

What did I miss?

Thanks,

Dom

This is an old bug.  I have tried (and keep trying) to get it fixed for years, but no luck so far.  The problem is that any stateful firewall keeps a session table.  Ping is not session-based like TCP, but firewalls still track the ICMP ID and use timers to invalidate sessions; the same goes for UDP (SNMP).  The LM collector code is "lazy" and does not generate new session equivalents for successive checks, so eventually the traffic is dropped by the firewall because it is matched against an invalidated session.  You can work around this by restarting the collector.  For SNMP they have recently added some knobs in the collector config to help, but for ICMP it is still broken.
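
To make the distinction concrete, here is a minimal sketch of what a firewall-friendly ping check looks like -- this is my own illustration in Python with scapy (requires scapy installed and raw-socket privileges), not LM's actual collector code, which is Java/Groovy and not something I can see. The point is simply to pick a fresh ICMP identifier for every check instead of reusing one for the life of the process:

import random
from scapy.all import IP, ICMP, sr1  # raw sockets usually require root

def ping_once(host, timeout=2):
    # A fresh identifier per check means the firewall sees a new echo/echo-reply pair
    # each time, instead of matching replies against a long-stale "session".
    ident = random.randint(1, 0xFFFF)
    reply = sr1(IP(dst=host) / ICMP(id=ident, seq=1), timeout=timeout, verbose=0)
    return reply is not None

Reusing the same identifier for every check, which is the behavior described above, is exactly what lets a stale session entry on an intermediate firewall swallow the replies.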

Userlevel 6
Badge +11

Interesting, I haven't run into that myself yet. I typically have collectors located on the same network segment as the devices being monitored so not hitting firewalls, but some traffic does go through one. Is this specific to Windows or Linux collectors?

Userlevel 7
Badge +11
2 minutes ago, Mike Moniz said:

Interesting, I haven't run into that myself yet. I typically have collectors located on the same network segment as the devices being monitored so not hitting firewalls, but some traffic does go through one. Is this specific to Windows or Linux collectors?

The world is passing LM by there -- most organizations these days are moving to internal compartmentalization, which means firewalls of some sort.  We have generally seen it at smaller remote sites that have no desire for, or facilities to host, a local collector.  I don't think the collector platform matters, since the checks are all the same (Java/Groovy).  I pushed hard on this with our CSM back in 2018, and while they agreed in principle, it was considered a "feature request" and, well, we know where those generally end up. I was also treated to circular logic: the issue was determined to be a problem with Palo Alto firewalls only (which is completely untrue), with an LM support page saying that Palo Altos have that problem offered as the evidence.

1 hour ago, Mike Moniz said:

have collectors located on the same network segment as the devices being monitored so not hitting firewalls

This is the recommended architecture.

1 hour ago, mnagel said:

most organizations these days are moving to internal compartmentalization, which means firewalls of some sort

If your firewalls are blocking legitimate business traffic, they need to not do that.

Userlevel 7
Badge +11
5 minutes ago, Stuart Weenig said:

This is the recommended architecture.

If your firewalls are blocking legitimate business traffic, they need to not do that.

Folks are not going to place collectors in every subnet, and with increasing security concerns there will be more and more situations where this is an issue.

As far as "blocking legitimate traffic" goes, that is not what is happening here (the OP specifically said the firewall is wide open).  The firewall is allowing the traffic, but firewalls track sessions, and due to bad programming LM triggers firewalls to drop traffic in some cases.  For example, we had a remote location (all WAN sites transit firewalls, a very common architecture) that suffered a power outage.  Pings began failing because LM reuses the same ICMP ID forever, and the session established previously was no longer valid.

As I mentioned, I escalated this to our CSM in 2018 and got back "I get it, but you need to open a feature request".  Since then, someone at LM has at least figured out that this needs to be addressed for SNMP -- the following is what we were provided, and it works fairly well (it seems to only target SNMPv3, but we tend to use that when possible, so it is OK).

By default, the collector does not change the SNMP library session until a collector restart, which is why that resolves the issue. You may be able to work around this by adjusting the following fields in the collector debug.

snmp.shareThreads.impl.v3.switchport.enable=true
snmp.shareThreads.impl.v3.initialCheckDelay.minutes=3
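
For anyone wanting to try these, settings like this normally live in the collector's agent.conf (the same file shown on the Collector Configuration page in the portal); the snippet below is just the support-provided lines again, with my reading of them as comments -- that reading is an assumption, so confirm the exact behavior and property names with support for your collector version:

# assumed agent.conf placement; property names exactly as provided by LM support
# per the explanation above, intended to let the collector refresh SNMPv3 sessions without a full restart
snmp.shareThreads.impl.v3.switchport.enable=true
# per the explanation above, delay in minutes before the first check against a refreshed session
snmp.shareThreads.impl.v3.initialCheckDelay.minutes=3

Note that agent.conf changes only take effect after the collector restarts.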

 

We get frequent annoyed tickets from clients who are told by LM that a host is not responding to ping, which they can trivially prove wrong.  Our only solution when it happens is to restart the collector.  You know, rather than LM fixing broken code.

20 minutes ago, mnagel said:

triggers firewalls to block traffic in some cases

I get it, and I agree that LM is causing the problem, but if the firewall is eventually blocking the traffic, it is the firewall that is blocking legitimate traffic (traffic that merely looks illegitimate). Do you know the FR number?

Userlevel 7
Badge +11
19 minutes ago, Stuart Weenig said:

I get it, and I agree that LM is causing the problem, but if the firewall is eventually blocking the traffic, it is the firewall that is blocking legitimate traffic (traffic that merely looks illegitimate). Do you know the FR number?

It is no longer legitimate traffic after the "session" is invalidated -- this is common firewall behavior, and it would be avoided by using a fresh ICMP ID for each ping check, like everyone else does.

FR number?  When did that start being a thing?  I thought you just create them in the forum and cross your fingers that someone sees them.  If you mean the in-product feedback system, then no -- those are usually one-way.  I know tickets are generated internally, because one of our CSMs actually shared them with me to help prioritize, but usually they are invisible with no follow-up (with one historical exception, for API issues when Sarah Terry was there). The ticket ID from the last time I tried to get help on this is 107847 (last updated 7/23/2018).  The SNMP info above came from a more recent interaction by someone else on my team -- not sure of the ticket ID on that one.

We're arguing semantics now, so I'll bow out.

As far as FR numbers go, if you only put it here in the community, it likely never got entered into the system, so product didn't even know about it. If you spoke to your CSM about it, they would have put it into the system, and good CSMs keep track of those entries. If you did it through the feedback system in the product, it would have made it into the system, but your CSM might not have seen it. Granted, the FR system needs a major overhaul. It's one of the big items in our upcoming focus on community (including a new hosting platform).

Userlevel 7
Badge +11
9 minutes ago, mnagel said:

It is no longer legitimate traffic after the "session" is invalidated -- this is common firewall behavior, and it would be avoided by using a fresh ICMP ID for each ping check, like everyone else does.

FR number?  When did that start being a thing?  I thought you just create them in the forum and cross your fingers that someone sees them.  If you mean the in-product feedback system, then no -- those are usually one-way.  I know tickets are generated internally, because one of our CSMs actually shared them with me to help prioritize, but usually they are invisible with no follow-up (with one historical exception, for API issues when Sarah Terry was there). The ticket ID from the last time I tried to get help on this is 107847 (last updated 7/23/2018).  The SNMP info above came from a more recent interaction by someone else on my team -- not sure of the ticket ID on that one.

I have been told "Other ticket numbers for the ping/SNMP issue are 286866 and the latest - 337366."  Generally we are told to open a FR each time.

Userlevel 7
Badge +11
2 minutes ago, Stuart Weenig said:

We're arguing semantics now, so I'll bow out.

As far as FR numbers go, if you only put it here in the community, it likely never got entered into the system, so product didn't even know about it. If you spoke to your CSM about it, they would have put it into the system, and good CSMs keep track of those entries. If you did it through the feedback system in the product, it would have made it into the system, but your CSM might not have seen it. Granted, the FR system needs a major overhaul. It's one of the big items in our upcoming focus on community (including a new hosting platform).

You can say it is semantics, but ICMP is connectionless, and firewalls generally have to do inspection to identify sessions.  For ICMP that means the ICMP ID, which together with the source and destination addresses lets a firewall permit an echo reply after it has seen the outgoing echo request.  An echo reply that does not match is dropped, since an unsolicited ICMP packet should not just be sent to targets.  Because LM has this bug where it uses the same ICMP ID for all ping checks, it trips firewalls that do inspection. If you argue that folks should not use firewalls internally, that is tilting at windmills -- it is very common, and getting more so, to limit lateral attacks.  That each of our tickets has generated zero understanding and a punt to open a feature request is just sad.  I am not going to get into the general abilities of our successive CSMs, but if you look at the ticket I referenced first you will see how these things tend to end up.
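
For anyone who hasn't watched this from the firewall side, here is a deliberately oversimplified sketch of the failure mode as described in this thread -- my own toy illustration in Python, not any vendor's implementation. The "session" key is essentially (source, destination, ICMP ID), and once the entry for that key has been invalidated (outage, failover, timeout), traffic that keeps matching it gets dropped:

# toy model of a stateful firewall's ICMP handling; illustration only
sessions = {}  # (src_ip, dst_ip, icmp_id) -> "active" or "invalid"

def echo_request(src_ip, dst_ip, icmp_id):
    key = (src_ip, dst_ip, icmp_id)
    if sessions.get(key) == "invalid":
        return "drop"            # matches a dead session, so the firewall discards it
    sessions[key] = "active"     # otherwise (re)establish the session
    return "forward"

def invalidate_all_sessions():
    # e.g. a power outage, failover, or idle timeout at the remote site
    for key in sessions:
        sessions[key] = "invalid"

Because the collector reuses one ICMP ID forever, every subsequent check keeps hitting the same invalidated entry; a fresh ID per check would simply open a new session and pings would recover on their own.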

How about this: you come to Elevate in June and we'll get some fluffy sumo wrestling suits on and duke it out? haha. We'll make one of the sales guys expense it. We'll collapse in the end and realize we're agreeing with each other.

Userlevel 5
Badge +4

Collector PM here. A quick glance at the tickets suggests this ICMP firewall issue never made it to the collector team. I do recall the similar issue with SNMP.

I'll bring this up with the collector team tomorrow night. While it's best not to ping through a firewall, if all we have to do to fix it is randomize the ICMP ID, that seems like a reasonable ask. I'm finding old TS tickets that appear to be this same issue, whether it was recognized as such at the time or not.


Sarah Terry is still here, by the way; she is Senior Director of Product these days.

Userlevel 7
Badge +11
41 minutes ago, Michael Rodrigues said:

Collector PM here. A quick glance at the tickets suggests this ICMP firewall issue never made it to the collector team. I do recall the similar issue with SNMP.

I'll bring this up with the collector team tomorrow night. While it's best not to ping through a firewall, if all we have to do to fix it is randomize the ICMP ID, that seems like a reasonable ask. I'm finding old TS tickets that appear to be this same issue, whether it was recognized as such at the time or not.


Sarah Terry is still here, by the way; she is Senior Director of Product these days.

 

Thank you for jumping in!  I did not have any idea where to acquire a Sumo suit :).

Userlevel 5
Badge +4

Still working on this, hoping to have a better update soon.

Userlevel 5
Badge +5
On 4/1/2022 at 3:18 PM, Mike Moniz said:

Interesting, I haven't run into that myself yet. I typically have collectors located on the same network segment as the devices being monitored so not hitting firewalls, but some traffic does go through one. Is this specific to Windows or Linux collectors?


We've already faced this internally with some customers too (where having one collector per subnet is not feasible due to size, licensing, etc.).
If I'm not mistaken, the OS is irrelevant to this issue, since it comes from the way LM currently works.

Userlevel 2
Badge +3

Any further updates here?

I'm also facing the same issue with a Palo Alto firewall.

Any updates on this?

Userlevel 7
Badge +11

I would not hold my breath.  I pushed my CSM at the time on this issue back in 2018 and they refused to treat it as a bug, only as a feature request.  I also brought it up recently with my current CSM and got crickets.  I dutifully followed through, but since the feature request "system" is nearly worthless, nothing has been done.  I cannot begin to enumerate the embarrassing conversations with clients that start like "Why is LogicMonitor alarming about SNMP being down or a host being down when we can get to those devices just fine?"  The workaround is time-intensive (manual collector restarts) and the repeated data loss is unforgivable.  I don't know what possessed the developers to generate a fixed SNMP session ID or ICMP ID once when the collector starts rather than at each new get/walk or ping.  It is the ultimate false optimization and causes unending problems for all but the smallest, simplest networks.  LM should be ashamed of letting this continue.

Userlevel 3
Badge +6
On 12/12/2022 at 12:23 PM, mnagel said:

I would not hold my breath.  I pushed my CSM at the time on this issue back in 2018 and they refused to treat it as a bug, only as a feature request.  I also brought it up recently with my current CSM and got crickets.  I dutifully followed through, but since the feature request "system" is nearly worthless, nothing has been done.  I cannot begin to enumerate the embarrassing conversations with clients that start like "Why is LogicMonitor alarming about SNMP being down or a host being down when we can get to those devices just fine?"  The workaround is time-intensive (manual collector restarts) and the repeated data loss is unforgivable.  I don't know what possessed the developers to generate a fixed SNMP session ID or ICMP ID once when the collector starts rather than at each new get/walk or ping.  It is the ultimate false optimization and causes unending problems for all but the smallest, simplest networks.  LM should be ashamed of letting this continue.

I've pushed our CSM a good deal on infrequent SNMP failures, and I was able to hop on a Zoom call with their devs and walk them through the exact behavior we were seeing (I keyed in on this because of alerts for SNMP host down despite 'Poll Now' clearly showing a response), and I recently received some tacit confirmation that this 'bug' was, indeed, acknowledged. I wasn't aware of the exact cause of the problem, but I was able to clearly demonstrate to them that it presents as a bug, clearly not a 'feature' request.

Obviously there's no indication of a timeline for a fix, but this seemed like something that finally landed -- hopefully it will lead to a fundamental fix. I'll be sure to highlight this thread to our CSM as well, in case that helps.

Userlevel 6
Badge +11

It looks like I'm seeing the same issue now with some customers, where ping stops working until the collector is restarted. So add me to the list of affected people.

Userlevel 3
Badge +8

This behavior seems like it may be related to, or overlap with, an issue I first observed in January 2020 (I think) and was never able to resolve.

Randomly, subsets of our Juniper switches (and only switches, no other devices) would trip alerts indicating 100% ping loss.  It would usually auto-resolve after 60-90 minutes and never left behind any evidence (that I could find) of why the condition started or cleared up.

During the time the alerts were in effect, I had other non-collector sources of pings to the same switches that were not disrupted and I could ping back to the collectors involved from the switch command line. SSH, SNMP, other communication between collectors and switches showed no problem.

Of note, none of the traffic between collectors and switches traversed a firewall.

The real kicker to me was that I had never seen this behavior until I upgraded collectors to 29.003.  If I rolled collectors back to 28.x, the issue did not occur.  As soon as I pushed forward again to 29.x, it started happening again.   I opened a case with support and spent a lot of tedious time trying to figure out where the traffic was getting dropped, to no avail; after several months I was not able to convince them to move from their "it's something in your environment" stance.   As much as I wanted an answer, I simply could not afford to devote the time needed to sustain an investigation.

Ultimately I applied the system.category "NoPing" to the switches and moved on.
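
For anyone taking the same escape hatch: this works because the Ping DataSource's AppliesTo skips devices carrying that category. I'm quoting from memory, so check your own portal's Ping DataSource, but the expression is along these lines:

isDevice() && !hasCategory("NoPing")

Adding "NoPing" to a device's system.categories then simply takes it out of the DataSource's scope, which stops the ping alerts without touching anything else on the device.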

Userlevel 5
Badge +11

We have observed similar issues with ping failing via the collector application but not from the OS itself. A restart would resolve the issue for a while. What we found was that adjusting the thread pool for ping fixed our problem. It seems the collector would max out its ping threads and then just start failing. After adjusting the thread pool count we didn't encounter the issue again.

Userlevel 7
Badge +11

We have observed similar issues with ping failing via the collector application but not from the OS itself. A restart would resolve the issue for a while. What we found was that adjusting the thread pool for ping fixed our problem. It seems the collector would max out its ping threads and then just start failing. After adjusting the thread pool count we didn't encounter the issue again.

Are you certain that fixed the issue? Adjusting the thread pool requires a collector restart, and the restart is what fixes the problem, since it forces generation of new ID values for the ICMP and SNMP “sessions”.

This problem has been going on for years and LM seems to have no plan to fix it. We routinely lose hours of data due to intermediate firewall session invalidation and I’ve seen only a glimmer of interest from folks at LM. The collector code needs to be updated to generate new ICMP and SNMP “sessions” for each check, or at least do so periodically (e.g., every 5-10 minutes) so this stops happening.

Userlevel 5
Badge +11

@mnagel For us it did resolve it. We had specific collectors and devices where this happened consistently. After adjusting the thread pools, those collectors and devices no longer showed the issue.
