Ping Failing from collector to device and back?




40 replies

Userlevel 4
Badge +2

 @Joe Williams 

what size were your collectors, and what did you end up adjusting the threadpool to? 

 

I’m new to LM and have been working on implementing it within my company for the past few months. I’ve been facing this issue since December and have been going back and forth with LM Support. First, they had me do a fresh install; then they said there was a bug in v33.x of the collectors and I needed to downgrade; the last step was to completely delete the collectors from the portal and add them again. At this point LM Support claims they have done everything they can and have escalated to an internal ticket, but I haven’t been able to get any updates on the status of that ticket. I’m willing to make adjustments to the threadpools/timeouts if needed, to see if that resolves the issue for me.

Userlevel 7
Badge +11

I suppose it is possible there are two different issues -- a required threadpool change indicates that internal resources are exhausted for checks. The original issue is still our constant problem: static session ID values cause sessions to become invalidated by firewalls when there is a disruption on the firewall path. A new session must be observed by the firewall before it starts letting traffic through again, which (currently) can only be achieved by restarting the collector, since the code allocates those session ID values only once at startup. LM has been informed about this repeatedly, is aware of it, and does nothing to fix it. The only thing I’ve seen is a doc note blaming particular firewalls, but it impacts pretty much any stateful-inspection firewall. More and more folks use firewalls for internal segmentation, and there are many cases where a remote collector is needed due to a lack of resources to deploy a local collector.
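To make that failure mode concrete, here is a toy model (all names hypothetical; this is not LM code) of a collector that allocates its session ID once at startup behind a stateful-inspection firewall. Once a disruption flushes the firewall's session table, every subsequent poll with the same ID fails until the collector is restarted:

```python
import random

class StatefulFirewall:
    """Toy model: permits traffic only for session IDs it has seen open a session."""
    def __init__(self):
        self.allowed = set()

    def observe_new_session(self, session_id):
        self.allowed.add(session_id)

    def disrupt(self):
        # A path disruption (tunnel flap, HA failover) flushes the session table.
        self.allowed.clear()

    def permits(self, session_id):
        return session_id in self.allowed

class Collector:
    """Toy collector reproducing the described behavior: one session ID, allocated at startup."""
    def __init__(self, firewall):
        self.fw = firewall
        self.session_id = random.getrandbits(32)
        # The first packet opens the session through the firewall.
        self.fw.observe_new_session(self.session_id)

    def poll(self):
        return self.fw.permits(self.session_id)

fw = StatefulFirewall()
collector = Collector(fw)
assert collector.poll()      # works initially
fw.disrupt()
assert not collector.poll()  # same ID forever, so no new session is ever observed
```

Rotating the session ID after a failed poll or two would let the firewall see a fresh session and recover without a restart.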

Userlevel 6
Badge +13

 @Joe Williams 

what size were your collectors and what did you end up adjusting the threadpool too? 

For us it was a careful dance of watching the graphs and the debug commands. I can’t tell you what to adjust yours to as each collector is unique.

Userlevel 4
Badge +7

Interesting, I haven't run into that yet myself. I typically have collectors located on the same network segment as the devices being monitored, so they aren't hitting firewalls, but some traffic does go through them. Is this specific to Windows or Linux collectors?

Hello,

It is Windows Server 2019 Standard for the client VIPEIEEMP01 and Windows Server 2016 Standard for the “preferred” collector VIPLGCMON02. 

The client has an external IP, so it is attached to the closest datacenter for its collector.

Thanks,

Dom

Userlevel 4
Badge +7

 @Joe Williams 

what size were your collectors, and what did you end up adjusting the threadpool to? 

 

I’m new to LM and have been working on implementing it within my company for the past few months. I’ve been facing this issue since December and have been going back and forth with LM Support. First, they had me do a fresh install; then they said there was a bug in v33.x of the collectors and I needed to downgrade; the last step was to completely delete the collectors from the portal and add them again. At this point LM Support claims they have done everything they can and have escalated to an internal ticket, but I haven’t been able to get any updates on the status of that ticket. I’m willing to make adjustments to the threadpools/timeouts if needed, to see if that resolves the issue for me.

Hello,

Windows Server 2016 Standard

CPU: Quad-core 2.40 GHz Intel Xeon Silver 4214R

RAM: 16 GB

Disk: 75 GB

Collector Version: 32.003

Let me know what I should do to adjust the threadpool. 

Thanks,

Dom

Userlevel 4
Badge +7

I suppose it is possible there are two different issues -- a required threadpool change indicates that internal resources are exhausted for checks. The original issue is still our constant problem: static session ID values cause sessions to become invalidated by firewalls when there is a disruption on the firewall path. A new session must be observed by the firewall before it starts letting traffic through again, which (currently) can only be achieved by restarting the collector, since the code allocates those session ID values only once at startup. LM has been informed about this repeatedly, is aware of it, and does nothing to fix it. The only thing I’ve seen is a doc note blaming particular firewalls, but it impacts pretty much any stateful-inspection firewall. More and more folks use firewalls for internal segmentation, and there are many cases where a remote collector is needed due to a lack of resources to deploy a local collector.

Hello,

Yes, I think the collectors are overloaded, as all of the collectors for the datacenter involved here are over the load-balancing threshold, which would mean there is no more balancing happening at all!

Thanks,

Dom

Userlevel 4
Badge +7

Interesting, I haven't run into that yet myself. I typically have collectors located on the same network segment as the devices being monitored, so they aren't hitting firewalls, but some traffic does go through them. Is this specific to Windows or Linux collectors?

Hello,

I would have to check with the Linux Team as I have only Windows Server 2016 Collectors on my side.

Thanks,

Dom

Userlevel 4
Badge +7
1 hour ago, Mike Moniz said:

have collectors located on the same network segment as the devices being monitored so not hitting firewalls

This is the recommended architecture.

1 hour ago, mnagel said:

most organizations these days are moving to internal compartmentalization, which means firewalls of some sort

If your firewalls are blocking legitimate business traffic, they need to not do that.

Hello,

Yes, we are trying to have the collectors and their clients on the same segment. 

Yes, we have multiple firewalls throughout the network, not on the clients themselves… I will recheck them, as the issue does not seem to be widespread, even within a group of servers for a common application, which are typically all on the same subnet and behind the same firewalls.

Thanks,

Dom

Userlevel 6
Badge +13

Adjusting the threadpools isn’t something to be done lightly. We have a few standards around them in our deployments based on collector size, but each collector is its own thing. I am hesitant to say what we even do, as it probably isn’t the right thing for others to do.

I suggest adding the collector into monitoring and watching the collector graphs. Look at the queue depth, tasks failing, etc. And see where it makes sense to adjust things.

We have also noticed that even if a collector is memory-starved, most of the time it is better served by giving it more vCPUs if possible.

Userlevel 7
Badge +20

Yes, I think the collectors are overloaded, as all of the collectors for the datacenter involved here are over the load-balancing threshold, which would mean there is no more balancing happening at all!

An auto-balanced collector group (ABCG) does not mean load balancing. It is overload redistribution. 

Example: Two collectors with rebalance threshold of 10,000. One collector has 9,999, the other has 5. No rebalancing happens in this case.

When/if the first collector gets more than 10,000, the device with the highest number of instances will be reassigned to a different collector. If that brings the count back below 10,000, rebalancing stops. 

Example: Two collectors with rebalance threshold of 10,000. One collector has 10,001, the other has 5. Largest device on the first collector has 200 instances. Rebalancing happens. The first collector ends up with 9,801, the other has 205.
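Those two examples can be sketched as a small simulation (illustrative only, not LM's actual algorithm; the function name and data layout are invented):

```python
def abcg_rebalance(devices_per_collector, threshold=10000):
    """Illustrative sketch of ABCG-style overload redistribution.

    devices_per_collector: list of lists, each inner list holding the
    per-device instance counts on that collector. Returns the per-collector
    instance totals after redistribution.
    """
    cols = [list(d) for d in devices_per_collector]
    while True:
        totals = [sum(c) for c in cols]
        # Nothing happens unless some collector exceeds the threshold --
        # this is why it is not load balancing.
        over = [i for i, t in enumerate(totals) if t > threshold]
        if not over:
            return totals
        src = over[0]
        # Reassign the device with the highest instance count to the
        # least-loaded other collector.
        biggest = max(cols[src])
        cols[src].remove(biggest)
        others = [i for i in range(len(cols)) if i != src]
        dst = min(others, key=lambda i: sum(cols[i]))
        cols[dst].append(biggest)

# Example 1: 9,999 vs. 5 -- under the threshold, nothing moves.
print(abcg_rebalance([[9999], [5]]))  # [9999, 5]

# Example 2: 10,001 vs. 5, largest device on the first collector has 200 instances.
first = [200] + [98] * 100 + [1]      # totals 10,001 instances
print(abcg_rebalance([first, [5]]))   # [9801, 205]
```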

ABCG != load balancing.

Userlevel 7
Badge +11

I am pleased to announce that LM (after nearly 5 years of back-and-forth -- my first attempt to get this addressed was in June 2018) has finally fixed both the SNMP and ping issues impacted by intermediate firewall session invalidation -- update from support last week:

 

Our development team has acknowledged the issues you outlined with Ping. Currently the behavior is to have cached sessions for ICMP ping and then reuse them, only refreshing the cache on sbproxy restart. An alternative has been in development and will be fixed in the next EA release. Similar issues with SNMP have been addressed already in EA 34.100.

 

Hopefully this is actually the case, but if so it will be very nice to tell our clients this longtime bug has finally been quashed.

Userlevel 7
Badge +11

I am pleased to announce that LM (after nearly 5 years of back-and-forth -- my first attempt to get this addressed was in June 2018) has finally fixed both the SNMP and ping issues impacted by intermediate firewall session invalidation -- update from support last week:

 

Our development team has acknowledged the issues you outlined with Ping. Currently the behavior is to have cached sessions for ICMP ping and then reuse them, only refreshing the cache on sbproxy restart. An alternative has been in development and will be fixed in the next EA release. Similar issues with SNMP have been addressed already in EA 34.100.

 

Hopefully this is actually the case, but if so it will be very nice to tell our clients this longtime bug has finally been quashed.

 

So I’ve had some time now on EA 34.300 with one of our “problem children,” and I am saddened to report that the SNMP issues have not been addressed, at least not sufficiently. What I have observed during a spate of recent ISP disruptions while monitoring a remote site (via IPsec tunnel) is that LogicMonitor eventually seems to figure it out and will begin collecting data, but it takes roughly 2 hours. Having 2-hour gaps is better than indefinite gaps, but it is still unacceptable.

Userlevel 7
Badge +8

ICMP is a terrible way for the Collector to determine “Host Dead” status.  Better would be if any DataSource is able to collect data from it.  If data can be collected, the host isn’t dead.

Userlevel 7
Badge +11

ICMP is a terrible way for the Collector to determine “Host Dead” status.  Better would be if any DataSource is able to collect data from it.  If data can be collected, the host isn’t dead.

ICMP itself seems to be fine now, actually. The problem that persists is SNMP when an intermediate stateful-inspection engine (firewall) invalidates sessions. UDP is stateless, but SNMP uses a session ID that most modern firewalls recognize. Once the session ID is broken, LM stops working, since the developers chose to blindly use the same session ID indefinitely. My guess is that in the new collector code they periodically refresh the session ID so it eventually recovers, rather than triggering a new session after a failed poll or two. The right way is very often not the way these developers roll, sadly.
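The difference between the two strategies is easy to quantify (all numbers below are hypothetical; this is a back-of-the-envelope sketch, not LM code). A periodic refresh leaves a data gap of up to the full refresh period, while a failure-triggered refresh recovers within a couple of poll intervals:

```python
def recovery_gap(disruption_at, poll_interval, refresh_period=None,
                 fail_threshold=None):
    """Minutes of lost data after a firewall invalidates the session
    `disruption_at` minutes after the last session-ID refresh."""
    if refresh_period is not None:
        # Periodic refresh: data is lost until the next scheduled rotation.
        return refresh_period - disruption_at
    # Failure-triggered refresh: rotate the ID after N failed polls.
    return fail_threshold * poll_interval

# Session ID refreshed every 120 min; disruption 10 min after a refresh:
print(recovery_gap(10, poll_interval=2, refresh_period=120))  # 110 min gap

# Rotate after 2 failed polls instead:
print(recovery_gap(10, poll_interval=2, fail_threshold=2))    # 4 min gap
```

That roughly matches the ~2-hour recovery observed above, versus minutes with failure-triggered rotation.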

Userlevel 7
Badge +20

ICMP is a terrible way for the Collector to determine “Host Dead” status.  Better would be if any DataSource is able to collect data from it.  If data can be collected, the host isn’t dead.

That actually is close to how it works now. ICMP does reset the idleInterval datapoint (or whatever the internal flag is), which is what determines host down status. However, it’s not the only thing. Any datasource that can be trusted to actually get a reply from a device should reset the idleInterval datapoint. This includes any SNMP datasources, website/http datasources, etc. It does not include scripted datasources. The thinking there is that a scripted datasource might be contacting a 3rd party system to collect data and not actually getting an actual response from the device itself. So, anything that is guaranteed to return data from the device itself should reset the idle interval counter.
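A minimal sketch of that logic (function names, method labels, and thresholds are all invented here; these are not LM internals): only collection methods guaranteed to get a reply from the device itself reset the idle counter, and the host is declared dead once the counter crosses a threshold.

```python
# Methods whose replies must come from the device itself.
# Scripted datasources are excluded: they may be querying a 3rd-party system.
TRUSTED_METHODS = {"ping", "snmp", "http"}

def update_idle_interval(idle_seconds, method, got_data, poll_interval=60):
    """Return the new idle interval after one poll cycle."""
    if got_data and method in TRUSTED_METHODS:
        return 0  # direct reply from the device: reset the counter
    return idle_seconds + poll_interval  # scripted or failed polls don't reset

def host_dead(idle_seconds, dead_threshold=300):
    return idle_seconds >= dead_threshold

idle = 0
idle = update_idle_interval(idle, "script", True)  # 3rd-party data: no reset
idle = update_idle_interval(idle, "snmp", False)   # failed poll: no reset
print(host_dead(idle))                             # False (120 s < 300 s)
idle = update_idle_interval(idle, "snmp", True)    # real SNMP reply: reset
print(idle)                                        # 0
```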

The bigger feature request here is that customers need a way to modify/override the built-in criteria for considering a device down. For some people, pingability is enough. For others, it needs to be pingable and responding to some other query. Customers need the ability to determine (at the device, group, and global levels) what constitutes a device being down. For example, I would need to be able to say that ping has to be up, but also that x/y of these other datasources must also be returning data.
