Forum Discussion

mmatec01
Neophyte
3 years ago

Collectors and TCP Ephemeral ports exhaustion detection

Lately, we started experiencing a nasty issue with our collectors, whereby the collector runs out of all available ephemeral ports.  When that happens, all communication basically grinds to a halt, including DNS lookups, domain authentication, outbound WMI calls (meaning there will be NO collection for Windows servers), etc.  Usually when this happens I see hundreds of "Warn - XXXXXX System Uptime SystemUpTime No Data" alerts filling up my inbox, since I have configured No Data alerts.  The fix, albeit temporary, is to reboot the collector, which immediately reclaims the resources.
Now, while the question of why this is happening is very important, and something I am definitely looking into and researching both with and without vendor support, my question today is more practical.  What can be done to monitor and "forecast", if you will, that your collector is about to go dead because it has run out of TCP ports?
I mean, I can look at the TCP stats DataSource collection values and monitor Connections, TCP Failed Connections, and Segments per second.  All of these are fine to monitor, but they don't tell me something is about to happen, because their values skyrocket AFTER the fact.  Similarly, there are hundreds of metrics under the Collector DataSources, but I am at a loss as to which one(s) to look at and set alerts on.
Is there something, like running netstat, looking at the number of handles per process in Task Manager, or running some command whose output I can script and capture programmatically, that speaks to the issue at hand?
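(Not from the thread, but a minimal sketch of the kind of programmatic capture being asked about, in Python for illustration: feed it `netstat -an` output and count connections per TCP state, so a climbing TIME_WAIT count can be alerted on before the ports are fully exhausted. The parsing format and the threshold value are assumptions.)

```python
from collections import Counter


def tcp_state_counts(netstat_output: str) -> Counter:
    """Count TCP connections per state from `netstat -an`-style output."""
    counts = Counter()
    for line in netstat_output.splitlines():
        parts = line.split()
        # Windows netstat connection lines look like:
        #   TCP    10.0.0.5:49832    10.0.0.9:135    TIME_WAIT
        if len(parts) == 4 and parts[0] == "TCP":
            counts[parts[3]] += 1
    return counts


def time_wait_warning(counts: Counter, threshold: int = 2000) -> bool:
    """Illustrative early-warning check: too many sockets stuck in TIME_WAIT."""
    return counts.get("TIME_WAIT", 0) >= threshold


# On the collector itself you would capture live output instead, e.g.:
#   output = subprocess.run(["netstat", "-an"], capture_output=True, text=True).stdout
sample = """\
  TCP    10.0.0.5:49832    10.0.0.9:135     ESTABLISHED
  TCP    10.0.0.5:49833    10.0.0.9:445     TIME_WAIT
  TCP    10.0.0.5:49834    10.0.0.9:5985    TIME_WAIT
"""
print(tcp_state_counts(sample))
```

Polling this on a schedule and trending the TIME_WAIT count gives the leading indicator the raw DataSource metrics lack.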

  • I have also reduced sample frequency or turned off some of the noisier pieces of WMI: clusters, Hyper-V, MSMQ, and .NET.  All of these combined made the environments I've maintained much more reliable.  If you have a 15-minute response time expectation/SLA, you can generally shift to a 5-minute sample time rather than 1 or 3; this still allows 3 consecutive threshold crossings before producing an alert without impacting your response times.

  • I apologize for missing this when you first posted.  Here's the MS doc I found that helped alleviate this issue in all of the environments where I've deployed LM.  Windows defaults to a 3-minute timeout for TIME_WAIT rather than issuing an explicit final packet, which causes port exhaustion on many collectors.  I've ended up shifting the ephemeral range back to the XP values (nearly triple the current range) and shortening the timeout to 30s rather than 3m (just on the collector VM/server), allowing it to "fail faster".

    I also used "netstat -qan" to gather connection data and sliced it up using Select-String and -split to find the connection counts that fall within the ephemeral port range, which I gather using "netsh int ipv4 show dynamicport tcp".  Generally, anything over 10-15% utilization would start showing cracks in the data, as the transition from established to wait to closed allowed connections to back up.

    I also built a TCP Port Utilization DataSource to keep track of these and alert on issues as they start to arise.  That isn't publicly accessible.
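(Since that DataSource isn't public, here is a hedged sketch of the approach it describes, in Python rather than the Select-String/-split pipeline above, purely for illustration: parse the dynamic range from `netsh int ipv4 show dynamicport tcp`, count netstat local ports that fall inside it, and emit key=value lines of the kind a script DataSource can collect. The sample outputs and metric names are assumptions based on this thread, not the actual DataSource.)

```python
import re


def parse_dynamic_range(netsh_output: str) -> tuple[int, int]:
    """Extract (start_port, port_count) from `netsh int ipv4 show dynamicport tcp` output."""
    start = int(re.search(r"Start Port\s*:\s*(\d+)", netsh_output).group(1))
    num = int(re.search(r"Number of Ports\s*:\s*(\d+)", netsh_output).group(1))
    return start, num


def ephemeral_ports_in_use(netstat_output: str, start: int, num: int) -> int:
    """Count distinct local TCP ports that fall inside the dynamic (ephemeral) range."""
    used = set()
    for line in netstat_output.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[0] == "TCP":
            # rsplit handles both "10.0.0.5:49152" and bracketed IPv6 addresses.
            local_port = int(parts[1].rsplit(":", 1)[-1])
            if start <= local_port < start + num:
                used.add(local_port)
    return len(used)


netsh_sample = """\
Protocol tcp Dynamic Port Range
---------------------------------
Start Port      : 49152
Number of Ports : 16384
"""
netstat_sample = """\
  TCP    10.0.0.5:49152    10.0.0.9:135     TIME_WAIT
  TCP    10.0.0.5:49153    10.0.0.9:445     TIME_WAIT
  TCP    10.0.0.5:445      10.0.0.9:50000   ESTABLISHED
"""
start, num = parse_dynamic_range(netsh_sample)
used = ephemeral_ports_in_use(netstat_sample, start, num)
# Key=value lines a script DataSource could ingest; an alert threshold around
# the ~10-15% utilization mark mentioned above would give early warning.
print(f"EphemeralPortsUsed={used}")
print(f"EphemeralUtilizationPct={100.0 * used / num:.2f}")
```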