Anonymous
5 months ago
Collector v36
Think carefully before upgrading to v36 or above of the collector. We've experienced some stability issues.
The other day you mentioned the !tlist errors, which seem to be hit-or-miss; at least for me, they haven't shown up.
But I'm guessing you've found other things. Anything else to watch out for?
We're seeing gaps in device data for all devices on a single collector. They happen at random times and on random collectors (all v36 or above). When it happens, santaba thinks the collector is up, so it doesn't raise a collector down alert. But the devices being monitored have no data for up to 15 minutes, which triggers a host status::idleinterval alert for every single device on that collector.
Any particular DataSource? Or is it every DataSource at the same time during the nodata period?
No particular datasource. It appears to be a problem where the collector is sending its heartbeat to santaba but is not sending device metrics.
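For anyone who wants to check for this pattern themselves, here is a minimal sketch of the condition described above: the collector heartbeat is recent, but every device on it has stale data. It does not call any LogicMonitor API. The function name, the inputs, and both thresholds are hypothetical and only for illustration; you would have to supply the timestamps yourself from whatever export you have.

```python
from datetime import datetime, timedelta


def is_silent_collector(
    last_heartbeat,
    last_metric_by_device,
    now,
    heartbeat_max_age=timedelta(minutes=2),  # assumed threshold, not a product default
    metric_max_age=timedelta(minutes=5),     # assumed threshold, not a product default
):
    """Return True when the heartbeat is fresh but all devices have stale data.

    last_heartbeat: datetime of the collector's most recent heartbeat.
    last_metric_by_device: dict mapping device name to the datetime of its
        most recent datapoint (hypothetical structure).
    """
    heartbeat_fresh = (now - last_heartbeat) <= heartbeat_max_age
    # An empty collector is not treated as "silent"; there is nothing to miss.
    all_devices_stale = bool(last_metric_by_device) and all(
        (now - ts) > metric_max_age for ts in last_metric_by_device.values()
    )
    return heartbeat_fresh and all_devices_stale


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    devices = {
        "device-a": now - timedelta(minutes=10),
        "device-b": now - timedelta(minutes=12),
    }
    print(is_silent_collector(now - timedelta(minutes=1), devices, now))
```

Requiring every device to be stale, rather than any one of them, is meant to separate this collector-wide symptom from the ordinary per-device nodata hiccups.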
One follow-up out of curiosity: how frequently do these 15-minute gaps happen? Is it once or twice a day? Or every couple of hours?
On our portal, all our collectors are on 36.000 with a couple of exceptions, and I haven't noticed this behavior or seen any spikes in alert totals. I regularly see the usual Network Interfaces LogicModule nodata hiccups, even on collectors whose batchscript status is normal. But I haven't seen large nodata or idleinterval data gaps.
But I WILL keep an eye out.