Forum Discussion
I had similar issues onboarding a UniFi controller a while ago. This controller was managing over a thousand devices across a couple of hundred sites.
Turned out that the controller's MongoDB backend was too slow to respond because of the sheer amount of data the LM collector was querying (this was the MongoDB instance bundled with the UniFi installer, all running on the same machine). This led to random gaps and long delays in autodiscovery (and the same, or at least similar, errors when I tested the DataSources).
The way I found this was by using the CLI to make the same call, and it would just time out with no response from the controller. I repeated the process while watching the processes and logs directly on the controller, and noticed that MongoDB was struggling to return the requested dataset (due to its size).
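For anyone who wants to check the same thing, here's a minimal sketch of how I'd peek at the bundled MongoDB while the collector is polling. It assumes pymongo is installed and that the bundled instance is listening on the UniFi default port 27117 with no auth, so adjust the URI for your setup:

# Minimal sketch: list in-flight MongoDB operations while the LM collector polls.
# Assumes the UniFi-bundled mongod on its default port 27117 with no auth.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27117")

# currentOp returns the operations in progress; anything running for several
# seconds during an autodiscovery poll suggests the database is the bottleneck.
ops = client.admin.command("currentOp")
for op in ops.get("inprog", []):
    if op.get("secs_running", 0) >= 5:
        print(op.get("ns"), op.get("op"), op.get("secs_running"), "seconds")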
To fix this, I had to redeploy the UniFi controller to a new server, with a separate server running the MongoDB database. I did some research into MongoDB deployments and made sure I followed documented best practices around storage and filesystem types. Once the servers were deployed, I migrated the controller configuration over using the standard backup and restore process that UniFi provides.
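For reference, pointing the Network Application at an external MongoDB is done in the controller's system.properties file. The keys below are the commonly documented ones (hostnames and credentials are placeholders), so double-check them against your controller version before relying on this:

db.mongo.local=false
db.mongo.uri=mongodb://unifi:PASSWORD@mongo-host:27017/unifi
statdb.mongo.uri=mongodb://unifi:PASSWORD@mongo-host:27017/unifi_stat
unifi.db.name=unifi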
After that ordeal, the controller's overall performance significantly improved, and I no longer had any issues with the LM UniFi DataSources.
I recommend that you start looking at the controller's mongo instance to see if there are issues there. Not saying you're having the same issue, but I thought I would share my experience.
This issue took me a long time to solve, and I ended up writing up a deployment guide in my KB if you're interested.
I am interested in seeing that KB guide if you're willing to share. Great rundown of the issue. Reducing the DS sample frequencies may help alleviate some of the issue for now. We're an MSP, so we're monitoring and taking action on alerts. We can advise the client to change configurations/architecture, but not make any direct changes they don't approve.
I'm going to keep fiddling with the DataSources and their AppliesTo to see if I can reduce the amount of traffic... I've already done so by changing the AP one:
from:
hasCategory("Ubiquiti_Unifi") || (unifi.user && unifi.pass)
to:
hasCategory("Ubiquiti_Unifi") || (unifi.user && unifi.pass)
&& hasCategory("UbiquitiUnifiAP")
There are PropertySources tagging devices, but those properties aren't being used in the appliesTo to cut down on Active Discovery queries against devices that wouldn't respond to them correctly anyway, which causes timeouts rather than returns.
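For example, the switch DS could be narrowed the same way if a PropertySource is setting a category for the switches; the category name here is just a placeholder for whatever the PropertySource actually assigns:

(hasCategory("Ubiquiti_Unifi") || (unifi.user && unifi.pass))
&& hasCategory("UbiquitiUnifiSwitch")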
- Cole_McDonald, 26 days ago
Found a good reference for OIDs to help differentiate between device types... the AP PropertySource is already using this to do so: mibs.observium.org/mib/UBNT-MIB/
AP is looking for "1.3.6.1.4.1.41112.1.6.3.3.0"
I may clone the AP PropertySource and add a switch/case statement to find/select other device types to tag so they can be used in the appliesTo for the other PropertySources.
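Something like this is the shape of that idea, sketched in Python rather than the Groovy a real PropertySource would use. Only the AP OID comes from the existing module; the other probe OIDs would need to be filled in from the UBNT MIB, and the category names are placeholders:

# Rough sketch of the clone-and-extend idea: probe a few vendor OIDs and
# return a category for whichever one answers. Uses the net-snmp CLI for
# brevity; a real PropertySource would use the collector's SNMP helpers.
import subprocess

PROBES = {
    "1.3.6.1.4.1.41112.1.6.3.3.0": "UbiquitiUnifiAP",  # OID the AP PropertySource checks
    # "<switch OID from UBNT-MIB>": "UbiquitiUnifiSwitch",   # placeholder
    # "<gateway OID from UBNT-MIB>": "UbiquitiUnifiGateway", # placeholder
}

def category_for(host, community):
    for oid, category in PROBES.items():
        result = subprocess.run(
            ["snmpget", "-v2c", "-c", community, "-t", "2", "-r", "0", host, oid],
            capture_output=True, text=True,
        )
        # With SNMP v2c, a missing OID still exits 0 but prints "No Such ..."
        if result.returncode == 0 and "No Such" not in result.stdout:
            return category
    return None

print(category_for("192.0.2.10", "public"))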
- Cole_McDonald, 26 days ago
Findings... I can just reference system.sysoid in the appliesTo for the appropriate DSes:
- Switch: 1.3.6.1.4.1.4413
- Gateway: 1.3.6.1.4.1.8072.3.2.10
- Access Point: 1.3.6.1.4.1.41112
These are pulled in the initial SNMP grab, so they should be usable without any extra scripting.
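For the AP DS, that would look something like this (untested; if the devices report a longer sysoid you'd need a prefix match rather than straight equality):

system.sysoid == "1.3.6.1.4.1.41112" && unifi.user && unifi.pass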
- Cole_McDonald, 26 days ago
I've also found a note in the description of the DSs indicating that past a certain point, the API won't work for them:
If anyone on the module dev team is looking in here, is this on the roadmap to get updated?