Anybody else disabling Meraki instances by default?
If you are, I’d like to know if you’re experiencing the same problem we are, one that LM has tried and failed to troubleshoot:
For some seemingly random reason, a number of Meraki instances disappear from LM during one discovery cycle and reappear during the next. This normally isn’t a big problem since the instances aren’t set to delete for 30 days; they’d just show back up with a gap in data.
However, in our case we have a business need to have instances disabled on discovery (we charge the customer per Meraki device we actively monitor). This means that instances that have been discovered and enabled for monitoring randomly disappear and reappear as disabled instances. Any custom instance-level properties that were added to the instance are also missing from the rediscovered instance.
In the last 3 hours, there have been 3,577 instances (out of somewhere around 18,000) that disappeared and reappeared in this way. The problem was so pervasive that I had to write a script to loop through all instances and enable them based on a master billing sheet.
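For anyone curious, the re-enable script is essentially just a loop over the billing sheet hitting the LM REST API. Here's a rough Python sketch of the approach – the portal name, token, sheet columns, and the exact endpoint/field names (stopMonitoring etc.) are my assumptions, so verify them against your own portal before reusing it:

```python
import csv
import requests

# Assumptions (not from the original post): a LogicMonitor portal name, a bearer
# API token, and a billing sheet CSV with device_id, hds_id and instance_id
# columns identifying the instances that should stay enabled.
PORTAL = "yourcompany"          # hypothetical portal name
API_TOKEN = "lmb_..."           # hypothetical bearer token
BASE = f"https://{PORTAL}.logicmonitor.com/santaba/rest"
HEADERS = {"Authorization": f"Bearer {API_TOKEN}", "X-Version": "3"}

def enable_instance(device_id, hds_id, instance_id):
    """Re-enable monitoring on a single datasource instance.

    To my understanding the instance's stopMonitoring flag controls whether it
    is actively monitored, so PATCHing it to False re-enables the instance.
    """
    url = (f"{BASE}/device/devices/{device_id}"
           f"/devicedatasources/{hds_id}/instances/{instance_id}")
    resp = requests.patch(url, headers=HEADERS,
                          json={"stopMonitoring": False}, timeout=30)
    resp.raise_for_status()

# Walk the billing sheet and re-enable every instance listed in it.
with open("master_billing_sheet.csv", newline="") as fh:
    for row in csv.DictReader(fh):
        enable_instance(row["device_id"], row["hds_id"], row["instance_id"])
```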
We experience similar issues - Meraki devices disappear at random during active discovery phases.
We monitor a couple of Meraki organizations with LM. The issue only seems to appear for one org - the one with more than 30 monitored networks. All other orgs we monitor with LM have 10 or fewer networks.
So it seems like the issue is related to the number of networks within an org. I decided to take a closer look at the API requests against Meraki during an active discovery phase in which, once again, lots of devices disappeared from our portal.
Meraki offers some great endpoints for investigating API requests; I used the one documented here: https://developer.cisco.com/meraki/api-latest/#!list-the-api-requests-made-by-an-organization
I evaluated a 20-minute window during which I was pretty sure active discovery was running. During this window, LM (or rather its collector installed on one of our servers) made 2,000 requests. 700 of them (more than one third) failed with a 429 Too Many Requests error.
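If you want to reproduce the check yourself, something like the following Python sketch against that apiRequests endpoint is all I did – the API key, org ID and parameter values are placeholders, and note the endpoint paginates via Link headers, which I'm ignoring here:

```python
import collections
import requests

# Assumptions: a Meraki dashboard API key and organization ID; the endpoint is
# the one linked above (GET /organizations/{organizationId}/apiRequests).
API_KEY = "your-meraki-api-key"     # hypothetical
ORG_ID = "123456"                   # hypothetical
BASE = "https://api.meraki.com/api/v1"
HEADERS = {"X-Cisco-Meraki-API-Key": API_KEY}

# Pull the API request log for the last 20 minutes and tally the response codes.
resp = requests.get(
    f"{BASE}/organizations/{ORG_ID}/apiRequests",
    headers=HEADERS,
    params={"timespan": 20 * 60, "perPage": 1000},
    timeout=30,
)
resp.raise_for_status()

codes = collections.Counter(req["responseCode"] for req in resp.json())
print(codes)                                        # e.g. Counter({200: 1300, 429: 700})
print("429 ratio:", codes[429] / sum(codes.values()))
```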
So yes, there may be some exception handling / rate-limit checks built into the core modules, but if you hit an API that hard, requests can still block each other. It looks like there is an issue with rate-limit synchronization across discovery jobs – i.e. the more networks you monitor, the more discovery jobs run in parallel, and as a consequence the more 429 errors you get, ending in randomly disappearing Meraki instances (because the instance couldn't be found in the Meraki response body).
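For comparison, this is the kind of 429 handling I'd expect around the Meraki calls – back off for the Retry-After interval instead of treating the response as "instance not found". It's just a generic sketch of the idea, not a description of what the LM modules actually do:

```python
import time
import requests

def meraki_get(url, headers, params=None, max_retries=5):
    """GET with simple 429 handling.

    On a 429, sleep for the Retry-After interval Meraki sends back and try
    again, rather than letting the discovery result come back empty.
    """
    for attempt in range(max_retries):
        resp = requests.get(url, headers=headers, params=params, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # Meraki includes a Retry-After header on 429 responses.
        wait = int(resp.headers.get("Retry-After", 1))
        time.sleep(wait)
    raise RuntimeError(f"Still rate limited after {max_retries} attempts: {url}")
```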
We have had a case open since April regarding this issue. I also shared my assumption with LM, along with several hundred failed Meraki API requests. Let's wait and see.