Anybody else disabling Meraki instances by default?

If you are, I’d like to know if you’re experiencing the problem we are that LM has tried and failed to troubleshoot:

For some random reason, a number of Meraki instances disappear from LM during one discovery cycle and reappear during the next discovery cycle. This isn’t normally a big problem since the instances aren’t set to delete for 30 days. Normally, they’d just show back up and have a gap in data.

However, in our case we have a business need to have the instances disabled on discovery (we charge the customer per Meraki device we actively monitor). This means that instances that have been discovered and enabled for monitoring randomly disappear and reappear as disabled instances. Also, any customer instance-level properties that were added to the instance are not present on the rediscovered instance.

In the last 3 hours, there have been 3,577 instances (out of somewhere around 18,000) that disappeared and reappeared in this way. The problem was so pervasive that I had to write a script to loop through all instances and enable them based on a master billing sheet.

klamir
2 years ago
We experience similar issues - Meraki devices disappear at random during active discovery phases.
We monitor a couple of Meraki organizations with LM. The issue only seems to appear for one org - the one with more than 30 monitored networks. All other orgs we monitor with LM have 10 or fewer networks.
So it seems like the issue is related to the number of networks within an org. I decided to have a closer look at the API requests against Meraki during an active discovery phase where once again lots of devices disappeared from our portal.
Meraki offers some great endpoints for investigating API requests; I used the one documented here: https://developer.cisco.com/meraki/api-latest/#!list-the-api-requests-made-by-an-organization
I evaluated a 20-minute window during which I was pretty sure active discovery was running. During this window, LM (or rather its collector, installed on one of our servers) made 2000 requests. 700 of them (more than one third) failed with a 429 - too many requests error.
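For anyone who wants to repeat this check, here is a minimal sketch of the tallying step. It assumes you have already downloaded the request log from the endpoint linked above into a list of dicts, and that each entry carries a `responseCode` field; please verify the field name against the Meraki documentation. The sample data is made up to match the numbers in this post.

```python
from collections import Counter


def summarize_response_codes(api_requests):
    """Count logged API requests per HTTP response code.

    `api_requests` is assumed to be a list of dicts, each with a
    `responseCode` key (an assumption about the log format).
    """
    codes = Counter(entry["responseCode"] for entry in api_requests)
    total = sum(codes.values())
    throttled = codes.get(429, 0)
    # Guard against an empty window so we never divide by zero.
    share = throttled / total if total else 0.0
    return codes, throttled, share


# Hypothetical sample shaped like the 20-minute window described above.
sample = [{"responseCode": 200}] * 1300 + [{"responseCode": 429}] * 700
codes, throttled, share = summarize_response_codes(sample)
print(f"{throttled} of {sum(codes.values())} requests were 429s ({share:.0%})")
```

Running the same tally over several windows (with and without active discovery) should show whether the 429s line up with discovery runs.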
So yes, there might be some exception handling / rate limiting checks built into the core modules. But if you hit an API that hard, requests might be blocking each other. It seems like there is an issue with rate limit synchronization across discovery jobs; that is, the more networks you monitor, the more discovery jobs run in parallel and, as a consequence, the more 429 errors you get, ending in randomly disappearing Meraki instances (because the instance couldn't be found in the Meraki response body).
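To illustrate what I mean by synchronizing the rate limit across jobs, here is a rough sketch. This is not how the LM modules work (I have no insight into that); it only shows the idea of one limiter shared by all parallel jobs, plus a retry that fails loudly instead of returning a partial result. `SharedRateLimiter`, `call_with_retry`, and the shape of `request()` are all hypothetical names for illustration, and the requests-per-second budget is a placeholder you would set from Meraki's documented limit.

```python
import threading
import time


class SharedRateLimiter:
    """Space out calls so all threads together stay under max_per_second."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = clock()

    def acquire(self):
        # Reserve the next free time slot under the lock, then wait outside
        # the lock so other threads can reserve their own slots meanwhile.
        with self._lock:
            now = self._clock()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self._interval
        delay = slot - now
        if delay > 0:
            self._sleep(delay)
        return delay


def call_with_retry(request, limiter, max_attempts=5, sleep=time.sleep):
    """Call request() until it is not rate limited.

    request() is a hypothetical callable returning a tuple
    (status_code, retry_after_seconds_or_None, body).
    """
    for attempt in range(max_attempts):
        limiter.acquire()
        status, retry_after, body = request()
        if status != 429:
            return body
        # Prefer the server's hint; fall back to exponential backoff.
        sleep(retry_after if retry_after is not None else 2 ** attempt)
    # Fail the whole run rather than hand back an incomplete device list.
    raise RuntimeError("still rate limited after retries")
```

The important design choice is the last line: if a discovery job cannot get a complete answer, it should exit unsuccessfully so its output is ignored, instead of reporting a shorter list that looks like devices vanished.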
We have had a case open since April regarding this issue. I also shared my assumption with LM, including several hundred failed Meraki API requests. Let's now wait and see.

11 Replies

mray
LM Conqueror
2 years ago
@klamir -- Can you please share with me (in a private DM) your support ticket number? I want to make sure we have full scope over which customers are currently experiencing this.
@Stuart Weenig -- I just picked up your follow-up on our previous case. We still have an active internal ticket with our Software Engineering team regarding this, so please allow me some time to see where we are and what we can do at this time.
I do want to emphasize for all who may be reading this post -- our core (LM created) Meraki DataSources are configured with auto-delete instances disabled (“Automatically Delete Instance”). Any change to this setting is considered custom and therefore unsupported by our Support and SWE teams. At this time, this is not a supported feature for these modules.
That said, as mentioned above, our SWE team is actively investigating potential improvements to our suite of Meraki LogicModules.
Anonymous
2 years ago
Making the script was a bear, but it's only 264 lines. I might have made it simpler by doing PUTs instead of PATCHes, but I didn’t want to PUT for all ~1200 Meraki devices. Instead, I pull the whole list from LM, compare it with my system of record, and only do PATCHes for the ones that are different.
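The compare step boils down to something like the sketch below. This is not the actual 264-line script, just the core idea. The record shape (`id` and `enabled` keys) is a hypothetical, normalized form; you would map it from whatever the LM API returns and turn each planned change into one PATCH call.

```python
def plan_patches(lm_instances, billing_enabled_ids):
    """Return {instance_id: desired_enabled} for instances that differ.

    lm_instances: list of dicts with hypothetical keys "id" and "enabled".
    billing_enabled_ids: set of instance ids the billing sheet says
    should be actively monitored.
    """
    patches = {}
    for inst in lm_instances:
        should_be_enabled = inst["id"] in billing_enabled_ids
        # Only plan a change when LM disagrees with the system of record.
        if inst["enabled"] != should_be_enabled:
            patches[inst["id"]] = should_be_enabled
    return patches


# Made-up example: instance 2 came back disabled after rediscovery,
# and instance 3 is enabled but not on the billing sheet.
lm_state = [
    {"id": 1, "enabled": True},
    {"id": 2, "enabled": False},
    {"id": 3, "enabled": True},
]
print(plan_patches(lm_state, billing_enabled_ids={1, 2}))
```

Instances that already match are skipped, which is what keeps the number of write calls small.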
David_Bond
Professor
2 years ago
LogicMonitor’s Meraki DataSources are frankly not fit for purpose when it comes to monitoring large Meraki estates.
For those that need this, please do check out our Meraki DataMagic product ( Google: meraki datamagic ). There is a free tier and an unlimited free month trial. All you need are an Azure AD account and a Meraki API key. Setup takes under a minute.
This gets all data into a staging database, ready for reporting in LogicMonitor. We provide free DataSources so that you can query that data.
Anonymous
2 years ago
I opened a ticket with support on December 1, 2022 and they couldn’t do between then and now what you appear to have done in a few days, with actual proof. Would you please bring this up with support? They adamantly denied any kind of 429 errors and also claimed that if any 429s were experienced that the discovery would exit unsuccessfully, thus causing LM to ignore the output of the discovery.
Since there is now proof that this is happening, perhaps they can reopen the case.
klamir
Neophyte
2 years ago
We are well aware that the "automatically delete instance" setting isn't supported at this time. But how should we then keep track of all the changes that are made to our Meraki networks? Currently we monitor approx. 100 Meraki networks with LM, including more than 1200 access points - how can we make sure that we monitor "actual devices" rather than orphaned DataSource instances? We have similar business needs to Stuart's; writing a custom script to compare LM-monitored devices with actual Meraki devices and delete orphaned instances would be a last resort for us.
Nevertheless, we are glad to hear that you guys are actively trying to improve these modules 🙏
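For reference, the comparison such a script would need is essentially a set difference, roughly as sketched here. The function name and the serial numbers are made up, and matching on device serial is an assumption about what both sides have in common.

```python
def find_orphans(lm_serials, meraki_serials):
    """Compare device serials known to LM with those reported by Meraki.

    Returns (orphaned_in_lm, missing_from_lm) as sorted lists.
    Serials are normalized so case and stray whitespace do not matter.
    """
    lm = {s.strip().upper() for s in lm_serials}
    meraki = {s.strip().upper() for s in meraki_serials}
    return sorted(lm - meraki), sorted(meraki - lm)


# Made-up serials for illustration only.
orphaned, missing = find_orphans(
    ["Q2XX-AAAA-0001", "q2xx-aaaa-0002"],
    ["Q2XX-AAAA-0002", "Q2XX-AAAA-0003"],
)
print("in LM but not in Meraki:", orphaned)
print("in Meraki but not in LM:", missing)
```

Given the 429 problem described in this thread, I would only delete anything from the first list if the Meraki device list was retrieved completely, and ideally only after a device has been absent for several runs in a row.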
Mike_Moniz
Professor
3 years ago
In a previous life, I believe I saw something like that which was due to rate limits on Meraki API calls. With that many devices, perhaps you are also hitting the 64k (or so) char limit on BatchScript output?
Anonymous
3 years ago
Have had a case open since December 1 with LM Support. Their final response:
[We] really appreciate your continued patience while our teams thoroughly investigate. At this time, our Collector and Monitoring Engineering teams have reached a stopping point and there isn't anything further for them to investigate. The additional logging is showing that there is no issue with the core module behavior and we're processing the full response from the API with proper checks to ensure we've collected all data and paginated where necessary.
Supposedly they checked for pagination issues with the API, rate limiting, and I would assume batchscript output limits.
Shack
Advisor
3 years ago
I don’t know how you are set up, but when this occurs, is it happening on one Meraki device with a ton of instances or on multiple devices with instances?
Anonymous
3 years ago
None of our Meraki networks have that many devices. A couple hundred per network, tops. The average is probably 30-50 per network.