Forum Discussion

Anonymous
2 years ago

Anybody else disabling Meraki instances by default?

If you are, I’d like to know whether you’re also experiencing the problem we are, one that LM has tried and failed to troubleshoot:

For some random reason, a number of Meraki instances disappear from LM during one discovery cycle and reappear during the next. This isn’t normally a big problem, since the instances aren’t set to delete for 30 days; normally they’d just show back up with a gap in the data.

However, in our case we have a business need to have instances disabled on discovery (we charge the customer per Meraki device we actively monitor). This means that instances that have been discovered and enabled for monitoring randomly disappear and reappear as disabled instances. Any customer instance-level properties that were added to the instance are also missing from the rediscovered instance.

In the last 3 hours, 3,577 instances (out of somewhere around 18,000) have disappeared and reappeared in this way. The problem was so pervasive that I had to write a script to loop through all instances and enable them based on a master billing sheet.
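
For anyone stuck doing the same cleanup, a minimal sketch of that kind of re-enable loop is below (not my exact script). It assumes an LM API bearer token, a CSV billing sheet with deviceId/hdsId/instanceId columns, and that flipping the instance’s stopMonitoring field back to false is what re-enables polling; treat the resource path, the X-Version header, and the field name as assumptions to verify against the LogicMonitor REST API docs.

    // Sketch only: re-enable instances listed in a billing sheet via the LM REST API.
    // Portal name, CSV layout, and the stopMonitoring field are assumptions to verify.
    import groovy.json.JsonOutput
    import java.net.http.HttpClient
    import java.net.http.HttpRequest
    import java.net.http.HttpResponse

    def portal = "acme.logicmonitor.com"            // hypothetical portal
    def token  = System.getenv("LM_BEARER_TOKEN")   // LM API bearer token
    def client = HttpClient.newHttpClient()

    new File("billing_sheet.csv").readLines().drop(1).each { line ->
        def (deviceId, hdsId, instanceId) = line.split(",")*.trim()
        def uri  = URI.create("https://${portal}/santaba/rest/device/devices/${deviceId}" +
                              "/devicedatasources/${hdsId}/instances/${instanceId}")
        def body = JsonOutput.toJson([stopMonitoring: false])
        def req  = HttpRequest.newBuilder(uri)
                .header("Authorization", "Bearer ${token}")
                .header("Content-Type", "application/json")
                .header("X-Version", "3")
                .method("PATCH", HttpRequest.BodyPublishers.ofString(body))
                .build()
        def resp = client.send(req, HttpResponse.BodyHandlers.ofString())
        println("instance ${instanceId}: HTTP ${resp.statusCode()}")
    }

It uses java.net.http.HttpClient rather than HttpURLConnection because the latter doesn’t support the PATCH verb.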

  • We experience similar issues - Meraki devices disappear at random during active discovery phases.

    We monitor a couple of Meraki organizations with LM. The issue only seems to appear for one org - the one with more than 30 monitored networks. All other orgs we monitor with LM have 10 or fewer networks.

    So it seems like the issue is related to the number of networks within an org. I decided to have a closer look at the API requests against Meraki during an active discovery phase where once again lots of devices disappeared from our portal.

    Meraki offers some great endpoints for investigating API requests; I used the one documented here: https://developer.cisco.com/meraki/api-latest/#!list-the-api-requests-made-by-an-organization

    I evaluated a 20-minute window during which I was pretty sure active discovery was running. During this window, LM (or rather its collector, installed on one of our servers) made 2,000 requests. 700 of them (more than one third) failed with a 429 Too Many Requests error. If you want to reproduce the check, see the sketch at the end of this reply.

    So yes, there may be some exception handling / rate-limiting checks in the core modules. But if you hit an API that hard, requests can end up blocking each other. It seems like there is an issue with rate-limit synchronization across discovery jobs: the more networks you monitor, the more discovery jobs run in parallel, and consequently the more 429 errors you get, ending in randomly disappearing Meraki instances (because the instance couldn't be found in the Meraki response body).

    We have had a case open since April regarding this issue. I also shared my assumption with LM, along with several hundred failed Meraki API requests. Let's wait and see.
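
    If you want to run the same check against your own org, a rough sketch is below (the org ID and API key are placeholders, the field names come from the apiRequests endpoint linked above, and it ignores pagination beyond the first 1,000 results):

    // Sketch: tally dashboard API responses for the last 20 minutes of one org.
    import groovy.json.JsonSlurper

    def orgId  = "123456"                           // placeholder organization ID
    def apiKey = System.getenv("MERAKI_API_KEY")

    def conn = new URL("https://api.meraki.com/api/v1/organizations/${orgId}/apiRequests" +
                       "?timespan=1200&perPage=1000").openConnection()
    conn.setRequestProperty("X-Cisco-Meraki-API-Key", apiKey)
    def requests = new JsonSlurper().parse(conn.inputStream)

    // Overall picture: how many requests per response code?
    println("total: ${requests.size()}, by code: ${requests.countBy { it.responseCode }}")

    // Which paths are hitting the 429 limit hardest?
    requests.findAll { it.responseCode == 429 }
            .countBy { it.path }
            .sort { -it.value }
            .each { path, count -> println("${count}x 429  ${path}") }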

  • Anonymous

    I’m now experiencing this on a completely unrelated custom datasource. I built this DS because we needed a single DS that reported on collector disk usage whether or not we were monitoring the collector via SSH/WMI, so it only applies to collectors. Here’s the discovery script:

    // Active discovery: emit one instance per local disk / mount point.
    try {
        def sout = new StringBuilder(), serr = new StringBuilder()
        if (hostProps.get("system.collectorplatform") == "windows") {
            // Windows collector: list fixed disks (DriveType 3) via PowerShell/CIM.
            def proc = 'powershell -command "Get-CimInstance Win32_LogicalDisk | Where{$_.DriveType -eq 3} | Format-Table -Property DeviceID, Size, FreeSpace -HideTableHeaders"'.execute()
            proc.consumeProcessOutput(sout, serr)
            proc.waitForOrKill(1000)   // give the command up to 1 second before killing it
            // println("stdout: ${sout}")
            sout.eachLine {
                def splits = it.tokenize(" ")
                if (splits && splits[0]) {
                    // wildvalue##wildalias, e.g. "C##C"
                    println("${splits[0].replaceAll(':','')}##${splits[0].replaceAll(':','')}")
                }
            }
            return 0
        } else {
            // Linux collector: list mounts, excluding squashfs/tmpfs/devtmpfs.
            def proc = 'findmnt -D -t nosquashfs,notmpfs,nodevtmpfs'.execute()
            proc.consumeProcessOutput(sout, serr)
            proc.waitForOrKill(1000)   // give the command up to 1 second before killing it
            sout.eachLine {
                def splits = it.tokenize(" ")
                if (splits && splits[0] != "SOURCE") {   // skip blank lines and the findmnt header row
                    // wildvalue##wildalias######instance-level properties
                    println("${splits[0]}##${splits[6]}######type=${splits[1]}&size=${splits[2]}")
                }
            }
            return 0
        }
    } catch (Exception e) {
        println(e)
        return 1
    }

    I can’t see any reason the instances would not show up, especially given that the hardware hadn’t changed.

    I know for a fact that the instance disappeared and reappeared. For one, there was a No Data alert that opened a ticket in my ticketing system yesterday: the collection script didn’t gracefully handle a “-” in a place where it was expecting a number (see the guard sketched at the end of this post), so I disabled the instance. When I click the link to the alert today, it shows the alert (as cleared). However, if I try to drill into the instance from the alert, I’m taken to the root of the account, which is what happens when the target of a link no longer exists.

    However, if I navigate to the device, I can see the instance (enabled!) but no historical data before 9pm last night.

    I’ve had a case open with support since last year about this behavior on Meraki but they eventually closed out the ticket saying the new modules might fix the issues.

    I guess I’m just saying, keep an eye out for any instances that disappear and reappear.
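
    For the “-” case specifically, a small guard before the numeric conversion keeps a collection script from blowing up on it. This is just a sketch; splits[2] stands in for whichever token is supposed to carry the number:

    def raw = splits[2]
    // Skip non-numeric tokens like "-" instead of letting the conversion throw.
    if (raw?.isBigDecimal()) {
        println("size=${raw.toBigDecimal()}")
    }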