Forum Discussion

mnagel (Professor)
11 months ago

PSA: LM wipes good known properties when unknown results occur

I have recently found that, thanks to the excellent programming skills of the dev team, properties that have previously been autodiscovered can be wiped out when ephemeral issues produce unknown (no data) results. A good example is system.ips: if the device has been scanned properly in the past and a blip occurs with no data, the previous values get overwritten with just the configured IP of the device. That leads to various fun side effects, like NetFlow data not being matched to the device. To make things worse, the “no data” result does not set an internal flag to run a new AD scan earlier, so you have to wait up to 24 hours for the next regularly scheduled scan.

I created a bug ticket requesting they set that flag and run a new scan as soon as possible, but was basically told to pound sand. My workaround was to use an undocumented API endpoint to trigger Active Discovery on specified devices so I stop losing NetFlow data, and I scheduled it hourly. The “solution” I was given was to add a netflow property to hardcode the needed IP address for each device -- it works, but it is a brittle fix and leads to undesirable manual property management. Beyond that, this issue affects more than NetFlow; that was just the problem that led me to realize what was happening. Other properties routinely get clobbered in ways that could affect processing.

This class of problem (replacing good data with unknown data) frequently occurs in modules as well -- for example, a lot of the PowerShell ConfigSource modules lack sufficient error checking, so unknown results replace previously known-good results, leading to change thrashing. They also often forget to sort/normalize output, which has similar effects. The good news on those is they usually (eventually) listen to me.

Anyone who wants to use my workaround can use this script (or at least the central logic if you prefer something other than Perl). I still lose data, but the window is smaller.
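For reference, the central logic of the workaround looks roughly like the sketch below (in Python rather than Perl, for brevity). The LMv1 request-signing scheme is the one documented for LM's public REST API, but the scheduleAutoDiscovery endpoint path, account name, and credentials are assumptions/placeholders -- the endpoint is undocumented, so verify it against your own portal before relying on it.

```python
import base64
import hashlib
import hmac
import time
import urllib.request

def lmv1_auth_header(access_id, access_key, method, resource_path,
                     body="", epoch_ms=None):
    """Build the LMv1 Authorization header documented for LM's REST API:
    base64( hex( HMAC-SHA256( key, verb + epoch + body + path ) ) )."""
    if epoch_ms is None:
        epoch_ms = int(time.time() * 1000)
    msg = f"{method}{epoch_ms}{body}{resource_path}"
    digest = hmac.new(access_key.encode(), msg.encode(),
                      hashlib.sha256).hexdigest()
    signature = base64.b64encode(digest.encode()).decode()
    return f"LMv1 {access_id}:{signature}:{epoch_ms}"

def schedule_auto_discovery(account, device_id, access_id, access_key):
    """POST to the (undocumented -- path is an assumption) endpoint that
    asks LM to run Active Discovery on one device now."""
    path = f"/device/devices/{device_id}/scheduleAutoDiscovery"
    url = f"https://{account}.logicmonitor.com/santaba/rest{path}"
    req = urllib.request.Request(url, data=b"", method="POST")
    req.add_header("Authorization",
                   lmv1_auth_header(access_id, access_key, "POST", path))
    req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Loop `schedule_auto_discovery` over the device IDs you care about and run it from cron hourly.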

7 Replies

  • Ah, that’s from Device_BasicInfo, which really has too many properties for one module. They could put a try/catch block around the whole thing, but then a single failure would force the entire output to be ignored, meaning that if one property failed to be fetched, updates for the other properties wouldn’t be processed either. They really should break that PS up into multiple modules, each with its own try/catch block; that way, if one property fails, the blast radius is reduced.
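To illustrate the per-property idea (sketched in Python for brevity -- the real Device_BasicInfo is PowerShell, and the fetcher names here are invented): each property lookup gets its own try block, so one transient failure drops only that property instead of the whole batch.

```python
# Hypothetical per-property fetchers; in the real module these would be
# WMI/registry lookups. Any of them may raise on a transient failure.
def get_serial():
    return "ABC123"

def get_model():
    raise RuntimeError("transient WMI timeout")  # simulate one failing lookup

FETCHERS = {
    "auto.serial": get_serial,
    "auto.model": get_model,
}

def collect_properties(fetchers):
    """Run each fetcher in its own try/except so a single failure
    doesn't discard every other property's update."""
    props, failed = {}, []
    for name, fetch in fetchers.items():
        try:
            props[name] = fetch()
        except Exception as exc:
            failed.append((name, str(exc)))  # skip this one, keep the rest
    return props, failed

# Emit in the key=value format LM parses from stdout.
props, failed = collect_properties(FETCHERS)
for key in sorted(props):
    print(f"{key}={props[key]}")
```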

  • For PropertySources and AD scripts (and I think ConfigSources), if the script returns a non-zero exit code, LM is supposed to ignore the task output altogether, so that behavior should be codable using try blocks. Are you saying there are modules where a failure can occur and the script still returns 0 without all the previously known properties? Which module(s)?

    We don’t have access to the code that does system discovery, though, so system.ips getting changed when a discovery task fails is worrisome.
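If that contract holds (non-zero exit means LM discards the task output), the whole script can be wrapped in a guard like this sketch -- Python for brevity, with a hypothetical collect function -- so any failure leaves previously known values untouched rather than emitting partial output:

```python
import sys

def safe_collect(collect):
    """Run a collection function and return a process exit code.
    On any failure, return 1 so LM (per the stated contract) ignores
    this task's output instead of replacing known-good values."""
    try:
        results = collect()
        if not results:
            raise RuntimeError("collection returned no data")
        for key, value in sorted(results.items()):
            print(f"{key}={value}")
        return 0
    except Exception as exc:
        print(f"collection failed: {exc}", file=sys.stderr)
        return 1  # non-zero exit: output should be discarded

# At the bottom of a real module script:
#   sys.exit(safe_collect(my_collect_function))
```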

  • Yeah, I also generally refuse to use the clone/edit method they espouse, due to the obvious maintenance nightmare. I recognized that almost immediately after starting to use the product many years ago and pushed for inheritance for a long time before giving up. My code is outside LM; it downloads ConfigSource versions into git and emails diffs via commit hooks (since LM unbelievably lacks support for that natively). I have had to add numerous extra checks to avoid committing garbage to our repos. The garbage still exists in the LM change history, but not in our repos -- and I frequently get error reports from the script when garbage is detected and skipped.
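The extra checks are along these lines (a simplified sketch; the patterns and threshold are illustrative, not the exact ones): before committing a fetched config, reject it if it is empty, starts with an error marker instead of config text, or shrank drastically relative to the previous revision.

```python
import re

# Illustrative markers -- the real hook's checks are more extensive.
ERROR_MARKERS = re.compile(r"^(false|null|error|exception)\b", re.IGNORECASE)

def looks_like_garbage(new_text, old_text=None, min_size_ratio=0.5):
    """Return True if a fetched config should be skipped rather than committed."""
    stripped = new_text.strip()
    if not stripped:
        return True  # empty fetch
    if ERROR_MARKERS.match(stripped):
        return True  # an error string where a config should be
    if old_text:
        old_len = len(old_text.strip())
        if old_len and len(stripped) < old_len * min_size_ratio:
            return True  # config shrank suspiciously vs. the last good copy
    return False
```

A commit hook can then skip the file and send an error report instead of polluting the repo.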

  • For modules, yes, the cases I have seen are things like AD ConfigSource modules where lack of exception handling leads to fully or partially empty data replacing previous results. I’ve also often seen similar issues with network devices, where errors, “false”, etc. replace the entire previous config. The newer modules are better but still suffer from this problem sometimes (see all the extra code I’ve had to add to avoid thrashing in my local repos in the lm-get-configs script in my GitHub repo).

    I never did get a good answer on why an access blip causes built-in properties like system.ips to thrash, nor did they take it seriously enough to escalate to the developers.

  • Lack of exception handling should at least be in their Jira backlog to fix. If it’s not, that’s really concerning.

    Not fully understanding your setup, but if you’re like me, you want to avoid forking modules away from the repo version. I assume your deviations are what’s in your GitHub repo?

    I do wish they’d make the system discovery code accessible so we could help them debug it. They’ve been bleeding really good talent for about a year now, and code quality has suffered for it.

  • Related to the property thrashing, we see the same thing all the time with auto properties. Admittedly, it could be tricky to code for, but setting a validity lifetime for those properties could be one solution, rather than deleting them outright because the corresponding modules failed to get data. This is a sample of what I see happen all the time.
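The timer idea can be sketched as a last-known-good cache with a TTL (Python, all names hypothetical): a failed poll keeps serving the previous value until its validity window expires, instead of the property being deleted the moment a module returns no data.

```python
import time

class LastKnownGood:
    """Keep an auto property's last good value alive for `ttl_seconds`
    after the module stops returning data, instead of wiping it at once."""

    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock   # injectable for testing
        self._store = {}     # name -> (value, last_good_timestamp)

    def update(self, name, value):
        """Record a fresh, known-good value."""
        self._store[name] = (value, self.clock())

    def get(self, name):
        """Return the value if still within its validity lifetime, else None."""
        entry = self._store.get(name)
        if entry is None:
            return None
        value, ts = entry
        if self.clock() - ts > self.ttl:
            del self._store[name]  # only now does the property expire
            return None
        return value
```

A discovery blip then leaves system.ips-style properties intact for the whole TTL, and a successful scan simply refreshes the timestamp.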