Forum Discussion

mnagel
8 years ago

datasource migration function

I have run into too many cases now where a new but slightly different DS is set up due to LM support actions, upgrades, etc., and the result is lost or discontinuous data. A good example I recently encountered is with NTP. The standard DS was not working in all cases. I was given a new DS that uses Groovy, and it works (which I appreciate!). But the datapoint list and names have changed, and even if they had not, there is no way to carry data history over from the old DS to the new DS. My recommendation is to add a migrate function that lets you indicate how to map old datapoints to new ones in such a situation and thus avoid data loss. Building a default migration ruleset into a new DS would be a bonus: this could allow for zero-touch data migrations in at least some cases.

Thanks,
Mark
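
To make the request above concrete, here is a minimal sketch of what a datapoint migration ruleset could do. It is purely illustrative: `migrate_history`, the `(timestamp, value)` sample format, and the datapoint names in the usage lines are assumptions for this example, not part of any LogicMonitor API.

```python
# Hypothetical sketch: none of these names come from LogicMonitor.
from typing import Dict, List, Optional, Tuple

Series = List[Tuple[int, float]]  # (epoch seconds, value) samples


def migrate_history(
    old_history: Dict[str, Series],
    mapping: Dict[str, Optional[str]],
    new_datapoints: List[str],
) -> Tuple[Dict[str, Series], List[str]]:
    """Carry old datapoint history over to a new DS using a rename map.

    mapping: old datapoint name -> new name, or None to drop it on purpose.
    Returns (history keyed by new names, old datapoints with no rule).
    """
    migrated: Dict[str, Series] = {name: [] for name in new_datapoints}
    unmapped: List[str] = []
    for old_name, series in old_history.items():
        if old_name not in mapping:
            # No rule for this datapoint: report it instead of silently losing it.
            unmapped.append(old_name)
            continue
        new_name = mapping[old_name]
        if new_name is None:
            continue  # explicitly retired datapoint
        if new_name not in migrated:
            raise KeyError(f"mapping targets unknown datapoint {new_name!r}")
        migrated[new_name].extend(series)
    for series in migrated.values():
        series.sort(key=lambda sample: sample[0])  # keep samples in time order
    return migrated, unmapped


# Example ruleset (made-up datapoint names).
old = {"offset": [(160, 0.4), (100, 0.5)], "stratum": [(100, 2.0)], "jitter": [(100, 0.1)]}
ruleset = {"offset": "offsetMs", "stratum": "stratum"}
history, unmapped = migrate_history(old, ruleset, ["offsetMs", "stratum", "delay"])
```

Returning the unmapped datapoints separately means a tool built this way could warn about history that would be left behind before anything is committed.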

 

  • Two things:

    1.  I'd greatly appreciate it if you could share that datasource.  Is this the one in the official repository?

    2.  I largely agree with your point that it's not always obvious when a datasource change or update is going to cause data loss - a pain I've experienced a few too many times.  Even when updating official datasources, it's a risk due to the custom applies-to functions we might be using.  It would be great if there were at least some logic that allowed importing a datasource while letting the administrator choose whether or not to override the applies-to function.  Or maybe even (for advanced users) allow manual changes to the XML doc before importing to prevent datapoint renaming.

  • I agree too that updating DataSources presents a data loss risk.  I've stopped updating DataSources now, unless we find a bug.

  • @Brandon I am not sure if the fixed NTP DS is the one now in the repo, I was given it by support after I found the original one broken on various devices.

    @Mosh Yeah, I have been bitten by the DS import many times.  A big problem is the difference display (see recent FR I posted on that).  A simple first pass on this would be to refuse to import a replacement DS that would remove any datapoints unless a force indicator is given.  That is really a no-brainer and would avoid many problems.  My earlier recommendation covers how to handle transformation of one DS to a new DS when the datapoints are basically the same from one to the other, but renamed/reshuffled.  In my experience, there is not nearly enough attention to this sort of thing, which is unfortunate, as data loss is a really bad result in a monitoring solution.
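
The "simple first pass" described above can be sketched as a set comparison before import. This is only an illustration under assumed names: `check_import`, `DatapointRemovalError`, and the `force` flag are hypothetical and do not correspond to an existing LogicMonitor feature.

```python
# Hypothetical sketch of an import guard; names are illustrative only.
from typing import Iterable, Set


class DatapointRemovalError(Exception):
    """Raised when an import would drop datapoints and force was not given."""


def check_import(existing: Iterable[str], incoming: Iterable[str], force: bool = False) -> Set[str]:
    """Return the datapoints the import would remove, refusing unless forced."""
    removed = set(existing) - set(incoming)
    if removed and not force:
        raise DatapointRemovalError(
            "import would remove datapoints: " + ", ".join(sorted(removed))
        )
    return removed
```

Adding datapoints passes silently; only removals (which is also how a rename looks to this check) need the explicit override.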

  • This is another area in LM long overdue for improvements, and it just came up again due to the new "Gen 3" VMware datasources.  In some cases, "Gen 2" datasources were split into multiple new datasources (a bit trickier to deal with), but in other cases the name was changed and either nothing else changed or perhaps some datapoints changed.  Changing the name of a datasource causes all of the following issues, and probably others I am not thinking of:

    * historical data loss
    * reference breakage (widgets, alert rules, etc.)
    * instance tuning loss (custom thresholds, instance descriptions, group tags, etc.)

    It is understandable that the naming change clearly makes the DS set "go together", but the benefit of that is far lower than the problems created.  If there were a method to upgrade/migrate the existing datasources so the new ones take effect without breaking stuff, as I asked for in Aug 2017, that would indicate the developers understand this system is for monitoring and should not be arbitrarily broken for aesthetic reasons like in this case.  I was advised by support that the solution is just "don't use the new datasources".  Surely this is not a suitable answer?  I have been told there is work being done on this front, but it is unclear what will come of it and when.

  • Hi, 

    Monitoring Engineering team lead here.

    I wrote these and would like to provide some insight into why the VMware Gen3 datasources are the way they are. We introduced a slew of history-breaking changes, which run a lot deeper than aesthetics, for many reasons, including better alerting (e.g. ESX datasources only alert when a server is in standalone mode), new data points (status data points for everything from VMs to datastores, plus support for relaying vCenter alerts), and easing the collector's load on the VMware API. And yes, aesthetics played a part; I think these look nicer, but I'm biased.

    In the end I made the call to split and rename the datasources for the following reasons.

    • vCenter and ESXi are different products; they expose similar APIs but respond to them in different ways. We want the flexibility to extract different information from both these products independently.
    • Improved monitoring now > keeping history. It's always a balance but in this case new features justified it.
    • We renamed it in such a way that customers could choose to keep the old ones without us removing some data points. When doing destructive operations it's always better to give you the power to decide if/when to do it.
    • These datasources pave the way for other features like topologies and such.
    • Some datasources change AD behavior completely (HW sensors) causing complete loss of history.

    That being said, we do need better module migration tools to make these changes less painful. We are keenly aware of it and it’s actively being worked on.

    Regards,

  • We're not currently looking into a way to "stitch" data across module name change updates, but we are working on a feature that would let you easily update and keep your changes to:

    -AppliesTo
    -AD/Collection intervals
    -AD Filters
    -Alert threshold changes

    It would be very similar to the Group/Device/Instance level Alert Threshold tuning we allow now, but we would extend it to the account-level, and add the above. This means that for core LogicModules, you would just update the main definition, and apply your tweaks on the Device tree.

    I know that doesn't solve the stitching problem, but it should make merge/updates easier, and would not require the use and management of clones.

    We also looked into a more advanced diff & merge option, but generally we know if you changed one of those things above, you want to keep it when updating.

    Changing something not listed above would require creating a new module definition.

    The great thing about this approach is that minor changes to the module don't result in a brand new module that you have to clone or manage.

    If you make frequent customizations to things not on that list, or you already have a lot of clones, there may be some pain in migrating to the new paradigm.

    This doesn't preclude further enhancements in the future, but we think this is a good iteration that will save lots of time with updates going forward. Any feedback is appreciated.
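
As a rough illustration of the update flow described above (keep local tweaks to a fixed list of settings, take everything else from the new definition), here is a minimal sketch. The field names, `extract_tweaks`, and `apply_update` are assumptions made up for this example and say nothing about how the actual feature is implemented.

```python
# Hypothetical sketch: field and function names are illustrative only.
from typing import Any, Dict

# Settings that survive an update, mirroring the list above.
PRESERVED_FIELDS = ("applies_to", "collect_interval", "ad_interval", "ad_filters", "thresholds")


def extract_tweaks(shipped: Dict[str, Any], local: Dict[str, Any]) -> Dict[str, Any]:
    """Collect local changes to preserved fields relative to the shipped definition."""
    return {
        name: local[name]
        for name in PRESERVED_FIELDS
        if name in local and local[name] != shipped.get(name)
    }


def apply_update(new_definition: Dict[str, Any], tweaks: Dict[str, Any]) -> Dict[str, Any]:
    """Start from the new definition and re-apply the preserved tweaks on top."""
    updated = dict(new_definition)
    updated.update(tweaks)
    return updated


shipped_v1 = {"applies_to": "isLinux()", "collect_interval": 60, "script": "v1"}
local = {"applies_to": "isLinux() && !isLab()", "collect_interval": 60, "script": "v1-edited"}
shipped_v2 = {"applies_to": "isLinux()", "collect_interval": 120, "script": "v2"}

result = apply_update(shipped_v2, extract_tweaks(shipped_v1, local))
```

In this sketch the edited `script` is not carried forward, which matches the statement that changing anything outside the list would require a new module definition.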

  • It is nice to hear this is getting some attention, but there are definitely more items needed.  There are at least two different issues with LogicModule maintenance. 

    The one in this F/R relates to upgrades (e.g., new version of VMware modules).  The hope was that you could retain previous data by matching DS/DP in the old to the new in a merge operation when importing the new datasource.  I get it is complex, but it is also frustrating to see loss of historical data treated so casually.

    The one you are referencing is about module parameter override preservation.  I like what you have above, but really, the correct solution for this is to allow linked cloning with inheritance and individual override of any or all elements in the module. There are already many examples of almost-but-not-quite-identical modules that could benefit from this.  A great example of how that leads to problems is the excellent changes made by Steve Francis to support ActualSpeed ILPs for interfaces.  Because that applies to only one (albeit common) case, all other interface DSes fail to get the same behavior.  If that were done in a base interface module and the rest inherited from there with overrides/additions for their specific needs, that could be a general improvement for all interfaces.

    Lacking that, I would ask that in addition to the above, the following also be preserved:

    • additional datapoints (including scripts, etc.)
    • changes to datapoint settings (perhaps not scripts, but alert settings including templates)

    I am dealing with a specific use case right now where a fairly complicated NetApp DS needs to be updated to alarm when space available drops below a threshold.  We can add a custom threshold, but then the alert template is default and the units are bytes.  I can fix this, but this creates a maintenance headache regardless of whether I update in place (new versions will kill my changes) or clone (new versions require manual review to sync into the clone(s)). 

    That said, the filter preservation would at least avoid issues with my changes to BGP- (filters admin down sessions) and changes to the collection interval for HP network devices (default is set to 1 day instead of 1 hour), so I will take what I can get!
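
For what it is worth, the linked cloning with inheritance suggested above can be sketched in a few lines. This is a toy model, not a description of any existing LogicMonitor capability: the `Module` class, its `resolve` method, and the setting names are all hypothetical.

```python
# Hypothetical sketch of linked clones with per-element overrides.
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Module:
    name: str
    parent: Optional["Module"] = None
    overrides: Dict[str, Any] = field(default_factory=dict)

    def resolve(self, key: str) -> Any:
        """Return the nearest definition of key, walking up the parent chain."""
        node: Optional["Module"] = self
        while node is not None:
            if key in node.overrides:
                return node.overrides[key]
            node = node.parent
        raise KeyError(key)


# A base interface module holds shared behavior; a linked clone overrides one setting.
base = Module("base_interface", overrides={"speed_source": "snmp", "collect_interval": 60})
clone = Module("vendor_interface", parent=base, overrides={"collect_interval": 120})

# A later improvement to the base is picked up by every clone that does not override it.
base.overrides["speed_source"] = "actual_speed_ilp"
```

Because the clone stores only its differences, a fix like the ActualSpeed ILP change would need to be made once in the base module rather than in each near-identical copy.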