Forum Discussion

The_Other_Josh
5 years ago

Active Discovery and instance deletion

I keep waffling on whether this is a bug, a feature request, or I'm just thinking about it wrong.

 

Currently, if you configure instances to delete automatically, that prevents any alarms for them (the instance doesn't exist any more).  At first, support told me to use 'delete after 30 days', which makes sense at a quick glance but doesn't actually work: the instance doesn't exist, so there is no incoming data and hence no alerts to trigger.  The 30 days is just a way to preserve data in case the instance comes back (failed hardware, intermittent service, etc.).

 

This means that you cannot enable automatic deletion for any instance where you need to alert on a change that would result in it being filtered from active discovery.

 

Initially I had enabled automatic deletion for network interfaces, to keep things clean automatically when modules are added or removed or interfaces change status.  However, we found that this meant we would never get an alert for an interface being down (since down interfaces are filtered out by AD).  I disabled automatic deletion, but that leaves two problems:

 

1) We have to clean things up manually (at the very least when changing hardware, modules, etc.)

2) Even though the instance still exists, it is no longer picked up by AD, so none of its properties are updated.  We use an 'alert_enable' string in the description, but AD never updates the description, so the instance continues to alert even after the description has changed on the device.

 

Options:

  1) Stop filtering in AD (expands instance count, collector resources, etc)

  2) Have LM alarm on removal of an instance (possibly overload the 'no data' rules, or add a different flag, possibly related to whether instance deletion is immediate or after 30 days)

  3) If not deleting instances, possibly have AD update any properties or other values that exist before doing the filtering.  This would let AD update descriptions, admin status, etc. on an interface (allowing it to clear things like description, operational status, etc.).  That may be undesirable if people want to see the properties and values as they existed when the instance was 'live'.

 

I'm just not that happy with any of the current solutions, but I'm not convinced these possible features would be any better or more desirable.

  • Anonymous
    On 5/29/2020 at 9:35 AM, The Other Josh said:

    Currently, if you configure instances to delete automatically, that prevents any alarms for them (the instance doesn't exist any more).  At first, support told me to use 'delete after 30 days', which makes sense at a quick glance but doesn't actually work: the instance doesn't exist, so there is no incoming data and hence no alerts to trigger.  The 30 days is just a way to preserve data in case the instance comes back (failed hardware, intermittent service, etc.).

    That's exactly the reason for this feature. The other reason is that if the instance comes back, the outage just shows up as a gap in the data, instead of the historical data being lost and the instance resuming as a "new" one.

    On 5/29/2020 at 9:35 AM, The Other Josh said:

    This means that you cannot enable automatic deletion for any instance where you need to alert on a change that would result in it being filtered from active discovery.

    That's right. If an object has 4 states and you need to monitor for two of those states, those two states shouldn't be used as discovery filters, period. You'd need to find another attribute you can use to filter interfaces. AdminStatus is probably the best one for programmatically distinguishing between interfaces that are down on purpose vs. those that are accidentally down.
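As a rough illustration of filtering on AdminStatus instead of the state you alert on, here is a minimal Python sketch. The `Interface` class and the `discover` and `should_alert` functions are hypothetical names for illustration, not LogicMonitor APIs; the status values follow the standard IF-MIB convention (1 = up, 2 = down).

```python
from dataclasses import dataclass

ADMIN_UP = 1  # IF-MIB ifAdminStatus: 1 = up, 2 = down, 3 = testing
OPER_UP = 1   # IF-MIB ifOperStatus: 1 = up, 2 = down, ...


@dataclass
class Interface:
    name: str
    admin_status: int
    oper_status: int


def discover(interfaces):
    """Discovery filter: keep interfaces that are administratively up.

    Operational status is deliberately NOT part of the filter, so an
    interface that goes down stays discovered and can still alert.
    """
    return [i for i in interfaces if i.admin_status == ADMIN_UP]


def should_alert(iface):
    """Alert only when an interface that is meant to be up is not up."""
    return iface.admin_status == ADMIN_UP and iface.oper_status != OPER_UP


interfaces = [
    Interface("Gi1/0/1", admin_status=1, oper_status=1),  # healthy
    Interface("Gi1/0/2", admin_status=1, oper_status=2),  # down by accident
    Interface("Gi1/0/3", admin_status=2, oper_status=2),  # shut on purpose
]
discovered = discover(interfaces)
alerting = [i.name for i in discovered if should_alert(i)]
```

Here the interface that was shut down on purpose is never discovered, while the one that is down by accident remains an instance and alerts.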

     

    Theoretical rambling (needs to be thoroughly thought through):
    You could get fancy and create an automatic aging property to exclude instances from discovery X days after the inactive state is reached. On the first discovery cycle where the instance is down, you could set a timestamp in a property. The discovery script could check for the presence of this property and exclude the instance from the discovery output if the difference between the property's value and now is greater than X days (X could also be set as a property). This would allow alarming on an instance for X days after it goes down, then exclude it from discovery. Since the old interfaces DS uses SNMP as the collector, you'd have to switch over to a scripted DS. I think there is one coming out soon (or already available?) that is scripted to help with performance on very large devices. The property would likely need to be a device-level property holding a list of key/value pairs, since 'instanceProps.get()' doesn't work (or does it?) in batchscript.
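The aging idea above could look roughly like the following Python sketch. It is only a model of the logic, not a LogicMonitor discovery script: `age_out`, `down_since` (standing in for the device-level key/value property), and `max_age_days` (the X in the description) are all hypothetical names.

```python
from datetime import datetime, timedelta


def age_out(instances, down_since, now, max_age_days):
    """Return (discovered names, updated first-seen-down timestamps).

    instances:  dict of instance name -> True if currently up
    down_since: dict of instance name -> datetime it was first seen down
    """
    limit = timedelta(days=max_age_days)
    discovered = []
    updated = {}
    for name, is_up in instances.items():
        if is_up:
            # Up again: discover it and drop any stored timestamp.
            discovered.append(name)
            continue
        # Down: record the timestamp on the first cycle, reuse it afterwards.
        first_down = down_since.get(name, now)
        updated[name] = first_down
        # Stay discovered (and able to alarm) until the aging window ends.
        if now - first_down <= limit:
            discovered.append(name)
    return discovered, updated


now = datetime(2020, 6, 1)
discovered, props = age_out(
    {"Gi1": True, "Gi2": False, "Gi3": False},
    {"Gi3": datetime(2020, 5, 1)},  # Gi3 has been down for a month
    now,
    max_age_days=7,
)
```

One consequence of this sketch: an instance that is already down the first time the device is discovered would still be included for X days, so that case would need separate handling.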

  • I ran into this again, and I think that the current implementation is lacking.

     

    The problem is that there are many types of instances that can be 'down' in normal operation, but you want to alarm if they were up and are now down.  However, if an instance is down when you add the device, you don't want it added and then have to delete it manually.

     

    Several examples:

    Unused redundant power supply

    Unused stack port

    Unused interface

    The current configuration leads to errors; I had support tell me to switch to 'save for 30 days' to get alarms for deleted instances, and today found that the 'Cisco Switch Stack Ports-' datasource from LM has the same problem (it filters on status and alarms on status).  These are also the worst type of errors because they are generally only visible when bad things happen and you don't get notified.

    You could have an option for a filter to apply only on the first device discovery (I don't know what other issues that might create).  You could also have instances alarm if they are removed by active discovery (maybe only if the 'save for 30 days' option is enabled).  I am in favor of having instances with the '30 day' option still be visible in the tree so that you can reference historical data, so it wouldn't be too hard to extend that concept to an alarm.
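To make the "alarm on removal" idea concrete, here is a small Python sketch of comparing two discovery runs. This is not existing LogicMonitor behavior; `removal_alerts` and `retention_days` are hypothetical names, with `retention_days` standing in for the 'save for 30 days' setting.

```python
def removal_alerts(previous, current, retention_days):
    """Return one alert message per instance that discovery dropped.

    Alerts are raised only when removed instances are being retained
    (retention_days > 0), mirroring the 'save for 30 days' option.
    """
    if retention_days <= 0:
        return []
    removed = sorted(set(previous) - set(current))
    return [f"instance removed by discovery: {name}" for name in removed]


alerts = removal_alerts(
    previous={"StackPort1", "StackPort2"},
    current={"StackPort1"},
    retention_days=30,
)
```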

  • Correction: the filter option shouldn't apply only at device discovery; it should apply at instance discovery (i.e., it would stop a new instance from being added, but it would not be used to remove an existing instance).
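A minimal Python sketch of that "filter on add, never on remove" behavior follows. It assumes discovery can see everything the device reports before filtering; `reconcile` and `passes_filter` are hypothetical names, not part of any LogicMonitor API.

```python
def reconcile(existing, reported, passes_filter):
    """Return the instance names to keep after a discovery cycle.

    existing:      set of instance names already on the device in the portal
    reported:      dict of name -> attributes for everything the device reports
    passes_filter: function(attributes) -> bool, the discovery filter

    The filter only decides whether a NEW instance is added. An existing
    instance is kept for as long as the device still reports it at all.
    """
    return {
        name
        for name, attrs in reported.items()
        if name in existing or passes_filter(attrs)
    }


def is_up(attrs):
    return attrs["oper"] == "up"


kept = reconcile(
    existing={"Port1", "Port9"},
    reported={
        "Port1": {"oper": "down"},  # existing, now down: kept, can alert
        "Port2": {"oper": "down"},  # new and down: not added
        "Port3": {"oper": "up"},    # new and up: added
    },
    passes_filter=is_up,
)
```

In this sketch `Port9` is dropped because the device no longer reports it at all (for example, a removed module), which is different from failing the filter.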

  • I figured things out after contacting support. I got twisted up because it appeared that we had lost a stack port but it wasn't alerting.  It turns out that the port was never up, and that put me on the wrong track.