Forum Discussion

mnagel
7 years ago

convert alert status to unknown

I have a suggestion to fix an architectural problem in LM related to missing data.  Right now, knowing that data is missing is very difficult, nearly intractable.  In some cases, with enough digging, you can find that a DS author chose to set a canary datapoint that warns on no data and has no other meaning, so it can be used safely in alert rules.  In most cases this is not true (or if it is, it is indiscernible without code review).

At the same time, LM keeps the last status for as long as no data arrives.  For example, if a disk hits 99% full and is then removed, the datapoint stays in a non-OK status indefinitely, until the instance is purged.  My suggestion is that after some time passes (configurable at one or more levels), a datapoint with no data changes its status to unknown (or whatever label you prefer that means the same thing).  Then we can write alert rules that handle unknown distinctly, knowing it can be detected on any desired datapoint.  If a problem was detected before the transition to unknown, it is no longer considered a problem.  If the instance starts producing data again and the problem actually persists, it will re-arm with the correct status.

The key problem I see is that the interval I would want before giving up and converting to unknown is probably longer than I would want a no-data condition to go undetected.  Given that, it may be better to make unknown a first-class condition (parallel to status).  I still think status should eventually reset once enough time passes without data; I am just not sure it should use the same timeframe.
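
A minimal sketch of the behavior proposed above, assuming two separately configurable timeouts: a short one that raises a no-data flag alongside the status, and a longer one after which the stale status is replaced by unknown.  None of this is LM code; the names `DatapointState`, `effective_state`, `no_data_after`, and `reset_after` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass

UNKNOWN = "unknown"  # hypothetical label; "stale" would work equally well


@dataclass
class DatapointState:
    """Hypothetical record of what is known about one datapoint."""
    status: str          # last evaluated status, e.g. "ok", "warn", "error"
    last_data_ts: float  # time of the last successful sample, epoch seconds


def effective_state(dp: DatapointState, now: float,
                    no_data_after: float, reset_after: float) -> tuple[str, bool]:
    """Return (status, no_data) for a datapoint at time `now`.

    no_data is a condition parallel to status: it becomes True once the gap
    since the last sample reaches `no_data_after` seconds.  The status itself
    is only replaced by UNKNOWN once the gap reaches the (normally longer)
    `reset_after` seconds.
    """
    gap = now - dp.last_data_ts
    no_data = gap >= no_data_after
    status = UNKNOWN if gap >= reset_after else dp.status
    return status, no_data


# Example: a disk reported "error" at t=0 and was then removed.
disk = DatapointState(status="error", last_data_ts=0.0)
for t in (300, 900, 90_000):
    print(t, effective_state(disk, now=t, no_data_after=600, reset_after=86_400))
```

With these assumed thresholds (10 minutes and 24 hours), the removed disk keeps its error status at first, gains the no-data flag after 10 minutes, and drops to unknown after a day.  An alert rule could then match on the no-data flag or on unknown without relying on a per-DataSource canary datapoint.  If data resumes, a fresh sample would update `status` and `last_data_ts`, so a problem that still exists re-arms normally.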

  • I agree, this would be a good feature to have.

  • Any thoughts from LM on when this will happen?  We had a case today that was very confusing until we realized what was happening.  A switch rebooted due to a power failure and generated a reboot alert.  Because something was causing the SNMP process on the switch to spike the CPU, we ended up disabling SNMP for now.  The same reboot alert continues to fire indefinitely until the instance alert is manually disabled.  If that alert state had changed from warn to unknown (or stale, or whatever you want to call it), we could have avoided sending meaningless alerts based on stale information.  What is frustrating is that there is no way to handle this short of some API hacks.  We really need LM to address this problem.

    I would really like to see LM respond to every feature request with at least some indication of whether and when it will be addressed.