Forum Discussion

mnagel
4 years ago

"look at me" alerts for widgets

I have spent a lot of time over the past few years grappling with how to handle alerting within LM. The "LM way" is to send few actual alerts (via email, etc.) and instead review aggregates on dashboards. But reviewing dashboards is a mindless, repetitive task that people should not have to do just to pick up on problems, and flooding inboxes is the only other option.

My suggestion is to implement a method within widgets to alert when the data they contain exceeds thresholds (or is abnormal). As a very specific example, if I set up a widget to show all core switch interface error rates, I would want to set an alert on the widget itself to let folks know it needs investigation, rather than have every interface alarm individually. I understand that cluster alerts could handle some of this in a very rough manner (I have found cluster alerts can rarely be used because of limitations in how they are defined), but a widget that can alert when its conditions are met would go above and beyond cluster alerts. In the port errors example, I might set the condition to "one or more ports with at least 1% errors in or out" for core switches and "3 or more ports with at least 2% errors in or out" for access switches.

A widget in the "look at me" state could also be flagged in the dashboard menu for drill-down purposes. That state should also be usable in alert rules, where it would represent the rollup condition of the widget instead of many separate alerts for its individual datapoints.
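To make the port errors conditions concrete, here is a minimal sketch of how such a rollup rule could be evaluated. This is not an LM feature or API; `WidgetRule`, `look_at_me`, and the `(name, in %, out %)` tuple format are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class WidgetRule:
    """Hypothetical rollup rule for one widget."""
    min_ports: int        # number of breaching ports needed to flag the widget
    min_error_pct: float  # per-port error threshold, in or out, in percent


def look_at_me(ports, rule):
    """Return (flagged, breaching_port_names).

    ports is an iterable of (name, in_error_pct, out_error_pct) tuples,
    an assumed shape for the data shown in the widget.
    """
    breaching = [
        name
        for name, in_pct, out_pct in ports
        if max(in_pct, out_pct) >= rule.min_error_pct
    ]
    return len(breaching) >= rule.min_ports, breaching


# The two conditions from the example above.
CORE = WidgetRule(min_ports=1, min_error_pct=1.0)
ACCESS = WidgetRule(min_ports=3, min_error_pct=2.0)
```

With this shape, a single widget produces one flagged/not-flagged state plus the list of ports that caused it, which is the rollup an alert rule would consume in place of one alert per interface.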

  • Since I can't wait for this, we now have code that grabs widget data for all supported widget types, incorporated into our existing backup script (which regularly pulls virtually anything I can get from the API into a Git repo). Most issues can be detected by exception (a non-200 status code); some require a bit more analysis (no data in any line of a cgraph, for example). This is working reasonably well for the first phase, which is to know about busted widgets before we are embarrassed during a client review. The next phase will be to analyze the data more specifically to the context, once I figure out how to represent the widget check requirements.
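A rough sketch of the first-phase check described above, not the actual script. It assumes the widget data has already been fetched and parsed, and the payload shape (a `lines` list whose entries each carry a `data` list) and the `"cgraph"` type string are assumptions for illustration, not the documented API response.

```python
def _line_has_data(line):
    """True if the line carries at least one non-null value."""
    return any(value is not None for value in line.get("data", []))


def check_widget(widget_type, status_code, payload):
    """Return a list of problem strings; an empty list means the widget looks fine.

    widget_type, and the payload layout inspected below, are hypothetical.
    """
    # Most breakage shows up as a failed fetch.
    if status_code != 200:
        return [f"HTTP {status_code}"]

    problems = []
    # Custom graphs need a closer look: a 200 response can still be empty.
    if widget_type == "cgraph":
        lines = (payload or {}).get("lines", [])
        if not any(_line_has_data(line) for line in lines):
            problems.append("no data in any line")
    return problems
```

Returning a list of problems rather than a boolean leaves room for the second phase, where context-specific checks could append their own findings per widget.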