B1llw
Neophyte
2 months agoAlert Tsunami: Why the Huge Delay and Flood of Post-Resolution Power Alerts?
Subject: Alert Tsunami: Why the Huge Delay and Flood of Post-Resolution Power Alerts?
Hello LM Exchange community and LogicMonitor team,
We recently experienced an issue that's causing significant frustration and making our alerting system less reliable. We had a couple of anticipated power cable pull-outs (testing/maintenance), which were quickly resolved.
However, we then received a massive backlog of LogicMonitor alerts for this event hours after the issue was fixed and the system logs were clear.
The Problem
- Massive Alert Delay: The initial power loss events occurred and were resolved around 7:00 PM and 8:00 PM (based on the Lifecycle Log). However, we started getting a huge flood of critical alerts via email at 9:13 PM, 9:43 PM, 10:13 PM, and 10:43 PM—hours after the issue had been mitigated and redundancy was restored.
- Excessive Alert Volume: We received dozens of separate critical alerts (e.g., LME205086576, LME205086578, etc.) for a single, contained event, all arriving en masse hours later.
- Past "Fix" is a Concern: The last time this occurred, the only way I could stop the flood of delayed emails was to turn off alerting for the device and then turn it back on. This is not a scalable or sustainable solution for a reliable monitoring platform.
Key Questions for the LogicMonitor Team
- What is causing this significant delay in alert processing and delivery? It appears the system is holding a large backlog of alerts and then releasing them all at once hours later.
- What is the recommended, official way to clear an alert backlog without having to resort to manually disabling and re-enabling alerting?
- Is there a known configuration or polling issue that would cause a single event (like a brief power loss) to generate dozens of unique critical alerts over a short period, and how can we consolidate these into a single, actionable notification?
Data for Review
- LogicMonitor Email Log (Image 1): Shows critical alerts arriving long after the issue was resolved (9:13 PM to 10:43 PM).
- Device Lifecycle Log (Image 2): Shows the power events (PSU0003, RDU0012) occurring and being resolved between 8:01 PM and 9:22 PM.
Any insight or official guidance on how to prevent this "alert tsunami" would be greatly appreciated. We rely on timely and accurate alerting, and this behavior significantly undermines that trust.