Forum Discussion

B1llw
Neophyte
2 months ago

Alert Tsunami: Why the Huge Delay and Flood of Post-Resolution Power Alerts?


Hello LM Exchange community and LogicMonitor team,

We recently experienced an issue that's causing significant frustration and making our alerting system less reliable. We had a couple of anticipated power cable pull-outs (testing/maintenance), which were quickly resolved.

However, we then received a massive backlog of LogicMonitor alerts for this event hours after the issue was fixed and the system logs were clear.

The Problem

  1. Massive Alert Delay: The initial power loss events occurred and were resolved around 7:00 PM and 8:00 PM (based on the Lifecycle Log). However, we started getting a huge flood of critical alerts via email at 9:13 PM, 9:43 PM, 10:13 PM, and 10:43 PM—hours after the issue had been mitigated and redundancy was restored.
  2. Excessive Alert Volume: We received dozens of separate critical alerts (e.g., LME205086576, LME205086578, etc.) for a single, contained event, all arriving en masse hours later.
  3. Past "Fix" is a Concern: The last time this occurred, the only way I could stop the flood of delayed emails was to turn off alerting for the device and then turn it back on. This is not a scalable or sustainable solution for a reliable monitoring platform.

Key Questions for the LogicMonitor Team

  1. What is causing this significant delay in alert processing and delivery? It appears the system is holding a large backlog of alerts and then releasing them all at once hours later.
  2. What is the recommended, official way to clear an alert backlog without having to resort to manually disabling and re-enabling alerting?
  3. Is there a known configuration or polling issue that would cause a single event (like a brief power loss) to generate dozens of unique critical alerts over a short period, and how can we consolidate these into a single, actionable notification?

Data for Review

  • LogicMonitor Email Log (Image 1): Shows critical alerts arriving long after the issue was resolved (9:13 PM to 10:43 PM).
  • Device Lifecycle Log (Image 2): Shows the power events (PSU0003, RDU0012) occurring and being resolved between 8:01 PM and 9:22 PM.

Any insight or official guidance on how to prevent this "alert tsunami" would be greatly appreciated. We rely on timely and accurate alerting, and this behavior significantly undermines that trust.

4 Replies

  • Could you share what type of alert it is, Ping, SNMP or WMI, along with an example alert and the thresholds set at the folder, device, or module level?

  • I guess that is the SNMP iDRAC log. It does not seem to happen all the time.

     

  • Have you verified your org's email filter isn't the culprit? If there is a tsunami of alerts coming in, an email filter might start blocking all emails from a sender for additional sandboxing. You can verify the true sent time in the email headers. Whitelisting LM emails might fix this. If you're not allowed to whitelist LogicMonitor with your email filter, you could also look into cluster alerting and disabling individual alert notifications.
    https://www.logicmonitor.com/support/cluster-alerts#h-managing-cluster-alerts
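
    For anyone wanting to check the headers as suggested above, here is a minimal Python sketch. It assumes you have saved the raw source of one delayed alert email (for example as an .eml file) and that every Received header ends with a parseable timestamp that includes a timezone offset. The function name hop_delays is made up for illustration and is not part of LogicMonitor or any mail tool.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def hop_delays(raw_message: str):
    """Return (sent_time, hops), where hops is a list of
    (hop_time, seconds_after_sent) tuples, oldest relay first."""
    msg = message_from_string(raw_message)

    # The Date header is set by the sending system when the mail is created.
    sent = parsedate_to_datetime(msg["Date"])

    hops = []
    # Each relay prepends its own Received header, so the last one in the
    # message is the oldest. Reverse to walk the path in delivery order.
    for received in reversed(msg.get_all("Received", [])):
        # The timestamp is the part after the final ';' of the header.
        stamp = received.rsplit(";", 1)[-1].strip()
        hop_time = parsedate_to_datetime(stamp)
        hops.append((hop_time, (hop_time - sent).total_seconds()))
    return sent, hops
```

    If the first hop is only seconds after the Date value but a later hop is hours after it, the delay most likely happened at that later relay or filter rather than at the sender. If the Date value itself is hours after the event, the email was generated late and the mail path is not the cause.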

     

  • If you're using LM's default SMTP, test with your own mail relay to avoid queue delays. Adding throttling can help avoid queue delays as well.