Alert Tsunami: Why the Huge Delay and Flood of Post-Resolution Power Alerts?
Hello LM Exchange community and LogicMonitor team,

We recently experienced an issue that's causing significant frustration and making our alerting system less reliable. We had a couple of anticipated power cable pull-outs (testing/maintenance), which were quickly resolved. However, we then received a massive backlog of LogicMonitor alerts for this event hours after the issue was fixed and the system logs were clear.

The Problem

Massive Alert Delay: The initial power loss events occurred and were resolved around 7:00 PM and 8:00 PM (based on the Lifecycle Log). However, we started getting a huge flood of critical alerts via email at 9:13 PM, 9:43 PM, 10:13 PM, and 10:43 PM, hours after the issue had been mitigated and redundancy was restored.

Excessive Alert Volume: We received dozens of separate critical alerts (e.g., LME205086576, LME205086578, etc.) for a single, contained event, all arriving en masse hours later.

Past "Fix" is a Concern: The last time this occurred, the only way I could stop the flood of delayed emails was to turn off alerting for the device and then turn it back on. This is not a scalable or sustainable solution for a reliable monitoring platform.

Key Questions for the LogicMonitor Team

What is causing this significant delay in alert processing and delivery? It appears the system is holding a large backlog of alerts and then releasing them all at once hours later.

What is the recommended, official way to clear an alert backlog without having to resort to manually disabling and re-enabling alerting?

Is there a known configuration or polling issue that would cause a single event (like a brief power loss) to generate dozens of unique critical alerts over a short period, and how can we consolidate these into a single, actionable notification?
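For anyone comparing notes, the spacing of the delayed emails can be checked directly. The following is a minimal Python sketch that uses only the four delivery times quoted above; the function name is made up for illustration, and the date is left at the parser's default because the post does not state one.

```python
from datetime import datetime

# Email delivery times quoted in the post. No date is given there, so
# strptime's default date is used; only the gaps between times matter.
EMAIL_TIMES = ["9:13 PM", "9:43 PM", "10:13 PM", "10:43 PM"]


def gaps_in_minutes(times):
    """Return the gap in minutes between each consecutive pair of clock times."""
    parsed = [datetime.strptime(t, "%I:%M %p") for t in times]
    return [int((b - a).total_seconds() // 60) for a, b in zip(parsed, parsed[1:])]


print(gaps_in_minutes(EMAIL_TIMES))  # [30, 30, 30]
```

The batches are evenly spaced at 30 minutes. That may suggest a fixed re-send or re-evaluation interval rather than random delivery lag, but which setting (if any) produces it is not something the data in this post establishes.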
Data for Review

LogicMonitor Email Log (Image 1): Shows critical alerts arriving long after the issue was resolved (9:13 PM to 10:43 PM).

Device Lifecycle Log (Image 2): Shows the power events (PSU0003, RDU0012) occurring and being resolved between 8:01 PM and 9:22 PM.

Any insight or official guidance on how to prevent this "alert tsunami" would be greatly appreciated. We rely on timely and accurate alerting, and this behavior significantly undermines that trust.

Does anyone monitor their iDracs with LM? Tons of duplicate alerts
Hi,

I just added some of our iDracs into LM. While the monitoring seems to be working fine, the alerting system seems a bit wacky. I have a server with a bad CMOS battery. This causes me to get 4 alerts from LM for the iDrac:

The 3rd one is for the battery. The other three are "System" alerts that will seemingly fire any time anything has a problem. So it looks like any time something has a problem, I'll get 4 alerts. I don't want that. ;)

I'm wondering if I should just disable all the "System" and "Chassis" alerts so I just get the alerts for the actual component that's having the problem. Anyone else run into this?

Thanks.
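One alternative to disabling the "System" and "Chassis" alerts outright is to drop them only when a component-level alert already exists for the same device. The sketch below illustrates that idea in plain Python on already-exported alert records. It is not LogicMonitor's API: the record fields (`device`, `source`), the keyword list, and the sample alert names are all hypothetical and would need to match whatever your alert export actually contains.

```python
# Assumption: roll-up alerts can be recognised by these words in the source name.
ROLLUP_KEYWORDS = ("system", "chassis")


def is_rollup(alert):
    """True if the alert looks like a generic roll-up rather than a component alert."""
    name = alert["source"].lower()
    return any(keyword in name for keyword in ROLLUP_KEYWORDS)


def drop_redundant_rollups(alerts):
    """Remove roll-up alerts for devices that also have a component-level alert.

    Roll-up alerts are kept when they are the only alerts for a device, so a
    fault with no matching component check still produces a notification.
    """
    devices_with_component = {a["device"] for a in alerts if not is_rollup(a)}
    return [
        a for a in alerts
        if not (is_rollup(a) and a["device"] in devices_with_component)
    ]


# Hypothetical sample mirroring the post: one battery alert plus three roll-ups.
sample = [
    {"device": "idrac-01", "source": "System Status"},
    {"device": "idrac-01", "source": "System Health"},
    {"device": "idrac-01", "source": "CMOS Battery"},
    {"device": "idrac-01", "source": "System Rollup"},
]

for alert in drop_redundant_rollups(sample):
    print(alert["device"], alert["source"])  # idrac-01 CMOS Battery
```

The trade-off compared with disabling the roll-up alerts entirely is that a device whose only symptom is a "System" or "Chassis" alert still notifies.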