8 years ago
Anomaly detection
We have a Linux-based HTTP load balancer that has been monitored for a few months now. Yesterday we got a call from a few customers saying that our site was a bit slow. Looking on LM alerts I sa...
I have requested this in other threads -- the fix is to enable evaluation of a condition over a period of time, not just over N consecutive samples. It is much more operationally important to know the CPU has averaged 50% over an hour than to know it spiked to 80% for a few minutes, and as you say, a single "good" sample is enough to leave you blind to what is going on. A method that might be easier to implement is to require N out of the last M checks to have failed, not just N in a row.
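To make the difference concrete, here is a rough sketch in Python of the two rules side by side. This is not how LM evaluates alerts; the function names, the threshold and the sample values are all made up for illustration.

```python
def consecutive_breached(samples, threshold, n):
    """'N in a row' rule: True only if the last n samples all exceed the threshold."""
    recent = samples[-n:]
    return len(recent) == n and all(v > threshold for v in recent)


def n_of_m_breached(samples, threshold, n, m):
    """'N out of M' rule: True if at least n of the last m samples exceed the threshold."""
    recent = samples[-m:]
    return sum(1 for v in recent if v > threshold) >= n


def window_mean(samples, m):
    """Average of the last m samples, for alerting on a sustained level."""
    recent = samples[-m:]
    return sum(recent) / len(recent)


# Hypothetical CPU readings (%): one "good" sample in the middle of a bad run.
cpu = [85, 90, 40, 88, 92]

in_a_row = consecutive_breached(cpu, threshold=80, n=3)   # last 3 include the 40
n_of_m = n_of_m_breached(cpu, threshold=80, n=3, m=5)     # 4 of the last 5 breach
average = window_mean(cpu, m=5)
```

With these readings the "in a row" rule stays quiet because of the single 40, while the N-out-of-M rule fires and the windowed average still shows a sustained high load.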
Similarly, it would be useful to get alerts on predictable resource slopes so you can get a heads-up N days prior to resource exhaustion. This is at least addressed by forecast reports, but an alert that a volume will run out of disk space in a week would be much more useful in most cases.
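As a sketch of what I mean by a slope alert, assuming a simple straight-line (least-squares) fit over recent daily usage samples. The function name, the sample data and the 7-day lead time are hypothetical, and I am not claiming the forecast reports work this way.

```python
def days_until_full(days, used_pct, capacity=100.0):
    """Fit a straight line to (day, used %) samples and estimate how many
    days after the last sample usage reaches capacity.
    Returns None if there is too little data or usage is not growing."""
    n = len(days)
    if n < 2:
        return None
    mean_x = sum(days) / n
    mean_y = sum(used_pct) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    if sxx == 0:
        return None  # all samples taken at the same time
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, used_pct)) / sxx
    if slope <= 0:
        return None  # flat or shrinking usage, nothing to forecast
    intercept = mean_y - slope * mean_x
    full_at = (capacity - intercept) / slope
    return max(0.0, full_at - days[-1])


# Hypothetical volume growing 2% per day, currently at 78%.
remaining = days_until_full([0, 1, 2, 3, 4], [70, 72, 74, 76, 78])
should_alert = remaining is not None and remaining <= 7  # 7-day lead time
```

A real implementation would probably want to guard against noisy or seasonal data, but even a crude fit like this would give an early warning on steadily filling volumes.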
Regards,
Mark