Forum Discussion

DanAvni
8 years ago

Anomaly detection

We have a Linux-based HTTP load balancer that has been monitored for a few months now. Yesterday a handful of customers called to say that our site was a bit slow. I saw nothing in the LM alerts, so I assumed it was a slow internet connection on the customer side or some other slowdown on our hosting facility's connection (so few customers complained that it did not seem to need further investigation). Then, about 5 hours after these calls, I suddenly got an LM alert that the CPU of the HTTP load balancer was too busy. The CPU graph for the load balancer showed usage at 50-90% for a few hours, starting when the customers complained (compared to about 10-20% at the same time the previous week). Because CPU usage kept moving up and down, LM did not trigger any alert for a few hours, so I did not know something was wrong until a few samples finally triggered the alert.

My suggestion: have LM detect when something is not behaving as it normally does (compared to the same period in previous weeks/months). When an anomaly is detected, it should be flagged with an anomaly color (I was thinking blue with a question mark icon), as it might be nothing but could also be the first sign of a problem. I differentiate this from a warning alert because a warning is based on a definite value, whereas this is just speculation that something is not working as it normally does.

  • I have requested this in other threads -- the fix is to enable evaluation of a condition over time, not just repeated over N samples.  It is much more operationally important to know the CPU has averaged 50% over an hour than to know it spiked to 80% for a few minutes, and as you say it takes only one "good" sample to be blind to what is going on.  A method that might be easier to implement is to require N out of M of the last checks to have failed, not just N in a row.

    Similarly, it would be useful to get alerts on predictable resource slopes so you can get a heads up N days prior to resource exhaustion.  This is at least addressed by forecast reports, but an alert that disk will be exhausted in a week on a volume would be much more useful in most cases.

     

    Regards,

    Mark
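
The two ideas in this reply (N failures out of the last M checks, and a heads-up before a resource runs out) can be sketched roughly as below. This is only an illustration in plain Python, not anything LM provides: `NOfMCheck`, `days_until_full`, and all thresholds are hypothetical names and values, and the forecast assumes usage grows roughly linearly.

```python
from collections import deque


class NOfMCheck:
    """Fires when at least n of the last m samples breach the threshold."""

    def __init__(self, threshold, n, m):
        if not 0 < n <= m:
            raise ValueError("need 0 < n <= m")
        self.threshold = threshold
        self.n = n
        # Rolling record of breach / no-breach for the last m samples.
        self.breaches = deque(maxlen=m)

    def add(self, value):
        """Record one sample and return True if the check should fire."""
        self.breaches.append(value > self.threshold)
        return sum(self.breaches) >= self.n


def days_until_full(samples, capacity):
    """Fit a least-squares line through (day, used) pairs and extrapolate.

    Returns the number of days after the last sample at which the fitted
    line reaches capacity, or None if usage is flat or shrinking.
    """
    count = len(samples)
    if count < 2:
        return None
    mean_x = sum(x for x, _ in samples) / count
    mean_y = sum(y for _, y in samples) / count
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0:
        return None
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / var_x
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_full = (capacity - intercept) / slope
    last_day = samples[-1][0]
    return max(0.0, day_full - last_day)
```

With a threshold of 50, n=3 and m=5, the samples 80, 20, 85, 30, 90 fire on the last sample even though no two breaches are consecutive, which is exactly the case an "N in a row" rule misses.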

  • Mark, your idea is also needed, but I do not think it's the same thing. Your average looks at the last 50 or 100 samples and can tell you that the average is high.

    My idea takes an average of the last N samples and compares it to the same period a week ago, two weeks ago ... X weeks ago, looking for an anomaly in the current average (e.g. on every Wednesday for the past 5 weeks the CPU is at 10% at night and around 30% during the day, but this Wednesday night the CPU is spiking to 90% and falling back to 50%, so something is different). This is not an alert because things might be OK: a new customer might have started working at night and is keeping the server busy, or something else might be. But at the same time this could be an indicator of a problem or a potential problem.
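
A minimal sketch of the comparison described above, assuming the samples are available as `(timestamp, value)` pairs. The function names, the `"anomaly"` / `"normal"` / `"unknown"` labels, and the fixed `tolerance` (in percentage points) are all made up for illustration; nothing here reflects how LM stores or evaluates data.

```python
from datetime import datetime, timedelta
from statistics import mean


def window_mean(series, end, length):
    """Mean of samples with timestamps in (end - length, end]; None if empty."""
    values = [v for t, v in series if end - length < t <= end]
    return mean(values) if values else None


def check_against_past_weeks(series, now, length, weeks, tolerance):
    """Compare the current window's average with the same window in past weeks.

    Returns "anomaly" when the current average differs from the average of
    the past weeks' windows by more than tolerance, "normal" when it does
    not, and "unknown" when there is not enough data to compare.
    """
    current = window_mean(series, now, length)
    past = [
        window_mean(series, now - timedelta(weeks=k), length)
        for k in range(1, weeks + 1)
    ]
    past = [p for p in past if p is not None]
    if current is None or not past:
        return "unknown"
    baseline = mean(past)
    return "anomaly" if abs(current - baseline) > tolerance else "normal"
```

The result is deliberately a label rather than an alert severity, matching the point that an anomaly is only a hint that something may have changed.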

  • True, that is different and also useful - sounds like what AppDynamics does with stddev monitoring.  Still need to check over a period of time, though :).

     

    Thanks,

    Mark
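
For what it's worth, a standard-deviation style check combined with "over a period of time" might look like the sketch below. This is a generic illustration and not a description of how AppDynamics or LM actually work; `deviates`, `sustained_deviation`, the factor `k`, and the `min_std` floor are assumptions chosen for the example.

```python
from statistics import mean, stdev


def deviates(current, past_values, k=3.0, min_std=1.0):
    """True when current is more than k standard deviations from the past mean.

    past_values holds the values seen in the same slot in previous weeks.
    min_std is a floor so a perfectly flat history does not flag tiny changes.
    """
    if len(past_values) < 2:
        return False  # not enough history to estimate a spread
    spread = max(stdev(past_values), min_std)
    return abs(current - mean(past_values)) > k * spread


def sustained_deviation(current_values, past_by_slot, n, k=3.0):
    """True when at least n of the recent slots deviate from their history.

    current_values[i] is compared against past_by_slot[i], so a single odd
    sample is not enough to flag an anomaly.
    """
    flags = [deviates(c, p, k) for c, p in zip(current_values, past_by_slot)]
    return sum(flags) >= n
```

Requiring several deviating slots keeps one noisy sample from raising the flag, while a single "good" sample in the middle does not hide a sustained change.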