DEVIATION FROM ROLLING AVERAGE

I see a need in the design to alert on deviation from a rolling average.

Example 1: Temperature alerting in hardware is based on a fixed baseline (default or manually adjusted) or on a fixed delta. In real-world use it would make a LOT more sense to alert on deviation from a 5-day or 30-day rolling average temperature of the box. The reason is that units alarm on the weekends because the office shuts off the AC during the summer, or they alert during the week from 9 to 5 because in the winter the offices crank the heat. All of these ignore the nuance of RANGE and average expectation for the location. The alerting should just be based on how FAR outside the average range for the site the reading is.

My Nashville facility hovers from 56 to 59 all week. I have it set on 57, so I get alerts at least once a weekend. I could move it to 59, but that's a band-aid. The REAL solution would be to have the software TRACK the last 30 days and alert when we're outside the NORM for that location.

Furthermore, with hardware it is not the specific temps that kill the hardware; it's the RATE at which the temp changes. So the alerts SHOULD be based on the average range the system has seen in the last 30 days, and alert ONLY when the rate of change accelerates. I imagine THAT request would be more challenging to reduce to an algorithm.

Example 2: PING times. I have sites where the latency range is EXTREME (Mumbai, Johannesburg, Taipei, etc.). I wish the ping check would track the 30-day range and the common deviation from the norm, and alert when a site sees latency that is way outside the expected fluctuation range. Say a site sees 30 ms typically 90% of the time, plus 200-500 ms spikes 10% of the time. When ping times hit 300 ms for more than 10% of the last hour of sampling, then send a warning to inform of the change in TREND, rather than applying a fixed threshold to the immediate sample.

Re: Syslog "Cleared" = MEANINGLESS

Treating all events equally is a very bad policy.
A dropped ping is very different from a syslog message about a parity failure in a RAID or a DBReplication failure alert. Note that in MOST (if not all) situations where we use syslog, the conditions reported would NEVER clear at all without a tech going in and doing work. The software sending a "cleared" email is the opposite of crying wolf: it's crying "all clear" when the wolf is right there chewing on your leg.

I'm very frustrated that I have to explain this twice. The customer that suggested it doesn't understand what syslog "is". The engineers that honored the flawed request did that customer, and the rest of us, a grave disservice. You've forced me to train my people to ignore certain messages.

The workaround we have now is to filter all your Cleared messages. Still, not sure why you didn't lead with that advice. You can't convince me to change my opinion on this topic: it was a poor design decision you've made here. In time, I am very certain you will see why I am right.

Syslog Timestamps and RFC's

Syslog issues:

1. Being bound to only the two RFCs for syslog is nearsighted: syslog timestamp and formatting support should be more flexible.

2. The biggest concern I have is that syslog should reflect the timestamp of the COLLECTOR'S NIC at the time the syslog packet ARRIVES at the collector, not the timestamp of the system sending the message. This is especially important with systems where clock settings or NTP are currently failing. Alerting is based on the timestamp: if the timestamp says Jan 1st 2001 12:01am because the CMOS battery on the unit failed, then we NEVER see those syslog messages, because they fall outside the alerting range.

Syslog "Cleared" = MEANINGLESS

Syslog Issues:

#1. The person who asked to have SYSLOG present a "cleared" message CLEARLY does not understand that a SYSLOG is NOT a tracked condition like an OID value is. It is a SINGLE SPOT in time, an event that "happened" and does NOT "clear", as you can't change the past.

#2.
The programmers HONORING that (deeply flawed) request frustrates me to no end. Team, I get the mantra "the customer is always right", except when they're wrong it is in EVERYONE's best interest if you retrain the unskilled users in what a baseline understanding should be. I have no tolerance for bad design making it into development when people should know better.

#3. You should have provided those of us who know better a way to OPT OUT of these bad design decisions.

Re: "No Data" alerts

Check to see if the SNMP "Host Agent" was running at all during that time, or if the system was running DRF at the time of the problem. When dealing with CUCM, one constant I see is that DRF stops and starts services at will, so all alerting should be disabled during DRF scheduling.

- M

Re: No data threshold

I have another post elsewhere additionally requesting this. GLAD to see great minds think alike.

Re: "Filter" Integration

I'd like to see the "Time Range" selection directly affect which ALERTS are shown, what "Raw Data" shows, and what the "Graph" shows, in a unified way.

- M

Re: Custom Ping Intervals

I have a number of events with lost ping, where the data source shows that ping is failing, but it happens WHILE I'm on both the collector and the end-target host, running manual pings between them with no drops at all. The only fix I've found is to reboot the collector that has the issue.

TraceRoute / TraceRt

Is there a pre-packaged data source or service that runs constant TraceRt between collector and host, and graphs the results?

Uptime - Bug Report

I'd like to re-initiate this bug report: the Uptime counter resetting at 497 days or 469 days (historical). I just had a similar false alarm telling me that my devices rebooted, when they did not.
Please have the DEV team review this specific monitor and determine how the system can display 497+ days of "uptime".

---------------------------------

SEP 11, 2015 | 01:56PM CDT
Original message
________________ wrote:

Support team at logic monitor,

Is it possible to request an adjustment to the "Uptime" data source monitor so that it does not alert when the counter resets from 11111111111111111111111111111111 to 00000000000000000000000000000001?

The developer was aware enough of the event's cause to code an explanation into the system alert. Could the alert be altered to not alert when the counter resets?

- ________________

From: ________________
Sent: Friday, September 11, 2015 2:44 PM
To:
Subject: SC# Error: 6348 ________________ is reporting it has only been up for 0.43 minutes

Hello ________________,

We have received the following monitoring alert and a ticket #6348 has been created to track your issue. An engineer is assigned and is working to resolve this issue. Thank you.

We are investigating if the VM really did reboot or if this alert is coming up for a different reason:

________________ is reporting it has only been up for 0.43 minutes, as of 2015-09-11 14:28:48 EDT. If this was an unexpected reboot, please investigate the system logs. NOTE: if ________________ has been up for 469 days without a reboot, this alert will trigger due to a counter wrap in the host. In this case, you may disregard this alert. (But the host is probably due for an OS update.)

For any inquiries please contact our NOC at support@highpoint.com or call 1-855-485-8324 (TECH).

Regards,
________________
NOC Support Engineer
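To illustrate the request in this bug report, here is a rough sketch of how a poller might tell a counter wrap apart from a real reboot. It assumes the uptime value is a 32-bit counter in hundredths of a second (which is how SNMP sysUpTime TimeTicks behave, and which gives the 497-day figure). The function names and the tolerance value are hypothetical and are not taken from the product.

```python
WRAP_TICKS = 2 ** 32        # assumed: 32-bit counter
TICKS_PER_SECOND = 100      # assumed: counter ticks every 1/100 s


def wrap_period_days():
    """Days until a 32-bit hundredths-of-a-second counter wraps."""
    return WRAP_TICKS / TICKS_PER_SECOND / 86400


def classify_uptime_drop(prev_ticks, curr_ticks, elapsed_seconds,
                         tolerance_seconds=120):
    """Classify two consecutive uptime samples as 'ok', 'wrap', or 'reboot'.

    elapsed_seconds is the collector's own wall-clock time between polls.
    tolerance_seconds (hypothetical default) absorbs polling jitter.
    """
    if curr_ticks >= prev_ticks:
        return "ok"
    # If the counter merely wrapped, the previous value plus the elapsed
    # time, taken modulo 2^32, should land close to the current value.
    expected = (prev_ticks + int(elapsed_seconds * TICKS_PER_SECOND)) % WRAP_TICKS
    diff = abs(expected - curr_ticks)
    diff = min(diff, WRAP_TICKS - diff)   # distance around the wrap point
    if diff <= tolerance_seconds * TICKS_PER_SECOND:
        return "wrap"
    return "reboot"


if __name__ == "__main__":
    print(round(wrap_period_days(), 1))
    # Last poll 30 s before the wrap, next poll 60 s later: a wrap, not a reboot.
    print(classify_uptime_drop(2 ** 32 - 3000, 3000, 60))
    # Counter was nowhere near the wrap point: treat as a real reboot.
    print(classify_uptime_drop(1_000_000, 2580, 60))
```

One limitation of this approach: a genuine reboot that happens within the tolerance window of the wrap point would be classified as a wrap, so the tolerance should stay small relative to the polling interval.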
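Going back to the DEVIATION FROM ROLLING AVERAGE post, the two examples there could be sketched roughly as follows. This is only an illustration under assumptions: the class and function names, the multiplier `k`, the `min_spread` floor, and the window sizes are all hypothetical, and the window is counted in samples (the caller decides how many samples make up 30 days).

```python
from collections import deque
from statistics import mean, pstdev


class RollingBaseline:
    """Alert when a reading falls far outside the recent norm for a site."""

    def __init__(self, window, k=3.0, min_samples=10, min_spread=0.5):
        self.samples = deque(maxlen=window)  # oldest samples drop off automatically
        self.k = k                           # allowed deviations from the average
        self.min_samples = min_samples       # stay quiet until there is some history
        self.min_spread = min_spread         # floor so a flat history is not hypersensitive

    def check(self, value):
        """Return True if value deviates from the rolling norm, then record it."""
        alert = False
        if len(self.samples) >= self.min_samples:
            avg = mean(self.samples)
            spread = max(pstdev(self.samples), self.min_spread)
            alert = abs(value - avg) > self.k * spread
        self.samples.append(value)
        return alert


def trend_warning(last_hour_ms, limit_ms=300, max_fraction=0.10):
    """Example 2: warn when more than max_fraction of samples exceed limit_ms."""
    samples = list(last_hour_ms)
    if not samples:
        return False
    over = sum(1 for s in samples if s > limit_ms)
    return over / len(samples) > max_fraction


if __name__ == "__main__":
    site = RollingBaseline(window=100)
    for temp in [56, 57, 58, 59] * 5:      # normal weekly swing for the site
        site.check(temp)
    print(site.check(59))                  # inside the usual range
    print(site.check(70))                  # far outside the usual range
    print(trend_warning([30] * 51 + [350] * 9))
```

With this kind of baseline, the Nashville readings of 56 to 59 would define their own normal band, so a weekend reading of 59 stays quiet while a jump well beyond the band alerts. The rate-of-change idea from the post is not covered here.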