Forum Discussion

stuart_vassey
7 years ago

Ability to apply retroactive SDT-like windows to adjust SLA numbers

When a monitor malfunctions (e.g., a service test fails because the monitored site's HTML was updated), the uptime for that test will not reflect the site's actual uptime. We'd like to be able to apply retroactive SDT-like windows that would prevent an alert period from counting against a test's uptime. I thought this was already possible by applying retroactive SDTs, but apparently that doesn't actually work.

Without this feature, a dashboard SLA widget might report a service/device as up 70% of the time when it was really up 90% of the time, with the missing 20% caused by a faulty test/datasource rather than a real outage. We should be able to apply an SDT to keep that 20% from negatively impacting the uptime.
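To make the arithmetic concrete, here is a minimal sketch of how an excluded window could change the reported number. The function name and its parameters are made up for illustration, and it is an assumption (not something confirmed about LogicMonitor) which of the two treatments the SLA calculation would use: dropping the excluded time from the measured period, or counting it as "up".

```python
def adjusted_uptime(total_minutes, alert_minutes, excluded_minutes,
                    drop_from_period=True):
    """Return uptime percent after discounting an excluded alert window.

    Hypothetical helper, not a LogicMonitor API.
    total_minutes    -- length of the reporting period
    alert_minutes    -- time the test was in alert (real or false)
    excluded_minutes -- portion of the alert time judged to be a faulty test
    drop_from_period -- True: remove excluded time from the period entirely;
                        False: keep the period length and count it as up
    """
    if not 0 <= excluded_minutes <= alert_minutes <= total_minutes:
        raise ValueError("expected 0 <= excluded <= alert <= total")
    remaining_alert = alert_minutes - excluded_minutes
    counted = total_minutes - excluded_minutes if drop_from_period else total_minutes
    if counted == 0:
        raise ValueError("nothing left to measure after exclusions")
    return 100.0 * (counted - remaining_alert) / counted


# Period of 1000 minutes, 300 in alert, 200 of those due to a broken test.
print(adjusted_uptime(1000, 300, 0))                            # 70.0
print(adjusted_uptime(1000, 300, 200, drop_from_period=False))  # 90.0
print(adjusted_uptime(1000, 300, 200, drop_from_period=True))   # 87.5
```

Note that the 90% figure only comes out if the excluded time is counted as up. If the excluded time is dropped from the period instead, the same data gives 87.5%, so the two treatments are worth distinguishing when asking for this feature.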

  • I would like this functionality to also cover the case where a new data point is added in the middle of a month: there is no data for the time before the sensor existed, and all of that gets counted against our SLA.

    The only options we have now are:

    • Exclude the device from the report, which means next month we have to remember to go back and re-add the device to the report
    • Set missing data to not count against the SLA, which makes the SLA completely invalid, because a device that is down is also missing data

     

  • In the real world, in every organisation that I've worked in, there has been a requirement to retrospectively adjust SLAs as sometimes root cause analysis can take weeks or months in complex environments.  

    Rather than tweak SDT, could we have another type of entry that we could apply in a very similar fashion, and would be taken into account by the SLA algorithm in the same way it takes SDTs into account?

  • This would also be useful for scenarios where the data collection malfunctioned, but it was an issue on the Collector side rather than the device side.  Being able to correct SLAs for these devices by applying a retroactive SDT would help keep those SLA numbers accurate.

  • Hey All,

    We have had the request for retroactive SDT in the past.  LogicMonitor doesn't plan on adding this feature because allowing retroactive entries kinda takes the S out of SDT.  The intention is to have those entries scheduled ahead of time so that alerts are suppressed and SLA numbers are accurate based on previously scheduled downtime entries.

    We will look into iqmsjoel's comment about mid-month data vs. data that is missing but was actually polled for.

    ~Forrest

  • Forrest, when SLAs are reported as un-met, but it wasn't the service owner's fault, there needs to be a way to not penalize them. For example:

    1) When the LogicMonitor platform has an outage (I've seen this happen and created tickets)

    2) When a new EA collector crashes or has a bug with script execution (I've seen this happen and created tickets)

    3) When the monitoring team accidentally breaks a datasource, which mistakenly marks a service down (this is not uncommon)

    In all of these cases, SLAs would be inaccurate and if the monitoring team has no way of applying retroactive adjustments, we'll have to explain to the business that "this month's numbers aren't valid and can't be used for business purposes." That's not the story we want to tell about LogicMonitor since it hurts the credibility of the platform. I hope you see that this is different than just forgetting to schedule a maintenance window. If LogicMonitor wants to become more business-facing, this is a necessary feature.

    Thanks,

    Stuart

  • On 8/31/2018 at 4:25 AM, Mosh said:

    In the real world, in every organisation that I've worked in, there has been a requirement to retrospectively adjust SLAs as sometimes root cause analysis can take weeks or months in complex environments.  

    Rather than tweak SDT, could we have another type of entry that we could apply in a very similar fashion, and would be taken into account by the SLA algorithm in the same way it takes SDTs into account?

     

    Totally agree, Mosh.  We want to use LM as a business reporting tool, but it's extraordinarily labor-intensive to export all of the outage events, remove erroneous entries and recalculate the SLA for each service, each month.  Another type of entry (EDT: explained/excluded downtime) that could be applied retrospectively would mean that pulling SLA data at EOM returns business-accurate data through the built-in LM API/dashboards/custom reporting, rather than us recalculating everything manually.
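For anyone stuck doing the manual recalculation in the meantime, here is a rough sketch of what an EDT-style adjustment could look like once the outage events have been exported. All of the names below are hypothetical and nothing here calls the LM API. It assumes outages and exclusions are plain (start, end) datetime pairs, and that excluded time is removed from the measured period rather than counted as up. The same exclusion mechanism can cover the mid-month case by excluding the span before a data point existed.

```python
from datetime import datetime


def merge(intervals):
    """Merge overlapping (start, end) pairs into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def clip(intervals, lo, hi):
    """Trim intervals to the period [lo, hi) and drop empty leftovers."""
    clipped = [(max(s, lo), min(e, hi)) for s, e in intervals]
    return [(s, e) for s, e in clipped if s < e]


def subtract(intervals, cuts):
    """Remove the cut windows (sorted, disjoint) from each interval."""
    result = []
    for start, end in intervals:
        cursor = start
        for cut_start, cut_end in cuts:
            if cut_end <= cursor or cut_start >= end:
                continue  # this cut does not touch what is left
            if cut_start > cursor:
                result.append((cursor, cut_start))
            cursor = max(cursor, cut_end)
        if cursor < end:
            result.append((cursor, end))
    return result


def seconds(intervals):
    """Total length of the intervals in seconds."""
    return sum((e - s).total_seconds() for s, e in intervals)


def sla_percent(period_start, period_end, outages, exclusions=()):
    """Uptime percent for the period, ignoring time inside exclusions.

    Returns None when the exclusions cover the whole period.
    """
    outages = merge(clip(outages, period_start, period_end))
    exclusions = merge(clip(exclusions, period_start, period_end))
    counted = seconds([(period_start, period_end)]) - seconds(exclusions)
    if counted <= 0:
        return None
    down = seconds(subtract(outages, exclusions))
    return 100.0 * (counted - down) / counted


# Made-up month: one real 12-hour outage, one 48-hour false alert
# caused by a broken datasource.
start, end = datetime(2018, 9, 1), datetime(2018, 10, 1)
real_outage = (datetime(2018, 9, 10, 0), datetime(2018, 9, 10, 12))
false_alert = (datetime(2018, 9, 20), datetime(2018, 9, 22))

raw = sla_percent(start, end, [real_outage, false_alert])
adjusted = sla_percent(start, end, [real_outage, false_alert], [false_alert])
print(round(raw, 2), round(adjusted, 2))  # 91.67 98.21
```

The real outage still counts in both numbers. Only the window marked as an exclusion is taken out, which is the behavior an EDT-style entry would need if the adjusted figure is going to be defensible to the business.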