
DanN
15 days ago

Ideas for devices generating a lot of alerts

I'm looking to be more proactive about the alerts occurring in my portal. I was hoping to create some logic where, if a device generates a certain number of warnings over the course of seven days, an error alert is generated. Then if, say, 3 errors are generated over a seven-day period, a critical alert is generated.

A way to bubble up alerts, so to speak, so I can focus on devices generating more than a normal number of alerts and begin to determine what's causing the issues. The only idea I came across was creating a script that runs an API query against the device's alert history, but the issue with this is that the alert will not clear for seven days, since it's based on the last seven days of alert history.

Any ideas how this could be created?
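
To make the escalation idea above concrete, here is a minimal Python sketch of just the counting logic, independent of how the alert history is fetched. Everything in it is an assumption for illustration: the function name `escalated_severity`, the `(timestamp, severity)` tuple shape, and the warning threshold of 10 (the post only says "a certain number"). It counts raw warnings and errors only, and it does not solve the clearing problem described above: the result stays elevated until enough alerts age out of the seven-day window.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)  # look-back period from the post


def escalated_severity(alerts, now, warn_threshold=10, error_threshold=3):
    """Return 'critical', 'error', or None for one device.

    alerts: iterable of (timestamp, severity) tuples, where severity is
    'warning' or 'error'. Thresholds are hypothetical defaults.
    """
    cutoff = now - WINDOW
    # Keep only alerts that fall inside the rolling seven-day window.
    recent = [sev for ts, sev in alerts if cutoff <= ts <= now]
    if recent.count("error") >= error_threshold:
        return "critical"
    if recent.count("warning") >= warn_threshold:
        return "error"
    return None


# Made-up history: three errors in the last three days.
now = datetime(2024, 1, 8, 12, 0)
history = [(now - timedelta(days=d), "error") for d in (1, 2, 3)]
print(escalated_severity(history, now))  # critical
```

Because the window is rolling, re-running the same check eight days later with no new alerts returns None, which is the slow clearing behaviour mentioned in the post.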

  • I think LM's solution would be to sell you Dexda, although I haven't used it myself, and last I checked it was limited to ServiceNow integration.

    For us, all alerts go into a ticketing system, so in general we would run reports through the ticketing system looking for that kind of thing, but that is more of a manual process.

    Regarding the issue with using the API to query alert history: that sounds doable to me, but I would implement something like that as an EventSource rather than as a DataSource. EventSources automatically clear themselves after a time you set.

  • If written as a datasource, counts of events per timeframe can also self-clear, and you end up with a historical graph of "normal" to boot. That can help identify cadences to the issues if there are any: a spike at 3am every morning, etc. I use a few "X events per 5 minutes" DataSources I've made to track quantity aberrations in logs, specifically Security:4625 with all its glorious substatuses. That can show graphically when a service account has failed... and when and where brute force attacks are happening in a Windows environment.

    • DanN

      Is there a datasource example in LM that you would recommend using to replicate your idea?

  • Not specifically... some of the collector script amount data is similar to them... I'm using Get-WinEvent in PowerShell to filter on LogName=Security, ID=4625, StartTime=(Get-Date).AddMinutes(-5)

    Then set the collection frequency to 5 minutes (matching the AddMinutes value). For 15 minutes, change that to -15; for 30 minutes, -30.

    I grab those remotely, then pull the results back to the collector and do follow-up processing there to save load on the customer's production environment.

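    # "006A" matches substatus 0xC000006A (wrong password) in the event message
    $badPass = $events | Where-Object Message -match "006A"

    # Emit key=value; @() keeps the count correct for zero or one match
    Write-Output "badpass=$(@($badPass).Count)"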

    Pick that up in a datapoint and you're good to go. Basically, grab a set of things from a timeframe that matches the collection frequency, then count them and pass that number to a datapoint using key/value interpretation.
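
The "count per collection interval" pattern described in this reply can be sketched in a few lines. The following is a Python stand-in for the logic only, not the poster's PowerShell script: the function name `count_recent_matches`, the `(timestamp, message)` tuple shape, and the sample events are all made up for illustration. It keeps events no older than the collection interval, counts those whose message contains the substatus fragment (case-insensitively, like `-match`), and prints one key=value line.

```python
from datetime import datetime, timedelta


def count_recent_matches(events, now, interval_minutes=5, needle="006A"):
    """Count events inside the last interval whose message contains needle.

    events: iterable of (timestamp, message) tuples (hypothetical shape).
    interval_minutes should equal the collection frequency so that
    consecutive polls neither overlap nor leave gaps.
    """
    cutoff = now - timedelta(minutes=interval_minutes)
    needle = needle.lower()
    return sum(
        1
        for ts, message in events
        if cutoff <= ts <= now and needle in message.lower()
    )


# Made-up sample: one match in the window, one other substatus, one too old.
now = datetime(2024, 1, 8, 3, 0)
sample_events = [
    (now - timedelta(minutes=1), "Sub Status: 0xC000006A"),
    (now - timedelta(minutes=4), "Sub Status: 0xC0000064"),
    (now - timedelta(minutes=9), "Sub Status: 0xC000006A"),
]
line = f"badpass={count_recent_matches(sample_events, now)}"
print(line)  # badpass=1
```

Because each poll looks back only one interval, the count falls to zero on the next poll once the events stop, which is why an alert on this kind of datapoint can clear on its own.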