Forum Discussion

Matt_Gauthier's avatar
7 years ago

Ad-hoc script running

Often when an alert pops up, I find myself running some very common troubleshooting tools to quickly gather more info.  It would be nice to get that info quickly and easily without having to go to other tools when an alert occurs.  For example, right now when we get a high CPU alert the first thing I do is run pslist -s \\computername (PsTools are so awesome) and psloggedon \\computername to see who's logged in at the moment.
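
As a rough sketch, here's what that quick triage looks like wrapped into one little PowerShell helper (the function name is made up, and it assumes pslist.exe and psloggedon.exe from the Sysinternals suite are somewhere on the PATH):

    # Hypothetical helper: run the usual "first look" PsTools checks against one host.
    # Assumes the Sysinternals pslist.exe and psloggedon.exe are available on the PATH.
    function Get-QuickTriage {
        param(
            [Parameter(Mandatory)][string]$ComputerName
        )

        Write-Host "=== Processes (task-manager view) on $ComputerName ==="
        & pslist.exe -s "\\$ComputerName"        # -s = task-manager (refreshing) mode

        Write-Host "=== Logged-on users on $ComputerName ==="
        & psloggedon.exe "\\$ComputerName"
    }

    # Usage: Get-QuickTriage -ComputerName SERVER01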

I know it's possible to create a datasource to discover all active processes and retrieve CPU/memory/disk metrics for a given process, but the processes on a server can change pretty often, so you'd have to run active discovery frequently.  That just doesn't seem like the best approach; most of the time I don't care what's running on the server and only need to know "in the moment."

A way to run a script via a button for a given datasource would be a really cool feature.  Maybe the datasource could hold a "gather additional data" or metadata script that could then be invoked manually on an alert or datasource instance; e.g., when an alert occurs, you could click a "gather additional data" button in the alert, which would run the script and show the output in a small box or window.  The ability to run it periodically (every 15 seconds, every 5 minutes, etc.) would also be useful.  This would give a NOC the ability to troubleshoot a bit more or add some context around an alert without everyone having to know a bunch of tools or have administrative access to a server.
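
Until something like that exists, the closest I get is just re-running the helper above in a loop, e.g. every 15 seconds while I'm investigating (Get-QuickTriage is the hypothetical function from the sketch above):

    # Poll the hypothetical triage helper every 15 seconds, 20 times (about 5 minutes).
    1..20 | ForEach-Object {
        Get-QuickTriage -ComputerName SERVER01
        Start-Sleep -Seconds 15
    }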

  • In my opinion, an Ops Note (set specifically for that device) or a Note on the alert itself would be great.  One thing I do like about the Ops Note is that it becomes visible on all of the graphs for the device, which should coincide with the alert trigger time.  The thing I like about putting a note on the alert is that it would capture the data at the time the alert is generated and in the context of that alert.

    I don't really like the idea of creating a separate DataSource/ConfigSource/PropertySource for this - if your datasource has to be configured separately (e.g., for thresholds) from the one generating the alert, then you're not capturing the point in time when the alert or alerting condition was triggered.  How do you handle custom thresholds?  Would you be putting in an Ops Note for as long as the condition exists, as often as the datasource runs?

    There are ways to accomplish this in a customized way, but it would be so much easier and more helpful if LogicMonitor could automatically trigger a "post-alert action" or something.  There could be 'canned' actions (like get processes/users/memory for high CPU alerts) and ones customized via Groovy/PowerShell integrations.

    Just my 2¢

  • If a complex datapoint could inject this info into the alert template, now that'd be awesome :)

  • @mnagel that's a good point. Ops Notes would be a great place to record that an alert was generated and which processes were running at the time.  I pieced together a small PowerShell script that uses the perfmon counters (so we get the % without any additional calculations).

    You could convert this to Groovy with a WMI query and use a complex datapoint with a Groovy calculation: if the CPU datapoint exceeds some value, calculate the per-process numbers and post an Ops Note to the device (a rough REST sketch of that posting step follows the script below).

    Obviously this would need to be adjusted to fit LM wildcards and parameters.

    # Prompt for a credential to use for the remote WMI queries
    $Cred = (Get-Credential) # or: New-Object -TypeName System.Management.Automation.PSCredential -ArgumentList $User, $Pass
    $HostList = @('server1','server2')

    foreach ($CurrHost in $HostList)
    {
        # Skip hosts that don't answer a single ping
        if ((Test-Connection -Cn $CurrHost -BufferSize 16 -Count 1 -ea 0 -Quiet))
        {
            # Query the formatted perfmon process counters on the remote host and
            # print any process using more than 1% CPU as "name percent"
            (gwmi Win32_PerfFormattedData_PerfProc_Process -ComputerName $CurrHost -Credential $Cred) |
                foreach { if ($_.PercentProcessorTime -gt 1) { $_.Name + " " + $_.PercentProcessorTime } }
        }
    }

    I've got a conference this week where I'll have some time, so I may work on this. Capturing the processes as they're running when the alert triggers is one of the biggest things for our engineers; most of the time we miss it.
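
    And as a very rough illustration of the "post the Ops Note" step mentioned above - a minimal sketch only, assuming a bearer-token credential and an Ops Notes REST endpoint; the portal name, endpoint path, and payload fields are my assumptions, so check the current LogicMonitor REST API docs before using anything like this:

    # Minimal sketch: post an Ops Note for a device via the LogicMonitor REST API.
    # ASSUMPTIONS: portal name, bearer token, the /setting/opsnotes path, and the payload
    # fields below are illustrative placeholders -- verify against the REST API docs.
    $Portal   = 'yourportal'                       # hypothetical portal name
    $Token    = $env:LM_BEARER_TOKEN               # hypothetical API token stored in an env var
    $DeviceId = 123                                # hypothetical device id
    $TopProcs = @('sqlservr 42', 'w3wp 17')        # stand-in for output captured by the script above

    $Body = @{
        note   = "High CPU alert - top processes: $($TopProcs -join ', ')"
        scopes = @(@{ type = 'device'; deviceId = $DeviceId })
    } | ConvertTo-Json -Depth 4

    Invoke-RestMethod -Method Post `
        -Uri "https://$Portal.logicmonitor.com/santaba/rest/setting/opsnotes" `
        -Headers @{ Authorization = "Bearer $Token" } `
        -ContentType 'application/json' `
        -Body $Body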

  • @Tom Lasswell The ops notes idea was not mine; someone else posted that in this forum recently.  I just liked it :).  I am not sure I would be OK with using ConfigSources for this because of the cost element.  They obviously lend themselves to this very well, since very little else records text in the system, but the billing model was designed for their intended purpose, and using them this way would be a major cost increase.  I was thinking of using an eventsource, though, which could do the trick without impacting the cost structure (a rough sketch of what such a scripted eventsource might emit is below).
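
    A minimal sketch of that eventsource idea, assuming a scripted EventSource that runs PowerShell on the collector and emits a JSON "events" list; the field names (happenedOn, severity, message, source), the 80% threshold, and the top-5 cutoff are my assumptions, so verify against the scripted-eventsource docs:

    # Sketch of a scripted EventSource body: emit a JSON "events" list when total CPU is high,
    # with the top offending processes in the message.  The field names below are my assumption
    # of the scripted-eventsource format -- verify in the documentation before using.
    $cpu = (Get-WmiObject Win32_PerfFormattedData_PerfOS_Processor -Filter "Name='_Total'").PercentProcessorTime

    $events = @()
    if ($cpu -gt 80) {   # placeholder threshold
        $top = Get-WmiObject Win32_PerfFormattedData_PerfProc_Process |
            Where-Object { $_.Name -notin @('Idle','_Total') } |
            Sort-Object PercentProcessorTime -Descending |
            Select-Object -First 5 |
            ForEach-Object { "$($_.Name)=$($_.PercentProcessorTime)%" }

        $events += @{
            happenedOn = (Get-Date).ToString('s')
            severity   = 'warn'
            source     = 'cpu-top-processes'
            message    = "CPU $cpu% - top processes: $($top -join ', ')"
        }
    }

    @{ events = $events } | ConvertTo-Json -Depth 4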

  • @Mike Suding I saw this before; I'm thinking about modifying this idea to push a configsource query and grab it that way.

    @mnagel Ops Notes seem like an interesting idea as well.

  • @Mike Suding I just read that post and it sounds hopeful, but I am confused about where this would end up so that it is usable. It looks like it would dump into the collector log?

    I know you and I have talked about this, and there have been some discussions recently about leveraging Ops Notes more readily to record information like this.  Ideally, the information can be stashed somehow (Ops Notes do seem like a good place if that could be supported within LM more readily) but also be easily accessible so that it can be presented in alerts.

    FWIW, we have a script we developed that gets the top 5 metrics via WMI alone (usage: /wm/bin/wmitop5 [--top N] [--sort-by {cpu|hnd|thd|mem}] [-A authfile] host) -- it is used within our alert templating system via the callback function (a rough PowerShell equivalent is sketched below).  This is for our pre-LM legacy tool, and I really miss templating and callbacks :(.
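
    For anyone curious, here is a rough PowerShell take on that kind of "top N by a chosen counter" query against the same WMI class used earlier in this thread - the parameter names and the counter-to-property mapping are my own stand-ins, not the actual wmitop5 implementation:

    # Hypothetical stand-in for a wmitop5-style query: top N processes by cpu/hnd/thd/mem via WMI.
    # The SortBy-to-property mapping below is an assumption, not the legacy tool's behavior.
    function Get-TopProcesses {
        param(
            [string]$ComputerName = 'localhost',
            [int]$Top = 5,
            [ValidateSet('cpu','hnd','thd','mem')][string]$SortBy = 'cpu',
            [pscredential]$Credential
        )

        # Map the short counter names to properties of Win32_PerfFormattedData_PerfProc_Process
        $prop = @{
            cpu = 'PercentProcessorTime'
            hnd = 'HandleCount'
            thd = 'ThreadCount'
            mem = 'WorkingSet'
        }[$SortBy]

        $wmiArgs = @{ Class = 'Win32_PerfFormattedData_PerfProc_Process'; ComputerName = $ComputerName }
        if ($Credential) { $wmiArgs.Credential = $Credential }

        Get-WmiObject @wmiArgs |
            Where-Object { $_.Name -notin @('Idle','_Total') } |
            Sort-Object $prop -Descending |
            Select-Object -First $Top Name, $prop
    }

    # Usage: Get-TopProcesses -ComputerName server1 -Top 5 -SortBy mem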