Forum Discussion

Vitor_Santos
5 years ago

Datasource to monitor Windows Services/Processes automatically?

Hello,

We recently cloned 2 LogicMonitor out-of-the-box datasources (names: WinService- & WinProcessStats-) in order to enable the 'Active Discovery' feature on them.
We did this because we need to discover services/processes automatically, since we don't have an 'exact list' of which services/processes we should monitor (due to the number of clients [100+] & the different services/solutions across them).

After enabling this, it works fine & does what we expect (it discovers all the services/processes running on each box). We then added some filters to the Active Discovery for the services in order to exclude common 'noisy' services & grab only the ones set to start automatically with the system.
Our problem is that these 2 datasources impact the collector's performance (due to the huge number of WMI queries). CPU consumption sits at almost 100% all the time, which degrades the collector's performance & data collection (resulting in request timeouts & full WMI queues).

We also thought about creating 2 datasources (services/processes) for each client (with filters to grab the critical/wanted processes/services for the client in question), but that's a nightmare (especially when you have clients installing applications without any notice & expecting us to automatically grab & monitor those).

Example of 1 of our scenarios (1 of our clients):

- Collector is a Windows VM (VMware) with 8GB of RAM & 4 allocated virtual processors (host processor is an Intel Xeon E5-2698 v3 @ 2.30GHz)
- Currently, it monitors 78 Windows servers (not including the collector) & those 2 datasources are creating 12,700 instances (4513 - services | 8187 - processes) - examples below

This results in approx. 15 requests per second

 This results in approx. 45 requests per second
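
As a rough sanity check on these figures, the numbers quoted above can be combined in a few lines. This is only back-of-the-envelope arithmetic, assuming each instance generates one WMI request per poll; the variable names are illustrative and nothing here comes from the collector itself.

```python
# Figures quoted in the post above.
service_instances = 4513
process_instances = 8187
request_rates = [15, 45]  # approx. requests per second for the two datasources

total_instances = service_instances + process_instances
total_rate = sum(request_rates)

# Assumption: one WMI request per instance per poll, so the average time
# between two polls of the same instance is instances / requests-per-second.
avg_seconds_between_polls = total_instances / total_rate

print(f"total instances: {total_instances}")
print(f"combined rate: {total_rate} requests/s")
print(f"average poll interval: {avg_seconds_between_polls:.0f} s")
```

In other words, the two datasources together keep the collector issuing about 60 WMI requests every second, around the clock.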

According to the collector capacity document (ref. Medium Collector) we are below the limits (for WMI); however, those 2 datasources are contributing A LOT to filling up the queues.
We're seeing errors on a regular basis - example below




To sum this up, we are looking for another 'way' of doing the same thing without consuming so many resources on the collector end (due to the number of simultaneous WMI queries). Not sure if that's possible though.
Has anyone had this need in the past & been able to come up with a different solution (less resource intensive)?

We're struggling here mainly because we come from an agent-based solution (which didn't face this problem, since the load was distributed across the individual agents, one per device).

Appreciate the help in advance!

Thanks,

  • 3 minutes ago, Stuart Weenig said:

    Yes, the groovy based WMI query could do a single call and grab all the data.

    Can you do WMI using groovy? I looked around before but didn't find an example. I try to use groovy when I can, although WMI is only supported on Windows collectors anyway.

  • Anonymous

    Ooo, that's a great point. Yes, the groovy based WMI query could do a single call and grab all the data. You'd have to parse through it and print each bit out, but that's easily doable with a for loop.

    And keeping discovery using the WMI method is an excellent option.

  • From my understanding, the native WMI-based checks make a new WMI call for each instance, so 1 WMI call for each Windows service and process, which is why you see 12k of them. A lot of check types work that way, but there is one option that lets you make one WMI call per device (if you can get all the data in one call) and extract the data in bulk for all instances at once: BATCHSCRIPT. I'm not sure if it would completely help in your situation, but if you switch from native WMI to something like a PowerShell or Groovy BatchScript, you can send one WMI query to the server and get data for all services/processes at once. Scripts do cause more load on the collector than most native checks, but roughly 150 script instances (75*2) are likely less load than 12k WMI instances. Actually, I think the collector does WMI queries via PowerShell anyway (not 100% sure about that), so that would be even less of a concern.

    You can still keep the old WMI AD method and just switch the Collector Attributes to use BatchScript.
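
    To make the "one call, many instances" idea concrete, here is a minimal sketch of the fan-out step. It is written in Python only so it runs anywhere; a real collector script would be Groovy or PowerShell. The rows are hard-coded stand-ins for what a single bulk WMI query might return, the field names (`Name`, `StateCode`, `StartModeCode`) are hypothetical, and the `wildvalue.datapoint=value` line format reflects my understanding of the BatchScript output convention, so check it against the LogicMonitor documentation.

```python
def format_batch_output(rows, key_field, datapoints):
    """Turn one bulk query result into one output line per instance/datapoint."""
    lines = []
    for row in rows:
        wildvalue = row[key_field]  # instance identifier found by discovery
        for datapoint in datapoints:
            lines.append(f"{wildvalue}.{datapoint}={row[datapoint]}")
    return lines


# Hypothetical stand-in for the result of ONE bulk query against a device.
sample_rows = [
    {"Name": "Spooler", "StateCode": 1, "StartModeCode": 1},
    {"Name": "W32Time", "StateCode": 0, "StartModeCode": 1},
]

for line in format_batch_output(sample_rows, "Name", ["StateCode", "StartModeCode"]):
    print(line)
```

    The point of the sketch is the shape of the loop: the expensive query happens once per device, and the per-instance work is just string formatting.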

  • Anonymous

    Seems like what it's coming down to is that you are trying to monitor more stuff than the current collector resources can handle. Only two options really: reduce collection (stricter filters) or increase collector size.  You're already excluding manual and disabled services, right?

  • Hello @Stuart Weenig,

    If I understood it properly, I think we already did that with a custom property that filters 'noisy' services using a regex (on a group/device level). That does help to create exceptions.
    However, that still leaves us with the same issue, for the reasons I mentioned above.

  • Anonymous

    Alright, the documentation on that DS has been updated in the repo. Take a look and see if it helps.

  • Anonymous

    Take a look at these two datasources: https://github.com/sweenig/lmcommunity/tree/master/ProcessMonitoring. I just realized I haven't done any documentation on that part of the repo, so give me a few minutes and I'll commit some instructions.

    It doesn't change how many resources are used to monitor a process, but it does do what I think you were referring to in the nightmare scenario above. They let you specify an include and an exclude filter as properties on the device level or on the group level. So, you can just provide a regex to dictate what you want to include and what you want to exclude in Active Discovery.
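
    For illustration, the include/exclude logic can be sketched as below. This is not the code from the linked repo, just a minimal Python sketch of the idea; the function name, the sample process names, and the two patterns are all hypothetical, and in practice the patterns would come from device- or group-level properties.

```python
import re


def select_instances(names, include=None, exclude=None):
    """Keep names matching the include regex (if set) and not the exclude regex (if set)."""
    include_re = re.compile(include) if include else None
    exclude_re = re.compile(exclude) if exclude else None
    selected = []
    for name in names:
        if include_re and not include_re.search(name):
            continue  # not on the wanted list
        if exclude_re and exclude_re.search(name):
            continue  # explicitly filtered out
        selected.append(name)
    return selected


# Hypothetical process list and patterns.
discovered = ["sqlservr.exe", "svchost.exe", "w3wp.exe", "sqlwriter.exe"]
print(select_instances(discovered, include=r"^(sql|w3wp)", exclude=r"writer"))
```

    Applying the exclude pattern after the include pattern means a broad include can be narrowed per device or group without rewriting it.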