Forum Discussion

Vitor_Santos
5 years ago

Monitor Linux Processes (via SSH)

Hello,

In our current monitoring tool, we monitor Linux process profiles using a regex to match one or more processes.
For example, we have a profile that looks for processes that contain /.*OpswiseAgent.*/ in their cmdline path.

Once those are running, the probe picks them up automatically and monitors their state (without keying on the PID, because that might change).
In LM we also would not rely on the PID as the wildvalue, since it might change. There can also be more than one process (with different PIDs) running with the exact same cmdline, so those need to be picked up as separate instances.
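To make the matching step concrete, here is a minimal Python sketch of filtering a process listing by a cmdline regex. It assumes the listing comes from something like `ps -eo pid,args` run over SSH; the sample output, the paths in it, and the function name `match_processes` are hypothetical and only for illustration.

```python
import re


def match_processes(ps_output, pattern):
    """Return (pid, cmdline) pairs whose command line matches the regex."""
    regex = re.compile(pattern)
    matches = []
    for line in ps_output.splitlines():
        line = line.strip()
        if not line:
            continue
        pid_text, _, cmdline = line.partition(" ")
        if not pid_text.isdigit():
            continue  # skip the header row
        cmdline = cmdline.strip()
        if regex.search(cmdline):
            matches.append((int(pid_text), cmdline))
    return matches


# Hypothetical sample of `ps -eo pid,args` output (paths are made up).
SAMPLE_PS_OUTPUT = """\
  PID COMMAND
  101 /opt/opswise/bin/OpswiseAgent --config a.conf
  202 /usr/sbin/sshd -D
  303 /opt/opswise/bin/OpswiseAgent --config a.conf
"""

if __name__ == "__main__":
    for pid, cmdline in match_processes(SAMPLE_PS_OUTPUT, r".*OpswiseAgent.*"):
        print(pid, cmdline)
```

Note that the two matching rows differ only in their PID, which is exactly the duplicate-cmdline case described above.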

I'm just unsure how to build a working solution with all of this in mind (unique wildvalue & wildalias).
Can anyone assist here perhaps?

Regards,

 

  • Anonymous

    Have you looked at mine? https://github.com/sweenig/lmcommunity/tree/master/ProcessMonitoring/Linux_SSH_Processes_Select

    Duplicate processes with the same command line and same name will be a problem if you're ignoring PID. If you were checking manually, how would you differentiate between the two across sessions? I mean, say you logged in once and saw process A and process B. If you logged in again 1 minute later and saw the same list of processes, how would you know which one you had previously called A and which one you had previously called B?

    The answer is, of course, to run them in separate containers, but that's a different discussion.

  • In reply to Stuart Weenig:

    Yeah, I looked into your example, but that uses the PID as the wildvalue.

    That's exactly the tricky part. Our current probe is able to catch those processes (even with the same name) and alarm if they get stopped. I just don't know how to replicate this in LM.
    I was wondering if anyone might come up with a workaround that I'm not seeing.

    I've tried to come up with something that uses the cmdline and then enumerates the instances as cmdline#1, cmdline#2, etc. But if there are 3 instances of the process running now and later there are only 2, the third instance will raise an alarm (because we don't want to delete it, since we want the historical data).

    I guess our only solution would be to ask the client how many processes should be running with that cmdline and alert if the count is lower than expected.
    But this is a downgrade from the monitoring we do for them today :(
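The cmdline#1/#2 enumeration mentioned above can be sketched in Python as follows. This is only an illustration of the naming scheme, not LM code; the function names and the `agent` cmdline are hypothetical.

```python
from collections import defaultdict


def enumerate_instances(cmdlines):
    """Name each process as '<cmdline>#<n>', counting duplicates in order."""
    counters = defaultdict(int)
    names = []
    for cmdline in cmdlines:
        counters[cmdline] += 1
        names.append(f"{cmdline}#{counters[cmdline]}")
    return names


def missing_instances(previous, current):
    """Return instance names seen in the previous poll but not in this one."""
    return sorted(set(previous) - set(current))


if __name__ == "__main__":
    before = enumerate_instances(["agent", "agent", "agent"])
    after = enumerate_instances(["agent", "agent"])
    print(missing_instances(before, after))
```

Whichever of the three processes actually stopped, it is always the highest-numbered name that goes missing, so each name tracks a slot rather than a specific process.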

     

  • Anonymous

    The only problem is with the duplicates, because you can set the wildvalue to be anything. It doesn't have to be numeric. Just avoid or strip out special characters.
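As a rough illustration of stripping special characters, here is a small Python sketch that turns a cmdline into a wildvalue-safe string. The allow-list (letters, digits, underscore, dot, hyphen) is a conservative assumption, not LM's documented rule, and `to_wildvalue` is a hypothetical helper name.

```python
import re


def to_wildvalue(cmdline):
    """Collapse every run of characters outside the allow-list into '_'."""
    cleaned = re.sub(r"[^A-Za-z0-9_.-]+", "_", cmdline)
    return cleaned.strip("_")


if __name__ == "__main__":
    print(to_wildvalue("/opt/opswise/bin/OpswiseAgent --config a.conf"))
```

Two different cmdlines can collapse to the same string under this scheme, so it does nothing for the duplicate problem and may add collisions of its own.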

    The real problem is keeping duplicates straight between polls without something unique to tie them together.

    You could consider doing a datasource that counts processes by name. Each name would be an instance, and you could count the number of up processes. Set a static threshold for processes that should have multiple copies running, or set a property detailing how many of the same process should be running and compare that to how many actually are.
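The count-by-name idea can be sketched like this in Python. It assumes the expected counts come from some per-device setting (for example a property); the `expected_counts` mapping, the process names, and `find_shortfalls` are all hypothetical.

```python
from collections import Counter


def find_shortfalls(running_names, expected_counts):
    """Return {name: (actual, expected)} for names running below expectation."""
    running = Counter(running_names)
    shortfalls = {}
    for name, expected in expected_counts.items():
        actual = running[name]  # Counter yields 0 for names not seen
        if actual < expected:
            shortfalls[name] = (actual, expected)
    return shortfalls


if __name__ == "__main__":
    running = ["OpswiseAgent", "OpswiseAgent", "sshd"]
    expected = {"OpswiseAgent": 3, "sshd": 1}
    print(find_shortfalls(running, expected))
```

Because the instance is the process name rather than an individual process, nothing needs to be matched up between polls; only the count is compared.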

  • Anonymous

    Could be a situation where dynamic thresholds could really help as well. Let LM learn how many processes of a given name are running and alert you when it changes.

  • Yeah, I really think that's the most reliable option to pursue.

    However, I've coded a DS just to see how it behaves (WinProcessStats_Responsiveness).
    Just in case you want to have a look. The only problem I'm having with that DS is that I don't have Active Discovery deleting the instances (so they'll stay there alarming if the process is no longer running, which is kind of what we want, but not perpetually).

    This is why we're really leaning towards just expecting a number of processes and alarming if it's lower than that.

  • Anonymous

    You could set it to delete after 30 days instead of immediately.

  • I see, but then it would reappear if the instance gets discovered again, yet not actually alarm if the process stops (since the instance disappears), right?
    AD will no longer consider it an instance once it stops. Maybe I'm confusing things.

  • Anonymous

    Yeah, well, the instance would still be present for up to 30 days in case the process comes back (with the same wildvalue/wildalias). However, it wouldn't be included in polling since it's not showing up in active discovery results. So, the flow would be:

    The instance is up and monitoring; the process goes down; the instance alerts until the next active discovery, at which point polling for it is disabled. If the process comes back up with the same wildvalue/wildalias within 30 days, polling would resume. Otherwise, it would get deleted at 30 days, along with historical data.
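The flow described above can be modeled with a small Python simulation. This is a simplified sketch of the behavior as described in this post, not LM's actual implementation; the `Instance` class, `apply_discovery`, and the day-based bookkeeping are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    wildvalue: str
    polling: bool = True
    days_missing: int = 0


def apply_discovery(instances, discovered, days_elapsed, retention_days=30):
    """Update the instance table after one discovery pass.

    instances: dict mapping wildvalue -> Instance (mutated in place)
    discovered: set of wildvalues returned by this discovery pass
    days_elapsed: days since the previous pass (hypothetical bookkeeping)
    """
    # Anything discovered is (re)enabled for polling.
    for wildvalue in discovered:
        inst = instances.setdefault(wildvalue, Instance(wildvalue))
        inst.polling = True
        inst.days_missing = 0
    # Anything not discovered stops polling and ages toward deletion.
    for wildvalue in list(instances):
        if wildvalue not in discovered:
            inst = instances[wildvalue]
            inst.polling = False
            inst.days_missing += days_elapsed
            if inst.days_missing >= retention_days:
                del instances[wildvalue]  # history is dropped with it
    return instances


if __name__ == "__main__":
    table = {}
    apply_discovery(table, {"agent"}, days_elapsed=0)
    apply_discovery(table, set(), days_elapsed=1)
    print(table)
```

In this model an instance can only alert between the moment its process stops and the next discovery pass, because polling is switched off once discovery no longer returns it.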

  • In reply to Stuart Weenig:

    Gotcha! I guess that doesn't work for what we want here (since AD runs every 15 minutes)...

    Thank you for the help anyway Stuart!