Forum Discussion

Joe_Tran
6 years ago

External Alerting - Script - Medium for self-heal/actions?

I will admit that I had completely forgotten that External Alerting was a thing. When we first started with LogicMonitor (like 3+ years ago), someone at LogicMonitor had mentioned External Alerting as a potential solution for a random use-case and I had immediately disregarded it out of hand, favoring a Custom HTTP Delivery integration instead.

Fast forward to now: this post (https://communities.logicmonitor.com/topic/2245-feed-lm-alerts-to-splunk-tool-excluding-custom-email-delivery-method/?do=findComment&comment=5924) and all the recent talk about self-heal and actions sparked an idea.

Some internal partners are building automation tools to resolve issues and are pretty comfortable with some DIY. Originally, I had figured I would have to get them set up with an AWS Gateway+Lambda function that receives triggered alerts and then starts a cascade of custom code, in the correct AWS VPC, to self-heal. But why bother when we have External Alerting, right? The client environments that I monitor and that these internal partners manage have dedicated collectors in each client environment. Just assign that client's collector to that client's resource groups, and throw in a broker-like script that takes in the necessary resource metadata and datasource, and executes the necessary remediation scripts. Disregard any alerts for datasources not supported by our self-healing project.
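To make the broker idea concrete, here is a minimal PowerShell sketch of the dispatch step. It is only an illustration under assumptions: the datasource names, script paths, the `Get-RemediationPlan` function, and the `-ComputerName` parameter on the remediation scripts are all hypothetical, and how External Alerting actually hands the host and datasource to the script should be checked against the documentation.

```powershell
# Hypothetical map of supported datasources to remediation scripts.
# Datasource names and paths are placeholders, not real LogicMonitor values.
$remediationMap = @{
    'Example_WindowsService' = 'C:\SelfHeal\Restart-ExampleService.ps1'
    'Example_DiskUsage'      = 'C:\SelfHeal\Clear-ExampleTemp.ps1'
}

function Get-RemediationPlan {
    param (
        [string]    $HostName,
        [string]    $DataSource,
        [hashtable] $Map
    )
    # Disregard alerts for datasources the self-healing project does not support
    if ( -not $Map.ContainsKey($DataSource) ) {
        return
    }
    [pscustomobject]@{
        HostName = $HostName
        Script   = $Map[$DataSource]
    }
}

# Sample values; in practice these would arrive as script arguments
$plan = Get-RemediationPlan `
    -HostName   'client-vm-01' `
    -DataSource 'Example_WindowsService' `
    -Map        $remediationMap

if ( $plan -and ( Test-Path $plan.Script ) ) {
    # Run the remediation script (assumed to accept -ComputerName)
    & $plan.Script -ComputerName $plan.HostName
} elseif ( $plan ) {
    Write-Output "Would run $($plan.Script) against $($plan.HostName)"
} else {
    Write-Output "No remediation defined; ignoring alert"
}
```

Keeping the lookup in its own function means the supported-datasource list can grow without touching the code that launches the scripts.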

This assumes that I'm interpreting External Alerting correctly. The key thing for this to work for my use case would be the ability to have External Alerting AND our normal Alert Rules apply to the same resources/alerts. The Alert Rules would still be responsible for delivering the alert to our ticketing system. It would be nice to know the timing of when Alert Rules trigger relative to when External Alerting triggers. The support center page for this makes it seem that the collector polls the resource group at regular, but unspecified, intervals. The Alert Rules would populate the alert with the ##externalticketid##, and it would be neat to have External Alerting also take that in as a parameter to update said ticket.

I would also need to know whether the script executed from this is subject to timeouts, concurrency limits, etc., and whether there is a limit to the number of External Alerting configs.

Am I way off base? 

  • P.S. If I have not gone crazy in my line of thinking, please update the REST API documentation to include descriptions/methods/models etc. for /setting/alert/internalalerts resources.

  • I've successfully tested using CredSSP delegation as a 2nd hop solution.  This is becoming a viable option (of course, by the time I finish, LM will have added it as a simple in-UI solution).

  • I'm in a multi-tenant IaaS Windows environment and I'll need to test my second hop authentication to make sure I can get to the servers we're monitoring from the Collector.  The External Alerting documentation specifically says it runs the script on the collector.  As that is the case, I would imagine the script can do anything you could normally do in a script on that collector.

    Caveat from the documentation: "You can only have one External Alert entry per Collector."  If you had multiple scripts you wanted to be able to run, they'd need to be tokenized, then pushed through a communication script that starts the pertinent script on the collector which then performs the corrective action on the client.  With that in mind, I'd like to rally for an external alerting method that would allow the Collector assigned to a device to be passed as a token that can be used to define where a common script would run from to be able to reach the client VM.  I also would like them to be able to be a stage in an escalation... potentially using output from the script to inform the escalation chain somehow.  At minimum a true/false continue escalation return flag.

  • This is posted with no warranty.  Use at your own risk.  Don't test it in a production environment:

    # Requires "Credential Manager 2.0"
    Import-Module credentialmanager
    
    function test-credential {
        param (
            $credential
        )
        # Validate the supplied credential against the domain.
        # Returns a real boolean: a "False" string would be truthy in an if().
        Add-Type -AssemblyName System.DirectoryServices.AccountManagement
        $DS = New-Object       System.DirectoryServices.AccountManagement.PrincipalContext('domain')
        write-output           ( $DS.ValidateCredentials($credential.UserName,$credential.GetNetworkCredential().Password) )
    }
    function get-creds       {
        param (
            $credName = "$env:USERDOMAIN\$env:USERNAME"
        )
        if ( $credName -eq "list" ) {
            # Select stored user from a list
            $credentials = Get-StoredCredential | ogv -PassThru
        } else {
            # Check for Stored Credentials
            $credentials = Get-StoredCredential | ? username -eq $credName
        }
    
        if ( $credentials.count -eq 1 ) {
    
            # Test to make sure creds work before moving forward
            if ( test-credential $credentials[0] ) {
    
                # They work, return the credentials
                write-output $credentials[0]
    
            } else {
    
                # They don't work, have the user update the password
                $cred = Get-Credential `
                    -UserName $credName `
                    -Message "Please update your password"
                     
                # Test to verify the new creds work
                if ( test-credential -credential $cred ) {
    
                    # Update stored cred and return the credentials
                    Remove-StoredCredential `
                        -Target   $credName
    
                    # Discard the cmdlet's output so only $cred is returned
                    $null = New-StoredCredential `
                        -Target   $credName `
                        -UserName $cred.UserName `
                        -Password $cred.GetNetworkCredential().Password
    
                    write-output  $cred
    
                } else {
    
                    # Updated creds failed, return FALSE
                    write-output $false
    
                }
            }
        } else {
    
            # Need fresh creds to store for future use
            $cred = Get-Credential `
                -UserName $credName `
                -Message "Enter account password"
    
            # Test to verify the new creds work
            if ( test-credential -credential $cred ) {
    
                # Store cred and return the credentials
                # (discard the cmdlet's output so only $cred is returned)
                $null = New-StoredCredential `
                    -Target   $credName `
                    -UserName $cred.UserName `
                    -Password $cred.GetNetworkCredential().Password
    
                write-output  $cred
    
            } else {
    
                # creds failed, return FALSE
                write-output $false
    
            }
    
        }
    }
    
    $Computer      = "<1st-Hop Computer>"
    $TestComputer  = "<2nd-Hop Computer>"
     
    # This was tested with a 'Domain.local' type of domain
    $domain        = "<Domain>"
    $FQDN          = "$Computer.$domain"
    
    # These credentials need to have access to both computers
    # (assumed, not tested - based on my understanding of the CredSSP token auth process)
    $cred          = get-creds
    
    #region Enable-CredSSPDelegation
    
    # write-host   ""
    $sessionRemote = New-PSSession $Computer
    
    # write-host "get wsmancredssp from remote/delegate"
    $remoteSetting = Invoke-Command `
        -Session      $sessionRemote `
        -ScriptBlock  {
            Get-WSManCredSSP
        }
    
    # write-host "set wsmancredssp on remote"
    Invoke-Command `
        -Session      $sessionRemote `
        -ScriptBlock  {
            Enable-WSManCredSSP -Role Server -Force
        }
    
    # write-host "set wsmancredssp on local with remote as delegate"
    Start-Process     powershell.exe `
        -Verb         runas      `
        -ArgumentList "-command & {Enable-WSManCredSSP -Role Client -DelegateComputer $FQDN -Force}"
    
    # write-host "Connecting WSMan to remote"
    Connect-WSMan     $FQDN
    Set-Item          WSMan:\$FQDN\Service\Auth\CredSSP -Value $True
    
    sleep 15
    
    # write-host "Starting CredSSP session"
    $workingSession = New-PSSession `
        -ComputerName   $FQDN       `
        -Authentication Credssp     `
        -Credential     $cred
    
    #endregion
    
    write-host "--- Starting Processing ---"
    
    # This just looks for the print spooler service on the 2nd-Hop Computer
    # It runs the invoke-command on the 1st-Hop Computer to do so
    # This Scriptblock is where the payload goes
    Invoke-Command                      `
        -session        $workingSession `
        -ScriptBlock    {
            # $using: passes the local $TestComputer value into the remote session
            get-service *spool* -ComputerName $using:TestComputer
        }
    
    write-host "--- Finished Processing ---"
    
    #region Disable-CredSSPDelegation
    
    # write-host "Close Session"
    Remove-PSSession -Session $workingSession
    
    # write-host "Disconnecting WSMan from remote"
    Set-Item         WSMan:\$FQDN\Service\Auth\CredSSP -Value $False
    Disconnect-WSMan $FQDN
    
    # write-host "set wsmancredssp from local back to initial settings"
    Start-Process powershell.exe `
        -Verb runas `
        -ArgumentList "-command & {Disable-WSManCredSSP -Role Client}"
    
    # write-host "set wsmancredssp from remote/delegate back to initial settings"
    
    # Only undo the Server role if it was not enabled before this script ran
    if ( $remoteSetting -like "*not configured to receive*" ) {
        Invoke-Command `
            -Session $sessionRemote `
            -ScriptBlock  {
                Disable-WSManCredSSP -Role Server
            }
    }
    
    Remove-PSSession -Session $sessionRemote
    
    #endregion

    No, I'm not OCD, what makes you think that?  I like neat vertical lines I can follow for reference in my code.  This all collapses very nicely in PowerShell ISE.

  • In terms of timing between steps in an escalation, we're using blank steps to create time gaps in the escalation.  For instance, AOS services (Dynamics AX) often come up slowly.  We'd like our team to be notified if a service hasn't come back in 5 minutes; then, if it's been half an hour, we need to either notify the customer or generate a ticket.  We set 5 minutes between steps, then set them up thusly:

    1. (blank)
    2. email internal support
    3. (blank)
    4. (blank)
    5. (blank)
    6. email customer
    7. generate a ticket
    8. (blank)

    We leave the last step blank after generating a ticket to prevent the system from sending more tickets through our system (ours is currently driven via email), as the escalation chain repeats the last step if the alert is left unresolved and unacknowledged.
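The timing of a chain padded with blank steps can be worked out with a few lines of PowerShell. This is only a sketch: `Get-EscalationSchedule` is a hypothetical helper, and it assumes the first stage fires when the alert is first routed, with each later stage following one interval after the previous one.

```powershell
function Get-EscalationSchedule {
    param (
        [string[]] $Stages,
        [int]      $IntervalMinutes = 5
    )
    for ( $i = 0; $i -lt $Stages.Count; $i++ ) {
        # Blank stages only consume time; emit the stages that do something
        if ( $Stages[$i] ) {
            [pscustomobject]@{
                Stage   = $i + 1
                Minutes = $i * $IntervalMinutes
                Action  = $Stages[$i]
            }
        }
    }
}

# The eight-stage chain described above; empty strings are the blank steps
$stages = @(
    '',
    'email internal support',
    '',
    '',
    '',
    'email customer',
    'generate a ticket',
    ''
)

Get-EscalationSchedule -Stages $stages -IntervalMinutes 5 |
    Format-Table -AutoSize
```

Under those assumptions, internal support is emailed at 5 minutes, the customer at 25 minutes, and the ticket is generated at 30 minutes; the blank eighth stage is what repeats afterwards.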