TEST ONLY - Troubleshooting Alert Delivery
Last updated on 09 January, 2021

Overview

All alerts display within your LogicMonitor interface. You can additionally choose to have alerts routed (using alert rules and escalation chains) via a variety of delivery methods, including text, email, voice call, or integration with a third-party app such as a ticketing system. If you think you aren’t receiving routed alert notifications that you should be receiving, or if you think you are receiving too many alert notifications, follow the troubleshooting tips in the following sections.

Troubleshooting Missing Alert Notifications

The generation of alerts and subsequent routing of alert notifications has many moving parts in LogicMonitor. In addition, there are features that seek to intelligently suppress alert notifications under targeted circumstances in order to reduce alert noise. Review the possible causes for missing alert notifications in the following sections to see if any apply to your situation.

Are the alerts being generated?

First, it is important to distinguish whether your problem is with alert generation or alert delivery. All alerts, whether routed or not, display on the Alerts page/tab in your LogicMonitor account. If you cannot find the alerts for which you think you should be receiving notifications within the interface (make sure to manually include cleared alerts in your filter criteria), then the alert probably isn’t being triggered in the first place. In this case, you’ll need to adjust the triggering criteria (e.g. datapoint thresholds, website alerting configurations, etc.) such that alerts are triggered as you expect.

Does the alert match an alert rule?

If you do see the alerts within your LogicMonitor account, but you aren’t receiving alert notifications, then you need to determine whether you have an alert rule configured to route notifications for that type of alert.
Remember that in order for alert notifications to be routed, the particular website, EventSource, resource datapoint, etc. must match an alert rule, and this alert rule must reference an escalation chain that contains the recipients that you want to deliver notifications to. In most cases, alert notifications do not reach their intended destinations because they are being matched to an unexpected alert rule. To troubleshoot this possibility, you can:

- Test alert routing, as discussed in Testing Alert Delivery.
- Display the Alert Rule column on the Alerts page to see what alert rule is matching the alert. For more information on customizing columns on the Alerts page, see Managing Alerts from the Alerts Page.

Was the alert triggered during SDT?

Keep in mind that alerts that occur during periods of scheduled downtime (SDT) display in the LogicMonitor interface, but are never routed for external delivery. A resource (or website, EventSource, etc.) that is in SDT is denoted with a unique clock icon throughout the LogicMonitor interface to help you quickly identify SDT status.

Are alert notifications being suppressed by one of LogicMonitor’s AIOps features?

It is possible that an alert could match an alert rule, but still not be routed beyond LogicMonitor’s interface. This scenario occurs if alert notification suppression is enabled via one of LogicMonitor’s AIOps features that serve to intelligently reduce alert noise. For more information on these features, see Enabling Dynamic Thresholds for Datapoints and Enabling Root Cause Analysis, respectively.

Is the escalation chain rate limited?

If rate limiting is enabled for the escalation chain, the number of alert notifications that can be sent to the escalation chain in a specified time period is limited. For more information on rate limiting, see Escalation Chains.

Is the contact information for your user incorrect?

Escalation chain recipients are typically specified using user accounts.
If the information for a user in an escalation chain is incorrect, alert notifications won’t be delivered correctly. Double-check the contact settings (Settings | Users & Roles | Users) for the user account in question.

Is your receiving email or SMS gateway refusing messages or queuing messages for delivery?

Alert notification messages could be refused or queued because of spam control, gateway misconfiguration, DNS issues, etc.

Was the alert notification marked as spam by your email client?

Check your spam folder.

Is the missing alert notification for a Collector, website, EventSource, or external alert?

- Collector. If notifications for Collector down alerts are not being received, make sure there is a valid escalation chain specified for your Collector, as discussed in Monitoring Your Collectors.
- Websites. LogicMonitor uses checkpoints to determine if websites are accessible. Configured Web Checks and Ping Checks allow you to differentiate alert notification settings depending upon the failure of multiple or individual checkpoints. Make sure these settings are as you expect. For more information on alert settings for websites, see Adding a Ping Check or Adding a Web Check.
- EventSources. LogicMonitor automatically suppresses some duplicate EventSource alert notifications. Review the duplicate suppression details provided in Creating EventSources to ensure behavior is as you expect.
- External alerting. Ensure that the referenced Collector is online.

Troubleshooting Too Many Alert Notifications

Receiving too many LogicMonitor alert notification emails can ultimately lead to alert fatigue and the ignoring of important alerts. Some tips for avoiding this undesirable situation include:

- Tuning your static datapoint thresholds to suit your environment, as discussed in Tuning Static Thresholds for Datapoints.
- Enabling AIOps features that serve to intelligently suppress alert notifications for targeted situations.
For more information on these features, see Enabling Dynamic Thresholds for Datapoints and Enabling Root Cause Analysis, respectively.
- Avoiding routing all alerts. Some alerts, such as alerts with a severity of warning (as compared to error or critical), are better viewed regularly in LogicMonitor reports or posted to a ticketing system using custom alert delivery methods.

TEST ONLY - Troubleshooting Perfmon Access
Last updated on 17 March, 2023

Chances are that if you are an avid Windows user, you have come across the Perfmon utility at least once in your exploration of system and network monitoring. Most of LogicMonitor’s Windows data collection uses WMI queries, but we do utilize Perfmon counters for our Windows SQL Server, Exchange (earlier versions), and SMTP DataSources. If you see gaps of No Data for these DataSources in particular, but the rest of your data collection (CPU, Disk, Memory, Ping, DNS metrics) is stable, then it is likely there is an issue with Perfmon on the monitored device.

Because Perfmon accesses performance counters from remote hosts, your Collector services must have Local Administrative privileges to access counters on your remote host if you are using a domain. If you are using a workgroup, your Collector will need to be running under a user that has Local Administrative rights on the remote workgroup host that you are attempting to monitor.

NOTE: If the WMI credentials set for your device include a domain\user, but the remote computer is in a different domain, and the user is local, you may need to define pdh.user and pdh.pass properties to access Perfmon data. If pdh.user and pdh.pass properties are defined on your device, they take precedence over any WMI username and password properties when collecting Perfmon data.

Perfmon Connection Timeouts & Latency

The most common symptoms seen when troubleshooting Perfmon instability are issues where Perfmon is having difficulty initiating a connection to a remote host. This connection must succeed between the Collector and host in order for the LogicMonitor Active Discovery mechanism to detect which Perfmon performance counters are available on the remote host and to read data from the host. After DataSource instances have been added to the host, Perfmon connectivity must continue to function in order for data collection to operate in a stable manner.
If connectivity is interrupted after DataSource instances have been discovered, you will end up with blank or spotty graphs that return No Data. To troubleshoot these connectivity issues:

1. Remote into the Collector machine as the user that your Collector services run under, and open a Run prompt > “perfmon.exe” to access the Perfmon GUI.
2. Use the green “+” option to add a new counter.
3. Specify the UNC path of your remote machine. When you are finished adding the path, Perfmon will pause while it automatically attempts to make a connection.
4. If the connection is successful, you should see a list of all of the available Perfmon performance counters for your remote machine.
5. If the connection fails, attempt the same procedure from the local Collector to itself, and from the local host to itself, to isolate the issue to the host machine or the Collector machine.

If you are unable to make a connection using the perfmon.exe utility on your Collector to your remote host, the Collector services will be unable to do so as well. This indicates an issue not with LogicMonitor, but with your Windows configuration or authentication.

Services & Dependencies

If WMI-based data collection succeeds, but Perfmon DataSources fail and you cannot establish a connection via the Perfmon utility, the issue may be that certain services need to be running on the host being monitored for Perfmon to respond to RPC queries.

The following services must be set to an Automatic startup type:
- Remote Procedure Call (RPC)
- Remote Registry

And the following services must be set to a Manual or greater startup type:
- WMI Performance Adapter
- Performance Counter DLL Host
- Performance Logs and Alerts
- Remote Procedure Call (RPC) Locator

Ports

Perfmon relies on inbound RPC port 135 TCP and Windows SMB port 445 TCP on the host. When troubleshooting Perfmon connectivity issues, please ensure that these ports are unrestricted in your firewall configuration.
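As a quick first check from the Collector machine, the two ports named above can be probed with a plain TCP connection attempt. The following is a minimal Python sketch, not a LogicMonitor tool: the function names and the example hostname are hypothetical, and a successful connect only shows that the port is reachable, not that Perfmon authentication will succeed.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_perfmon_ports(host: str, ports=(135, 445), timeout: float = 3.0) -> dict:
    """Map each port (default: RPC 135 and SMB 445) to its reachability."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}


# Example usage (the hostname is a placeholder for your monitored host):
#   for port, ok in check_perfmon_ports("remote-host.example.com").items():
#       print(f"TCP {port}: {'reachable' if ok else 'blocked or closed'}")
```

If either port reports as unreachable, the firewall path between the Collector and the host is the first thing to review before looking at services or permissions.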
Permissions

If the steps above do not resolve the issue and you suspect permission issues are to blame, you can work around this via regedit:

1. Open regedit on the machine to which you are trying to connect with Perfmon.
2. Browse to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Perflib.
3. Right-click the Perflib key and select Permissions.
4. Click Add and add Local Service with Full Control.
5. Save and exit.
6. Restart the Remote Registry service.

Rebuilding Perfmon Performance Counters

If you get a response via the Perfmon utility and from the Collector debug console, but you get an error similar to “Object Not Found” when you attempt to view a counter, try repairing the counters with the following commands:

    cd c:\windows\system32
    lodctr /R
    cd c:\windows\sysWOW64
    lodctr /R
    WINMGMT.EXE /RESYNCPERF

A restart is advised, but if that is not possible:

- Stop and restart the Windows Management Instrumentation service (set to automatic if not already).
- Stop and restart the Performance Logs and Alerts service (set to automatic if not already).

Also, see SQL Server monitoring.

For additional help with troubleshooting your Windows Perfmon data collection, please submit a support ticket or use the chat with engineer link.

TEST ONLY - Troubleshooting NetFlow Monitoring Operations
Last updated on 17 March, 2023

Overview

LogicMonitor offers several troubleshooting tools to help you identify and resolve issues with LogicMonitor’s monitoring of network traffic flow data. Read more about Troubleshooting Netflow Monitoring Operations.

Have questions or feedback? Please reply below.

TEST ONLY - Troubleshooting WMI
Last updated on 17 March, 2023

Overview of WMI Access Permissions

Note: A Windows Collector must be used in order to monitor Windows hosts.

The LogicMonitor Collector primarily uses Windows Management Instrumentation (WMI) to monitor Windows servers. Most issues with the Windows task collection result from permission restrictions when the Collector machine attempts to query your hosts for data. In these situations, the credentials for both of your Collector services, “LogicMonitor Collector” and “LogicMonitor Watchdog”, should reference either a domain user that is an administrative account on the hosts to be monitored, or a local administrator that will be available on each Windows host to be monitored by this Collector. To change the user the services run as, change the credentials in the “Log On” tab for both services, and then start the services again.

If you cannot run the Collector under an administrator user, or if you are monitoring hosts between multiple domains and need to make a host-specific credential adjustment, follow these instructions to add the “wmi.user” and “wmi.pass” custom properties to your host. The “wmi.user” custom property should be formatted as DOMAIN\USERNAME. To specify a local user rather than a domain user, replace DOMAIN with the ##HOSTNAME## token, ‘.’, or the machine’s name so that the wmi.user value is ##HOSTNAME##\USERNAME, .\USERNAME, or MACHINENAME\USERNAME.

Data Collection Failure due to WMI Vulnerabilities

Issue

When Microsoft identified critical vulnerabilities with WMI, it released hardening changes for the Windows DCOM Server Security Feature Bypass vulnerability (CVE-2021-26414). After this update was applied on the server, we observed occurrences of event ID 10036 in the DCOM RPC communication between the client and server. When the patch is installed on the server machine, the ‘RequireIntegrityActivationAuthenticationLevel’ registry value is disabled by default.
When you enable it on the server (without making any changes to, or applying the patch on, the client), it affects the DCOM RPC communication, resulting in the “Access is Denied” error. To understand the issue in detail, see the Microsoft documentation Manage changes for Windows DCOM Server Security Feature Bypass.

Solution

It is therefore recommended that you first patch the Collector device and then the monitored device to the latest updates to resolve the event ID 10036 issue. When the patch is installed on the client machine, it enables RPC_C_AUTHN_LEVEL_PKT_INTEGRITY on DCOM clients by default. As a result, both the DCOM RPC communication between the client and the server and data collection in the Collector succeed.

To address the vulnerabilities, on June 14, 2022, Microsoft programmatically enabled the hardening on DCOM servers by default; it could be disabled via the RequireIntegrityActivationAuthenticationLevel registry key if necessary.

Note: According to Microsoft, on March 14, 2023 hardening changes will be enabled by default with no ability to disable them. If you have bypassed the hardening that was released as part of the June 14, 2022 patch, you have to take action now, because the setting will not work after March 14, 2023. Microsoft is addressing this vulnerability in a phased rollout.
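The phases described above can be summarised in a few lines of Python. This is only a sketch of the timeline as stated in this article: the function name and the phase labels are hypothetical, and the actual behaviour of a given machine depends on which Windows updates are installed on it, not on the calendar date alone.

```python
from datetime import date

# Dates as given in the text above; the phase wording is illustrative.
ENABLED_BY_DEFAULT = date(2022, 6, 14)
ALWAYS_ENABLED = date(2023, 3, 14)


def dcom_hardening_phase(patch_release: date) -> str:
    """Describe the DCOM hardening state for an update released on patch_release."""
    if patch_release >= ALWAYS_ENABLED:
        return "enabled, cannot be disabled"
    if patch_release >= ENABLED_BY_DEFAULT:
        return "enabled by default, can be disabled via registry"
    return "disabled by default, can be enabled via registry"
```

For example, a server patched to an update released on or after March 14, 2023 falls into the last phase, so an unpatched Collector machine would be expected to hit the “Access is Denied” error described above.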
To know more about the vulnerability, solution, and updates, see the Microsoft documentation Windows DCOM Server Security Feature Bypass CVE-2021-26414.

WMI Services and Dependencies

All of the following services should be running and set to an “Automatic” startup type for WMI monitoring on a Windows host:
- DCOM Server Process Launcher
- Remote Procedure Call (RPC)
- RPC Endpoint Mapper
- Windows Management Instrumentation

And the following service(s) may be set to a “Manual” startup type:
- WMI Performance Adapter

Using WBEMTEST for Advanced Troubleshooting

To test a WMI connection manually, run the WBEMTEST utility from the host on which the Collector is running. The following steps describe how to connect to the remote computer and pass WMI queries using the Windows WBEMTEST tool, which you can use to quickly explore or confirm WMI details. (See the sections below for additional detail.)

1. Click Start > Run… > “wbemtest” to open the WBEMTEST utility.
2. Click “Connect”. Then enter the local or remote host IP into the namespace field (for example, 192.168.23.1 to check WMI connectivity of that host), followed by “\root\cimv2”, and enter credentials into the Connection dialog.
3. Click Connect. If something is wrong that prevents WBEMTEST from connecting, an error dialog will show the reason for the failure. If your connection is successful, you will be returned to the main window, this time with additional options available.
4. Click Enum Classes… > toggle Recursive > OK. This should return a list of your available WMI classes. Most normal Windows installations have 800-1200 classes.

If you do not get a list of classes returned, there may be an incompatibility between the WMI implementations of the different hosts. One workaround is to install a Collector on the same OS as the host you want to query (or on that very host). Contact our support for additional troubleshooting and workaround options.
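The namespace string entered in the WBEMTEST Connect dialog follows the UNC-style form \\HOST\root\cimv2. The short Python sketch below builds that string and applies the 800-1200 class count mentioned above as a rough sanity check. Both helper names are hypothetical, and a count outside that range is not by itself proof of a fault.

```python
def wmi_namespace_path(host: str, namespace: str = r"root\cimv2") -> str:
    """Build the UNC-style namespace string to enter in WBEMTEST."""
    host = host.strip().strip("\\")
    namespace = namespace.strip().strip("\\")
    if not host:
        raise ValueError("host must not be empty")
    # Two leading backslashes, the host, one backslash, then the namespace.
    return r"\\" + host + "\\" + namespace


def class_count_looks_typical(count: int, low: int = 800, high: int = 1200) -> bool:
    """Compare an Enum Classes result count with the typical range from the text."""
    return low <= count <= high
```

Calling wmi_namespace_path("192.168.23.1") produces the string for the host used in the steps above.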
Testing WMI Access from the Local Host

To determine whether WMI is working correctly on the host, from the host that you are trying to query:

1. Click Start > Run... > wbemtest
2. Click Connect… > Leave defaults > Connect

If this process fails, WMI/RPC may not be running on this host, or may need to be repaired. It is also possible that your WMI class structure is corrupted or inconsistent. In this case, see the instructions to repair your WMI class structure in Troubleshooting WMI. If it succeeds, this establishes that WMI is working correctly on the local host.

If local WMI access on the host works, you should isolate why the Collector is not able to collect data. If permission issues are suspected, try a remote WMI connection, specifying the credentials of a domain administrator account in your network, or a local administrator that is available on the target machine. If it succeeds, this establishes that WMI is working correctly on the local host and Collector machine, but the LogicMonitor services are running as an account with insufficient privileges.

If WMI is working correctly, but it cannot be accessed from a remote machine, there may be firewall issues, access rights issues, or DCOM issues. See the section under Access Denied in this article or search support.microsoft.com for more information on how to troubleshoot these issues.

Establishing WMI Access for Non-host-based Firewalls

When using non-host-based firewalls or third-party firewalls on Windows, you will need to open specific ports to allow for WMI communication. By default, port 135/tcp (RPC Endpoint Mapper) is used to establish communications. WMI is then assigned ports through DCOM, and communication is handled over a randomly assigned port in the dynamic port range.
In Windows Server 2008 and later versions, and in Windows Vista and later versions, the default dynamic port range is:

Start port: 49152
End port: 65535

Windows 2000, Windows XP, and Windows Server 2003 use the following dynamic port range:

Start port: 1025
End port: 5000

Be advised that LogicMonitor does not provide support for customizations made to operating systems. The minimum number of ports required may differ from computer to computer. Computers with higher traffic may run into a port exhaustion situation if the RPC dynamic ports are restricted. Take this into consideration when restricting the port range. For direction in restricting RPC dynamic port allocation, see the Microsoft support article How to configure RPC dynamic port allocation to work with firewalls. Another option is designating a fixed port for WMI, as discussed in the Microsoft support article Setting Up a Fixed Port for WMI.

WMI Error Codes

Error: 0x800706BA RPC Server Unavailable

Possible Issues: The Windows Firewall is blocking the connection.

Quick fix: Execute “netsh firewall set service RemoteAdmin enable” from a command console on the monitored host (not the host on which the Collector is running). After running this command, you can use the Windows Firewall snap-in console (wf.msc) to further tighten access to this port so that it is only accessible by a certain host, user, or interface. For more information, see here. For Windows Vista and later, see here.

Error: 0x80070005 – Access is denied by DCOM

Possible Issues: The user does not have remote access to the computer through DCOM.

Quick fix: Give the user Remote Launch and Remote Activation permissions in dcomcnfg.

- Click Start, click Run, type DCOMCNFG, and then click OK.
- In the Component Services dialog box, expand Component Services, expand Computers, and then right-click My Computer and click Properties.
- In the My Computer Properties dialog box, click the COM Security tab.
- Under Access Permissions, click Edit Limits.
- In the Access Permission dialog box, select the user used by the Collector in the Group or user names box (for example, the user ‘logicmonitor’, to allow that user to access WMI remotely).
- In the Allow column under Permissions for User, select Remote Access, and then click OK.

For more information, see here.

Error: 0x80041003 – Access is denied by a WMI provider

Possible Issues: If a user tries to connect to a namespace they are not allowed to access, they will receive error 0x80041003. By default, this permission is enabled only for administrators.

Quick fix: An administrator can enable remote access to specific WMI namespaces for a non-administrator user.

1. In the Control Panel, double-click Administrative Tools.
2. In the Administrative Tools window, double-click Computer Management.
3. In the Computer Management window, expand the Services and Applications tree.
4. Right-click the WMI Control icon and select Properties.
5. In the Security tab, select the namespace and click the Security button.
6. Locate the appropriate account and check Remote Enable and Read Security in the Permissions list.
7. Click the Advanced button and highlight the user.
8. Click Edit…
9. Ensure the Apply to: field is set to This namespace and subnamespaces.

For example, these settings can allow the user ‘logicmonitor’ to access the WMI namespace ‘ROOT/CIMV2’. For more information, see here.

WBEMTEST works, but collector does not

Possible Issues: Collector uses the wrong username/password.

Quick fix 1: If the device was already added into LogicMonitor, edit the device’s wmi.user and wmi.pass properties.

WMI Counter Repair

At times you may find that no matter what credentials you use and how many security hurdles you’ve bypassed, you still cannot fully monitor your Windows machine. In these instances, your operating system may have a corrupted or inconsistent WMI class structure. Other symptoms that you may be experiencing:

- Some WMI-collecting DataSources are successfully returning data or have discovered instances, but (most) others are returning No Data.
- You may be experiencing unexplained errors such as “Empty result set”, 0x80041003, or 0x80041017 from the Collector debug, WBEMTEST utility, or your custom application.
- You receive a different WMI result set from the Collector debug vs WBEMTEST, or an error from one and not the other.

Microsoft reports that this may happen when “… certain extensible counters corrupt the registry, or if some Windows Management Instrumentation (WMI)-based programs modify the registry”, but the exact nature of these issues is largely unknown and normally not worth troubleshooting extensively. You may use the sets of WMI counter repairs below to attempt to rebuild your WMI class structure:

Registering New Counters & Restoring Default Settings

CAUTION: These steps will overwrite all custom Performance counter registry settings that you may have configured and will replace them with default configurations.

Logged in as an Administrator user, please run the following:

    cd c:\windows\system32
    lodctr /R
    cd c:\windows\sysWOW64
    lodctr /R
    winmgmt /clearadap
      (Note: Deprecated for Windows versions post-Windows 2008.)
    winmgmt /verifyrepository
    winmgmt /salvagerepository
    winmgmt /resyncperf
    sc stop WmiApSrv
    sc start WmiApSrv

Rebuilding the WMI (CIM) Counter Repository

If you are still having issues, or see 0x80041003, “Empty result set”, “Unexpected WMI query result”, or “Expecting size 1, but got size 0” errors, log in as an Administrator user and run the following:

    wmiadap /c
    wmiadap /f
    wmiadap /r
    winmgmt.exe /verifyrepository
    winmgmt /salvagerepository
    winmgmt.exe /resyncperf
    sc stop WmiApSrv
    sc start WmiApSrv

Comprehensive WMI Class Rebuild

Logged in as an Administrator user, please run the following:

- Change the startup type of the Windows Management Instrumentation (WMI) service to “Disabled”.
- Stop the WMI service; you may need to stop the IP Helper service or other dependent services first before you are allowed to stop the WMI service.
- Rename the repository folder C:\WINDOWS\system32\wbem\Repository to Repository.old.
- Open a CMD prompt with elevated privileges.
- cd /d c:\windows\system32\wbem
- for /f %s in ('dir /b /s *.dll') do regsvr32 /s %s
- Set the WMI service startup type back to Automatic and start the WMI service.
- cd /d c:\ (go to the root of the C drive; this is important)
- for /f %s in ('dir /s /b *.mof *.mfl') do mofcomp %s

Performing a reboot after completing each fix block is ideal, but not absolutely necessary. Also, many of the above commands do not echo a response after completion, so do not be alarmed if you do not notice any changes after running a command.

Additional troubleshooting may be performed using the Windows WMI Diagnosis Utility (WMIDiag.vbs). For more information, please see this page.

Some Objects Are Not Discovered or No Data

Occasionally, LogicMonitor will not discover an IIS instance (or some other attribute) on a Windows server. This can occur when the performance classes are not correctly registered, or when your WMI class structure is corrupt or inconsistent. These issues can normally be corrected by running WMI counter repairs. Please see WMI counter troubleshooting for more information.

Recognized Issues

No Data Returned

Windows may report No Data for page file statistics if you have a server configured for “Automatically manage paging files for all drives”, or if one of the other “Automatic” options is selected. If you assign a minimum value explicitly, then these counters will become populated. To explicitly assign a minimum value:

- Navigate to Control Panel > System > Advanced tab > Performance section > Settings > Advanced tab > Virtual memory section and click “Change”.
- In Windows 2008 and later, there is an option at the top called “Automatically manage paging file size for all drives”; set this to a value.
- Then set it back to “Automatically manage paging file size for all drives”.

UAC Locked WMI Classes

There is a recognised condition in which monitored Windows hosts prevent access to all WMI classes except for Win32_OperatingSystem and Win32_Volume. To resolve this, User Account Control (UAC) must be disabled on monitored Windows hosts.

Note: Disabling UAC only applies to the built-in Administrator account and all other users who are members of the host’s local Administrators group.

There are two methods by which UAC may be disabled.

Method 1: Disabling UAC in the UI using the Windows ‘Local Security Policy’. This method lets you disable UAC on a single host. Follow these steps to disable UAC:

1. On your machine, launch Windows and search for Local Security Policy.
2. Under Local Policies, click Security Options. A list of policies and their status is displayed.
3. Click User Account Control: Run all administrators in Admin Approval Mode. A dialog box with options to enable or disable the security policy is displayed.
4. Click Disabled. Note: If the Disabled option is greyed out, configuration management (for example, Group Policy, DSC, etc.) may be blocking the adjustment.
5. (Optional) To understand the enable/disable options, click the Explain tab and read the details.
6. Click OK to disable UAC.
7. Reboot the Windows OS to apply the changes.

Method 2: Disabling UAC using the Windows Registry. This method lets you disable UAC on multiple hosts at a time. Follow these steps to disable UAC:

1. Locate the following registry subkey: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System
2. Change the value of “EnableLUA” from 1 to 0.
3. Reboot the device in order for these changes to take effect. This will disable UAC and permit data collection from all classes.

Alternatively, you can use PowerShell to disable UAC on Windows hosts.

- Right-click PowerShell and select Run as Administrator to launch an elevated PowerShell console.
- Get the current value.
    Get-ItemProperty -Path 'HKLM:\Software\Microsoft\Windows\CurrentVersion\Policies\System' -Name 'EnableLUA'
- Set the EnableLUA value to 0.
    Set-ItemProperty -Path 'HKLM:\Software\Microsoft\Windows\CurrentVersion\Policies\System' -Name 'EnableLUA' -Value '0'
- Reboot the OS to apply the registry changes.
- (Optional) You can rerun the “Get-ItemProperty” cmdlet to verify the changes.

Additional Troubleshooting

In other cases, monitoring will stop for some objects (such as disks) while other monitoring continues correctly. This may also indicate a WMI issue. Some options to resolve this may be:

- Ensure the Windows Management Instrumentation service is running.
- Try rebooting the system.
- For Windows 2000, Windows XP, and Windows Server 2003, download and run Windows PowerShell WMI.
- For Windows Vista, Server 2008, and Windows 7, run the “winmgmt /verifyrepository” command to check for an inconsistent repository.

Once you have gathered the data, review the Event Logs for WMI errors. If you have captured the output from a utility, review the logs and resolve any errors where possible. Since WMI is such an integral part of the Windows operating system, please engage a Microsoft Support Engineer for assistance.

How to contact Support when you really need it!
How to get help!?!

Look, obviously we would like you to use the community as the first point of contact when encountering questions or problems, as chances are your peers have dealt with them before! But if you’re encountering a problem the community really can’t fix, we’re also not going to make it difficult for you to find us. We’re here to help!

What to check before reaching out to support:

We have many clients that have used our software in ways we could have never imagined. Check out some of the resources below!

➡️ Start by searching and posting your question in our Product Discussion forums.
➡️ Check out our Product Tech Talk for how-tos, best practices, and industry knowledge from our LM Tech Experts.
➡️ Check out our Troubleshooting Guides
➡️ Read through Product Documentation
➡️ Check out our Support Center
➡️ Status updates: A live feed on the uptime of our platform
✅ Still no luck? Or if your issue requires our Support team to log in to your portal, please log a Support ticket.

🛑 NOTE: Issues with 3rd party scripts and implementations

Don’t get us wrong, we love 3rd party scripts just as much as any other person. However, given their nature (3rd party), we also can’t take responsibility for their inner workings. Please consult the 3rd party in question, or the developer that helped you set them up.

Alert Triage (i.e. Grouping & Alert Reduction)
Hi,

Per discussion with Russ G. & Kenyon W. & Jake C. yesterday, I would like to submit this as a feature request to the DEV team and see whether there is any way to add this feature to a future roadmap.

In short, it would be great if end users could configure multiple incidents/alerts into one group and generate only one alert (with the highest severity).

Here is an example of Tomcat being shut down, which generates a number of alerts:

1. Tomcat shutdown ‘critical’ alert is generated (1 alert)
2. ActiveMQ consumer count of specific queue has reached zero ‘Error’ alert (about 10-12 alerts in our case)

In this case, the end user would like to be able to configure LM to consolidate all alerts into one critical alert (i.e. all AMQ 'Error' alerts are cleared).

I saw something like this in PagerDuty and must say it’s a great feature to have in LogicMonitor to reduce the number of alerts being processed by the TechOps team: https://www.pagerduty.com/blog/alert-triage/

Thanks & Best Regards,
Horace

Common issues : High CPU usage on the Collector
This article provides information on high CPU usage on the Collector.

(1) General Best Practices
(a) First and foremost, we advise our customers to be on the latest General Release Collectors (unless advised not to). Further information on Collector versions can be found at the link below:
https://www.logicmonitor.com/support/settings/collectors/collector-versions/
The release notes of each newer Collector version also indicate whether we have fixed any known issues:
https://www.logicmonitor.com/releasenotes/
(b) Please also view our Collector Capacity guide to get a full overview of how to optimise Collector performance:
https://www.logicmonitor.com/support/settings/collectors/collector-capacity/
(c) When providing information on high CPU usage, it is useful to state whether the high CPU usage occurs all the time or only during a certain timeframe (and whether any environmental changes made on the physical machine may have triggered the issue). Please also tell us whether it started after adding new devices to the Collector or after applying a certain version of the Collector.

(2) Common Issues
In this section I will go through some of the common issues which have been fixed or worked on by our development teams:
(A) Check whether the CPU is used by the Collector (Java process), SBProxy or other processes.
(i) To monitor the Collector Java process: use the datasource Collector JVM status to check the Collector (Java process) CPU usage (as shown below).
(ii) To monitor SBProxy usage: use the datasource WinProcessStats.xml for Windows Collectors. For Linux, the datasource is still being developed.
(B) If the high CPU usage is caused by the Collector Java process, below are some of the common causes:

(i) Collector Java process using high CPU
How to confirm whether this is the same issue: in the Collector wrapper.log, you can see a lot of log lines like the ones below:
DataQueueConsumers$DataQueueConsumer.run:338]Un-expected exception - Must be BUG, fix this, CONTEXT=, EXCEPTION=The third long is not valid version - 0
java.lang.IllegalArgumentException: The third long is not valid version - 0
at com.santaba.agent.reporter2.queue.QueueItem$Header.deserialize(QueueItem.java:66)
at com.santaba.agent.reporter2.queue.impl.QueueItemSerializer.head(QueueItemSerializer.java:35)
This issue has been in Collector version EA 23.200

(ii) CPU load spikes on Linux Collectors
As shown in the image below, the CPU usage of the Collector Java process has a periodic CPU spike (on an hourly basis). This issue has been fixed in Collector version EA 23.026.

(iii) Excessive CPU usage despite not having any devices running on it
In the Collector wrapper.log, you can see log lines similar to those below:
[04-11 10:32:20.653 EDT] [MSG] [WARN] [pool-20-thread-1::sse.scheduler:sse.scheduler] [SSEChunkConnector.getStreamData:87] Failed to get SSEStreamData, CONTEXT=current=1491921140649(ms), timeout=10000, timeUnit=MILLISECONDS, EXCEPTION=null
java.util.concurrent.TimeoutException
at java.util.concurrent.FutureTask.get(FutureTask.java:205)
at com.logicmonitor.common.sse.connector.sseconnector.SSEChunkConnector.getStreamData(SSEChunkConnector.java:84)
at com.logicmonitor.common.sse.processor.ProcessWrapper.doHandshaking(ProcessWrapper.java:326)
at com.logicmonitor.common.sse.processor.ProcessorDb._addProcessWrapper(ProcessorDb.java:177)
at com.logicmonitor.common.sse.processor.ProcessorDb.nextReadyProcessor(ProcessorDb.java:110)
at com.logicmonitor.common.sse.scheduler.TaskScheduler$ScheduleTask.run(TaskScheduler.java:181)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
This issue has been fixed in EA 24.085.

(iv) SSE process stdout and stderr stream not consumed in Windows
Please note this issue occurs only on Windows Collectors, and the CPU usage of the Windows operating system has a stair-step shape as shown below. This has been fixed in Collector EA 23.076.

(v) Collector goes down intermittently on a daily basis
In the Collector wrapper.log, you can see similar log lines:
[12-21 13:10:48.661 PST] [MSG] [INFO] [pool-60-thread-1::heartbeat:check:4741] [Heartbeater._printStackTrace:265] Dumping HeartBeatTask stack, CONTEXT=startedAt=1482354646203, stack=
Thread-40 BLOCKED
java.io.PrintStream.println (PrintStream.java.805)
com.santaba.common.logger.Logger2$1.print (Logger2.java.65)
com.santaba.common.logger.Logger2._log (Logger2.java.380)
com.santaba.common.logger.Logger2._mesg (Logger2.java.284)
com.santaba.common.logger.LogMsg.info(LogMsg.java.15)
com.santaba.agent.util.Heartbeater$HeartBeatTask._run (Heartbeater.java.333)
com.santaba.agent.util.Heartbeater$HeartBeatTask.run (Heartbeater.java.311)
java.util.concurrent.Executors$RunnableAdapter.call (Executors.java.511)
java.util.concurrent.FutureTask.run (FutureTask.java.266)
java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java.1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java.617)
java.lang.Thread.run (Thread.java.745)
[12-21 13:11:16.597 PST] [MSG] [INFO] [pool-60-thread-1::heartbeat:check:4742] [Heartbeater._printStackTrace:265] Dumping HeartBeatTask stack, CONTEXT=startedAt=1482354647068, stack=
Thread-46 RUNNABLE
java.io.PrintStream.println (PrintStream.java.805)
com.santaba.common.logger.Logger2$1.print (Logger2.java.65)
com.santaba.common.logger.Logger2._log (Logger2.java.380)
com.santaba.common.logger.Logger2._mesg (Logger2.java.284)
com.santaba.common.logger.LogMsg.info(LogMsg.java.15)
com.santaba.agent.util.Heartbeater$HeartBeatTask._run (Heartbeater.java.320)
com.santaba.agent.util.Heartbeater$HeartBeatTask.run (Heartbeater.java.311)
java.util.concurrent.Executors$RunnableAdapter.call (Executors.java.511)
java.util.concurrent.FutureTask.run (FutureTask.java.266)
gobler terminated ERROR 5296
java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java.1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java.617)
java.lang.Thread.run (Thread.java.745)
This issue has now been fixed in Collector EA 22.228.

(C) High CPU usage caused by SBProxy
(i) Collector CPU spikes up to 99%
In some cases, poor performance of WMI or PDH data collection causes too many retries, which consumes a lot of CPU. In the Collector sbproxy.log, search for the log string shown below; if the retry count is nearly 100 per request, this will consume a lot of CPU.
,retry:
This is being investigated by our development team at this time and will be fixed in the near future.
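The log signatures quoted in section (2) lend themselves to a quick scripted check before opening a ticket. The following is a minimal Python sketch, not a LogicMonitor tool: the names `WRAPPER_SIGNATURES`, `SBPROXY_SIGNATURE`, `count_signatures` and `matched_issues` are hypothetical, and the only things taken from the article are the signature strings themselves. It assumes you have already read the relevant log file into a list of lines.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Signature strings copied from the wrapper.log excerpts in this article,
# mapped to the issue they point to. The labels are illustrative only.
WRAPPER_SIGNATURES: Dict[str, str] = {
    "The third long is not valid version": "(B)(i) Collector Java process using high CPU",
    "Failed to get SSEStreamData": "(B)(iii) Excessive CPU usage without devices",
    "Dumping HeartBeatTask stack": "(B)(v) Collector goes down intermittently",
}

# Search string the article gives for sbproxy.log (issue (C)(i)).
SBPROXY_SIGNATURE = ",retry:"


def count_signatures(lines: Iterable[str], signatures: Iterable[str]) -> Counter:
    """Count how many lines contain each signature substring."""
    signatures = list(signatures)
    counts: Counter = Counter()
    for line in lines:
        for sig in signatures:
            if sig in line:
                counts[sig] += 1
    return counts


def matched_issues(lines: Iterable[str]) -> List[str]:
    """Return the labels of the known wrapper.log issues seen in the lines."""
    counts = count_signatures(lines, WRAPPER_SIGNATURES)
    return [label for sig, label in WRAPPER_SIGNATURES.items() if counts[sig] > 0]
```

To use it, read wrapper.log into a list of strings and pass it to `matched_issues`; for sbproxy.log, pass the lines and `[SBPROXY_SIGNATURE]` to `count_signatures`. A match only suggests which issue to look at; the Collector version and the fix versions listed above still need to be checked by hand.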
(3) Steps to take when facing high CPU usage on the Collector
(i) Ensure the Collector has been added as a device and enabled for monitoring:
https://www.logicmonitor.com/support/settings/collectors/monitoring-your-collector/
There is a set of new DataSources for the Collector (LogicMonitor Collector Monitoring Suite, 24 DataSources), as shown below. Please ensure they have been updated in your portal and applied to your Collectors, and also ensure the Linux CPU or Windows CPU datasources have been applied to the Collector.
(ii) Record a JFR (Java Flight Recorder) recording in the debug command window of the Collector. This can be done as follows:
// unlock commercial feature
!jcmd unlockCommercialFeatures
// start a JFR recording; in a real troubleshooting case, increase the duration to a reasonable value
!jcmd duration=1m delay=5s filename=test.jfr name=testjfr jfrStart
// stop the JFR recording
!jcmd name=testjfr jfrStop
// upload the JFR recording
!uploadlog test.jfr
(iii) Upload the Collector logs: from the Manage dialog you can send your logs to LogicMonitor support. Select the manage gear icon for the desired Collector and then select 'Send logs to LogicMonitor'.

Credits: LogicMonitor Collector development team for providing valuable input in order to publish this article.

Alert Troubleshooting 101
One of the most common support cases we face every day is 'why am I receiving this alert?' This article explains the steps to determine why you are receiving an alert:
1) Understanding the alert received
2) Checking on validity via raw data and threshold
3) Checking on delivery

1) Understanding the alert received
The first step when you receive an alert via email, text or a ticketing system is to understand the alert: look at which device the alert is for, which datapoint triggered it, and the value reported. For example, an email alert message would appear as below.
LogicMonitor Alert:
Host: ##HOST##
Host Group: ##GROUP##
Datasource: ##DATASOURCE##
Datapoint: ##DATAPOINT##
Description: ##DSIDESCRIPTION##
Value: ##VALUE##
Level: ##LEVEL##
Start: ##START##
Duration: ##DURATION##
Reason: ##DATAPOINT## ##THRESHOLD## ##ALERTID##

2) Checking on validity via raw data and threshold
Next, once you have determined the alert source, you need to understand why the alert was triggered. First look at the threshold that is set for that particular datapoint. After checking the threshold, go to the Raw Data tab of the datapoint to check whether the values meet the threshold that is set.
For example, in this case a critical alert was received with a threshold of 80 90 95, and an alert will only be triggered if 20 consecutive polls fall within this range. The next step is to check the Raw Data tab to determine whether this condition was met.
Judging from the raw data above, all 20 polls met the threshold level of 80 90 95. The level of the alert is determined by the last poll: since the last poll was 96.67, which falls in the range of a critical alert, a critical alert was sent.

3) Checking on delivery
The last step is to check the alert rule and escalation chain to see whether the alert was matched to the correct rule and escalation chain.
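As an aside on step 2, the validity check can be sketched in a few lines of Python. This is only an illustration of the reasoning in the example, not how LogicMonitor evaluates thresholds internally. The names `THRESHOLDS`, `level_for` and `alert_level` are hypothetical, and the sketch assumes a "greater than" comparison with 80, 90 and 95 as the warning, error and critical limits, since the example does not state the operator.

```python
from typing import Optional, Sequence

# Limits from the example threshold "80 90 95".
# Assumption: warning / error / critical, compared with ">".
THRESHOLDS = (("warning", 80.0), ("error", 90.0), ("critical", 95.0))
CONSECUTIVE_POLLS = 20  # from the example: 20 consecutive polls


def level_for(value: float) -> Optional[str]:
    """Return the highest level whose limit the value exceeds, or None."""
    level = None
    for name, limit in THRESHOLDS:
        if value > limit:
            level = name
    return level


def alert_level(polls: Sequence[float],
                required: int = CONSECUTIVE_POLLS) -> Optional[str]:
    """Return the alert level, or None if no alert should be triggered.

    An alert needs the last `required` polls to all breach some limit;
    its severity is then taken from the most recent poll.
    """
    if len(polls) < required:
        return None
    window = polls[-required:]
    if any(level_for(value) is None for value in window):
        return None
    return level_for(window[-1])
```

With 19 polls at 85 followed by a final poll of 96.67, `alert_level` returns "critical", which matches the example: every poll breached the threshold, and the last poll sets the severity. Comparing the raw data against a check like this is a quick way to confirm that an alert was valid before looking at delivery.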
To do so, go to the Alert Tuning tab and check the alert routing for that particular instance and datapoint. Here you can see that the alert rule applied is Critical - Default and the escalation chain is Critical - Default. Under the escalation chain is the list of email addresses that will receive a notification when the threshold is met.

Custom recommendation link page for each alert threshold definition
Per discussion with Jeff Woeber, I want to submit this as a feature request to LogicMonitor: each alert threshold within each datasource (e.g. Tomcat ThreadPool- ) could have its own wiki troubleshooting page. It would be a great feature if LogicMonitor let users specify their own troubleshooting page as an optional field for each datasource. Users could then customize a specific wiki page as the recommendation whenever an alert is sent to PagerDuty.