Forum Discussion
- mnagelProfessor
And see that there is a no-data alert set on a datapoint in the tuning screen (requested previously and definitely related to this). Right now it requires module diving to even know it is set. This suggestion would definitely help there, but seeing it is set is also important.
Is this still coming?
We just found a bunch of devices not retrieving data because MFA had been enabled on an account. No alert was received, so we have lost months of data collection and nearly experienced an outage.
- Mike_RodriguesProduct Manager
@Gary Dewrell @mnagel we're looking at adding this as part of a later stage of the LM Exchange, basically extending the Alert Tuning concept. We're also looking to extend it to the global level so you can do it globally without editing the module.
Thank you. Do you have an estimate on time frame? Won't hold you to it, just wanting a general idea.
- DanBAdvisor
Hello, looking at this since we are having this issue today on devices. There could be situations where a client asks us for a report on a device(s) and when we go to check there is just "NoData" for months and no alert, because the device responded to other checks and so was considered up, yet the important metrics we are supposed to be getting are not responding. How do we alert when NoData is returned on important metric checks?
@Michael Rodrigues any progress on the "Alert Tuning" concept you mentioned? This sounds interesting.
- DanBAdvisor
^Ignore this part of my post above:
Hello, looking at this since we are having this issue today on devices. There could be situations where a client asks us for a report on a device(s) and when we go to check there is just "NoData" for months and no alert, because the device responded to other checks and so was considered up, yet the important metrics we are supposed to be getting are not responding. How do we alert when NoData is returned on important metric checks? I know we can alert on a DataPoint for a DataSource when NoData is returned, but it's at the global level.
- mnagelProfessor
Right. "no data" should be on par with critical, error and warning so the threshold can be overridden properly if needed for specific devices/groups. It is very hard to know without digging into each DP definition when "no data" will even alert (no indication in the tuning page) and it is definitely unclear when you can unambiguously check (most times at least, the "no data" status applies to a datapoint that otherwise has no competing alert threshold). I also run into embarrassing situations where data acquisition has silently failed due to collection faults -- LM has been adding more troubleshooters to help there, which is appreciated. Still, I just ran into an issue with the "new/improved" Cisco WLC datasource group where APs that are disconnected don't even have a "no data" DP anymore, so you cannot know this happened. In fact, that DS now removes dead instances immediately. Realized this weekend as we moved from Cisco WLC to Meraki that not a single alert generated from WLC APs as they were disconnected -- they just vanished from existence. Technically true, but should be in alarm until confirmed intentional. The only way to fix this now is to edit the DS, which I loathe doing as it (for now) severs my tie to the original and creates a risk that an update will break changes. I am not a fan of cloning as it also severs the tie to the original and makes updates painful.
- DanBAdvisor
Yeah, a similar situation happened today with a requested report for our Tegile storage devices for a client. I was asked for a report and when I went to generate one it was empty. Looking back 3 months, it hasn't collected any data except for a 1hr span when it was responding to SNMP. The issue was with the Tegile device itself, but again no alerts were generated. The device responded to other datapoint calls so it was considered responding, but the actual storage datapoints were empty. So yeah, the tool definitely needs some tweaking to better report when devices that were sending data stop sending any. Yes, we can modify that DS and enable it on those datapoints, but that's a manual one-by-one thing, time consuming, and if the DS ever updates in the Exchange you're back to square one.
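To make the "was sending data, now isn't" case concrete, here is a minimal sketch of the check being asked for. It is not LogicMonitor functionality: it assumes you have already exported a datapoint's history as (timestamp, value) pairs by some means, with None standing in for "No Data", and the name `find_nodata_gaps` is made up for illustration.

```python
from typing import List, Optional, Tuple

# (epoch seconds, value); None stands in for a "No Data" sample (assumption)
Sample = Tuple[int, Optional[float]]


def find_nodata_gaps(samples: List[Sample], min_gap_seconds: int) -> List[Tuple[int, int]]:
    """Return (start, end) timestamps of No Data runs lasting at least min_gap_seconds.

    Only runs that follow at least one real value are reported, so a datapoint
    that has never returned data is not flagged by this check.
    """
    gaps: List[Tuple[int, int]] = []
    seen_value = False   # has this datapoint ever reported a real value?
    gap_start = None     # timestamp of the first None in the current run
    last_ts = None       # timestamp of the previous sample

    # Sort by timestamp only; values may be None and cannot be compared.
    for ts, value in sorted(samples, key=lambda s: s[0]):
        if value is None:
            if seen_value and gap_start is None:
                gap_start = ts
        else:
            # A real value closes any open run; last_ts is the final None in it.
            if gap_start is not None and last_ts - gap_start >= min_gap_seconds:
                gaps.append((gap_start, last_ts))
            gap_start = None
            seen_value = True
        last_ts = ts

    # A run still open at the end of the history is the "still broken" case.
    if gap_start is not None and last_ts - gap_start >= min_gap_seconds:
        gaps.append((gap_start, last_ts))
    return gaps
```

Something like this run over the storage datapoints would have flagged the three-month hole, though it only helps after the fact unless it runs on a schedule.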
- mnagelProfessor
For SNMP faults (assuming that is the issue here), we have some standard rules in all the clients we manage:
This seems to do the trick most of the time, but I am sure there are cases we are still missing. Those datapoints have "No Data" enabled and are not used for thresholds otherwise, making them unambiguous.
We also implemented the alarm for the SNMP uptime datasources, which covers situations where SNMP is misconfigured or not working. It will miss batchscript, script, WMI, etc., so it's definitely not comprehensive.
The problem with "No Data", and I've had some back-and-forth with the CSM, is that there is currently no way to tell whether the "No Data" is intentional (the datapoint is not relevant given the current status) or a sign of a problem (data should be returned but is not). This shows up in scripts that set a datapoint in some paths but not others. The new SNMP interfaces batchscript will return 32 or 64 bit datapoints depending on the interfaces and not return any value for the others, and the new SSL certificate monitor has a lot of datapoints that will only return values in certain situations.
The dev team noted that returning a '0' isn't really correct, but 'No data' is also a problem because it becomes impossible to tell when datasources or devices are not working correctly. I'm pushing on the CSM to prioritize some improvement that can be used to check for problems (clean up the environment) and/or alarm reliably when data isn't being returned.
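One way to picture the distinction between intentional and unintentional "No Data" is for the script itself to say which datapoints it skipped on purpose. The sketch below is only an illustration of that idea, not how any shipped datasource works: it assumes a collector that reads `key=value` lines, and the `_applicable` suffix and the `render_datapoints` helper are hypothetical names.

```python
from typing import Dict, Iterable, List, Optional


def render_datapoints(expected: Iterable[str],
                      collected: Dict[str, Optional[float]]) -> List[str]:
    """Build key=value output lines with an explicit applicability flag.

    collected[name] is a number -> value was read
    collected[name] is None     -> script skipped it on purpose
    name missing from collected -> script failed to read it
    """
    lines: List[str] = []
    for name in expected:
        if name not in collected:
            # Fault: no value line, but the flag says one was expected,
            # so "No Data" on this datapoint is worth alerting on.
            lines.append(f"{name}_applicable=1")
        elif collected[name] is None:
            # Intentional: the datapoint does not apply in this state.
            lines.append(f"{name}_applicable=0")
        else:
            lines.append(f"{name}_applicable=1")
            lines.append(f"{name}={collected[name]}")
    return lines


if __name__ == "__main__":
    # Hypothetical 64-bit interface: the 32-bit counter is skipped on purpose,
    # while the error counter was expected but could not be read.
    for line in render_datapoints(["in32", "in64", "errors"],
                                  {"in32": None, "in64": 12.0}):
        print(line)
```

This avoids returning a misleading '0' while still leaving something to alert on: a missing value with the flag at 1 is a fault, and a missing value with the flag at 0 can be ignored.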
@mnagel For your instance deletion issue check out this thread; I keep running into this issue and don't think there is a clean way to handle it yet.
/topic/5834-active-discovery-and-instance-deletion/