SDT UIv4 after winter time
We are experiencing a very weird bug with the SDT function in UIv4. When selecting any date before “winter time”start, it works just fine (as long as we use 12H format…Why can’t we choose between 12 and 24?). Also UTC shouldn’t be affected by winter time. Here you see that we can easily put an SDT as expected When trying to put an SDT after winter time starts (29th of October) we are unable to directly write the time, like we did in the gif above. When trying to write 10:00 it just increases the number until it changes the date and we have to go back to select the right date. Also we get an error saying that the format must be HH:MM, which it clearly is but we are still not allowed to put an SDT on it in UIv4. All of this works just fine in UIv3.86Views1like2CommentsAnybody else disabling Meraki instances by default?
If you are, I’d like to know if you’re experiencing the problem we are that LM has tried and failed to troubleshoot: For some random reason, a number of meraki instances disappear from LM during one discovery cycle and reappear during the next discovery cycle. This isn’t normally a big problem since they instances aren’t set to delete until 30 days. Normally, they’d just show back up and have a gap in data. However, in our casewe have a business need to have the instances disabled on discovery (we charge the customer per Meraki device we actively monitor). This means that instances that have been discovered and enabled for monitoring randomly disappear and reappear as disabled instances. Also, any customer instance level properties that were added to the instance also are not present on the rediscovered instance. In the last 3 hours, there have been 3,577 instances (out of somewhere around 18,000) that disappeared and reappeared in this way. The problem was so pervasive that I had to write a script to loop through all instances and enable them based on a master billing sheet.Solved350Views14likes11CommentsPSA: LM wipes good known properties when unknown results occur
I have recently found that due to the excellent programming skills in the dev team that properties that have previously been autodiscovered can be wiped out when ephemeral issues produce unknown (no data) results. A good example is system.ips -- if the data has been scanned properly in the past and a blip occurs with no data, the previous values get overwritten with just the configured IP of the device. That leads to various fun side effects like NetFlow data not being matched to the device. To make things worse, the “no data” result does not set an internal flag to run a new AD scan earlier and you have to wait up to 24 hours for a regularly scheduled scan. I created a bug ticket requesting they set that flag and run a new scan as soon as possible, but was basically told to pound sand. My workaround was to use an undocumented API endpoint to trigger on specified devices so I stop losing NetFlow data and I scheduled it hourly. The “solution” I was given was to add a netflow property to hardcode the needed IP address for each device -- works, but it is a brittle fix and leads to undesirable manual property management. Beyond that, this issue affects more than NetFlow, that was just the problem that lead me to realize what was happening. Other properties routinely get messed up that could affect processing. This class of problem (replacing good data with unknown data) frequently occurs in modules as well -- for example, a lot of the Powershell configsource modules lack sufficient error checking and unknown results replace previously known good results, leading to change thrashing. Or they often forget to sort/normalize output leading to similar effects. The good news on those is they usually (eventually) listen to me. Anyone who wants to use my workaround can use this script (or at least the central logic if you prefer something other than Perl). I still lose data, but the window is smaller. https://github.com/willingminds/lmapi-scripts/blob/master/lm-action73Views2likes7CommentsPotential 187 UI bug
I tried to report this via Engineer chat, but no response. We have 187 early access. Looks like there may be a critical UI bug relating to the new Instance Group functionality… If you select an instance and then the Instance Group, the UI does NOT update to show the overview graphs etc. Please ping me via email to review the issue in our early access portal on a Zoom meeting.67Views3likes3Commentsheads up - property corruption due to unknown results
I wanted to share this with everyone since it bit me recently. Sometimes LM will reset the value of properties like system.ips to just the single IP associated with the resource if something disrupts access to SNMP even briefly. The problem with this is it can impact other features, like Netflow binding. I am still battling this out with support but in the meantime I wrote a script to trigger AD for specified devices (had to use an undocumented endpoint) and I schedule that hourly to limit the damage in case it happens (normally AD is triggered once per day unless specific changes occur). My change logs show system.ips resetting fairly often, so the script is definitely helping. I explained to support that thwacking data due to an unknown result is a bug and no property should be changed due to that situation, but I imagine this bug will be hard to unwind. I also recommended a partial fix where AD could be triggered by a “host up” event to limit the damage, but that is a hack. Avoiding data corruption in the first place is the right fix. FWIW, the main loop of my script is below (it is in an ancient language, but the logic should be clear :)). DEVICE: for my $d (@{$devices}) { if ($DEVICES{lc $d->{displayName}}) { if ($ACTION eq "scheduleAutoDiscovery") { verbose 1, "scheduling AD for $d->{displayName}\n"; if (not $lmapi->post(path => "/device/devices/$d->{id}/scheduleAutoDiscovery")) { warn "ERROR: $ACTION failed\n"; } } else { die "ERROR: unsupported action: $ACTION\n"; } } }39Views5likes0CommentsDifferent clock format on alerts (resource)
Alerts in UIV3 shows them in 24H format,but in the new UIV4 it shows them in 12H AM/PM format. This happens when viewing alerts on a resource or instance, but when going to the alerts page all alerts are in 12H format on both versions. This is from UIv3 (Resources page) This is from UIv4 (Resources page) -H38Views0likes1Comment497 days and counting........
You might have received an alert saying your linux based device has just rebooted, but you know that it has been up a long time. A switch might have just sent an alert for every interface flapping when they have all been up solidly. The important question to ask here is how long has the device been up? If its been up for 497 days,994 days,1491 days or any multiple of 497 then you are seeing the 497 day bug, that hits almost every linux based device that is up for a good length of time. Anything using a kernel less than 2.6 computes the system uptime based on the internaljiffies counter, which counts the time since boot in units of 10 milliseconds, or jiffies. This counter is a 32-bit counter, which has a maximum value of 2^32, or 4,294,967,296. When the counter reaches this value (after 497 days, 2 hours, 27 minutes, and 53 seconds, or approximately 16 months), it wraps back around to zero and continues to increment. This can result in alerts about reboots that didn’t happen and cause switches to report a flap on all interfaces. Systems that use 2.6 Kernel and properly supply a 64 bit counter will still alert incorrectly when the 64 bit counter wraps. A 32 bit counter can hold4,294,967,295( /4,294,967,295864000/8640000 = 497.1 days) A 64 bit counter can hold18,446,744,073,709,551,615 . (18,446,744,073,709,551,615/8640000 =2135039823346 days or 5849424173 years) Though I expect in 6,000 million years we will all have other things to worry over.106Views0likes5CommentsUptime - Bug Report
I’d like to re-initiate this bug report. The Uptime resetting counters at 497 days or 469 days (historical) I just had a similar false alarm telling me that my devices rebooted, when they did not. Please have the DEV team review this specific monitor and determine how the system can display 497+ days “uptime” --------------------------------- ________________________________________ SEP 11, 2015 | 01:56PM CDT Original message ________________ wrote: Support team at logic monitor, Is it possible to request adjustment to the "Uptime" data source monitor so that it does not alert when the counter resets from 11111111111111111111111111111111 to 00000000000000000000000000000001 The developer was aware enough of the event cause to code explanation in to the system alert: could the alert be altered to not-alert when the counter resets? - ________________ From: ________________ Sent: Friday, September 11, 2015 2:44 PM To: Subject: SC# Error: 6348 ________________ is reporting it has only been up for 0.43 minutes Hello ________________ , We have received the following monitoring alert and a ticket #6348 has been created to track your issue. An engineer is assigned and is working to resolve this issue. Thank you. We are investagating if the VM really did reboot or if this alert is coming up for a different reason: ________________is reporting it has only been up for 0.43 minutes, as of 2015-09-11 14:28:48 EDT. If this was an unexpected reboot, please investigate the system logs. NOTE: if ________________has been up for 469 days without a reboot, this alert will trigger due to a counter wrap in the host. In this case, you may disregard this alert. (But the host is probably due for an OS update.) For any inquiries please contact our NOC at support@highpoint.com<mailto:support@highpoint.com> or call 1-855-485-8324 (TECH). Regards, ________________ NOC Support Engineer26Views0likes1CommentCloned JDBC Datasource doesn't preserve DB credentials
Hello, The title says it all really. I cloned a custom JDBC datasource which already had credentials filled in, but the credentials weren't preserved. I wasn't told about this and had to go digging in the collector logs: [06-03 07:19:34.455 UTC] [MSG] [DEBUG] [pool-29-thread-29::collector.jdbc:Task:155185:db-server.some.where:PostgresServer-5432:jdbc:166] [JDBCTask._collect:223] Create JDBC connection, CONTEXT=timeout=10s, url=jdbc:postgresql://db-server.some.where:5432/, username=, passwordLength=0 [06-03 07:19:34.463 UTC] [MSG] [ERROR] [pool-29-thread-29::collector.jdbc:Task:155185:db-server.some.where:PostgresServer-5432:jdbc:166] [JDBCTask._collect:231] Failed to create JDBC connection, CONTEXT=timeout=10s, url=jdbc:postgresql://db-server.some.where:5432/, username=, passwordLength=0, EXCEPTION=FATAL: no PostgreSQL user name specified in startup packet org.postgresql.util.PSQLException: FATAL: no PostgreSQL user name specified in startup packet at org.postgresql.Driver$ConnectThread.getResult(Driver.java:341) at org.postgresql.Driver.connect(Driver.java:264) at java.sql.DriverManager.getConnection(DriverManager.java:664) at java.sql.DriverManager.getConnection(DriverManager.java:247) at com.santaba.agent.collector3.jdbc.JDBCTask._collect(JDBCTask.java:228) at com.santaba.agent.collector3.DataCollectingTask.execute(DataCollectingTask.java:123) at com.santaba.agent.collector3.CollectingTaskExecutionRunable.run(CollectingTaskExecutionRunable.java:20) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) I've a href="https://communities.logicmonitor.com/topic/751-debugging-information/" rel="">mentionedthis already, but it would be pretty useful to have the above lines in a popup window or error log right in the web UI, rather than digging into the collector logs.5Views0likes0Comments