Forum Discussion

mark_rowlands
3 years ago

SQL performance graphs stopped working after collector upgrade to 31.002

Shame on me for only just noticing, but all of our SQL performance graphs stopped working after the upgrade to 31.002. The data is (mostly) being collected OK by the Microsoft_SQLServer_DAtabases ds, but all of the graphs are stubbornly blank. We are going to upgrade our collectors to 31.003 later this week, so I will hold off creating a case until then, but if anyone else has noticed this issue it would be nice to know, especially if you have a solution.

5 Replies

  • We saw similar issues with Hyper-V metrics, which might have a similar root cause -- see comments from our ticket below.  We have found this problem in general with LM since we started working with it -- lack of data collection is silently ignored as a rule.  You can find various ways to deal with it, and LM tries to add meta-methods like troubleshooter datasources, but the overall system for this is very fragile.  The right fix would be to allow for longer-term analysis of data and to make 'unknown' a first-class alert option like critical, warning and error.  You could then write one rule for all datapoints looking for (say) 1 hour of unknown results and be done with it.  Both of those features would have much more benefit beyond this use case.  Suffice to say, I filed both of those years ago as feature requests and you can guess how that went :).

    Our next best solution is to analyze dashboards for widget problems.  You will notice that when a graph widget has no data, LM says so in the display.  We struggled for a bit with how to get that data, and I finally did figure it out.  Now our backup script has a "check" option that scans a bunch of things for problems, including widgets.  So we know within a day if something has stopped reporting.  It could be faster if we scheduled checks more often, but a day is far better than "when you happen to see it broken during a client review meeting."
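    To make the "1 hour of unknown results" idea concrete, here is a minimal sketch of such a rule applied to data that has already been fetched. It is not an LM feature or API. The sample layout (an array of [epoch_seconds, value] pairs with undef standing for "unknown") and the function names are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Seconds between the most recent known (defined) value and the newest sample.
# $samples is an array ref of [epoch_seconds, value] pairs, oldest first;
# undef is assumed to represent an "unknown" (no data) result.
sub seconds_since_last_value {
    my ($samples) = @_;
    return 0 unless @$samples;
    my $newest = $samples->[-1][0];
    for my $s (reverse @$samples) {
        return $newest - $s->[0] if defined $s->[1];
    }
    # no known value anywhere in the window: report the whole span
    return $newest - $samples->[0][0];
}

# True (1) when the datapoint has been unknown for at least $threshold
# seconds (default: one hour), otherwise 0.
sub is_stale {
    my ($samples, $threshold) = @_;
    $threshold //= 3600;
    return seconds_since_last_value($samples) >= $threshold ? 1 : 0;
}
```

    Running one such check over every datapoint would flag anything that has quietly stopped reporting, whatever the cause.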

    Mark

    Collector GD 31.003 now has the fix, and you can update your collector to v31.003. Please test and let us know if the issue is fixed for you.
    
    https://www.logicmonitor.com/support/gd-collector-31-003
    
    Fixes
    
        Issues were reported for batch script data collection due to multiple or no periods in the formation of instance name and datapoint name. For example, in the datasource the key value pair has multiple periods – keyvalue(##WILDVALUE##.storage.totalsize.filesystem) and therefore it failed.
    
        We have fixed this issue. A datapoint now will not fail even if it has multiple periods or no periods.
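
    To illustrate why periods in a key can trip up a parser, the sketch below reads "key=value" output lines and looks a datapoint up by its complete key, so the number of periods never matters. This is not the collector's actual implementation. The line format is inferred from the keyvalue(...) example above, and the function names are hypothetical.

```perl
use strict;
use warnings;

# Parse "key=value" lines into a hash ref. Only the first "=" separates key
# from value, so a key may contain any number of periods (or none).
sub parse_batch_output {
    my ($output) = @_;
    my %values;
    for my $line (split /\r?\n/, $output) {
        next unless $line =~ /^\s*([^=]+?)\s*=\s*(.*?)\s*$/;
        $values{$1} = $2;
    }
    return \%values;
}

# Look up a datapoint by its full key after replacing ##WILDVALUE## with the
# instance name. Returns undef when the key is absent.
sub lookup_datapoint {
    my ($values, $key_template, $wildvalue) = @_;
    (my $key = $key_template) =~ s/##WILDVALUE##/$wildvalue/g;
    return $values->{$key};
}
```

    Matching on the whole key avoids guessing which period separates the instance name from the datapoint name, which is where a split on the first or last period goes wrong.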
  • You may want to upgrade only one collector to test first.  We had to downgrade the collectors that were running any type of SQL query datasources back to 30.002.  They weren't polling at all on 31.003.  Since we downgraded, we are at least getting partial polls at the moment (a handful each hour instead of every 3 minutes).

  • I am curious...how are you polling dashboard widgets to find the failed polls?

  • 13 minutes ago, Jeff8682 said:

    I am curious...how are you polling dashboard widgets to find the failed polls?

    This is done via check functions within our backup script bound to REST API paths.  We also added a mechanism to suppress warnings for known issues via a text widget in the dashboard (either permanently or with an expiration date).  I don't have expanded checks for every widget type, but the big ones, like cgraph, do have them.  The main check for widgets that mirrors what you see on the screen is below.  The reference to $warning_suffix mainly relates to the suppression logic.  We get the widget_data for additional analysis (e.g., a check for gappy graphs -- "at least 20% missing data and spread out over at least 10 samples").  Since it can be a bit intensive, we only run the check logic once per day.  We also check for netscan policy problems (since there are a few cases where integrity constraints are not applied in the UI, like when a collector is removed) and user groups (to validate that all users are in one, for example).

            # these widget types are excluded from the data check
            if ($widgettype !~ /^(?:text|html|flash|alert)$/) {
                # get recent data from supported widget types, bail out on exceptions
                # (start is in epoch milliseconds, 30 minutes before now)
                if ($widgettype eq "gmap") {
                    # must fetch all results for gmap
                    eval { $widget_data = $LMAPI->get_all(path => "/dashboard/widgets/$widgetid/data", start => 1000*(time-1800)); };
                }
                else {
                    eval { $widget_data = $LMAPI->get_one(path => "/dashboard/widgets/$widgetid/data", start => 1000*(time-1800)); };
                }
                if ($@) {
                    # reduce the exception to a one-line message
                    my $error = $@;
                    $error =~ s/^raw request:.*//sm;   # drop the raw request dump and everything after it
                    $error =~ s/.*:\s+\d+\s+//sm;      # strip everything through the last ": <digits> " sequence
                    $error =~ s/\n+/ /gsm;             # collapse newlines into spaces
                    warn "$COMPANY: $dashboardname: $widgetname\[$widgettype\]: widget exception: $error$warning_suffix\n";
                    return;
                }
                elsif (not defined $widget_data) {
                    # the request succeeded but returned nothing usable
                    warn "$COMPANY: $dashboardname: $widgetname\[$widgettype\]: unable to load widget data$warning_suffix\n";
                    return;
                }
            }
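
    The gappy-graph rule quoted above ("at least 20% missing data and spread out over at least 10 samples") could look something like the sketch below. This is not the script's actual check. It assumes one graph line has already been extracted from the widget data as a plain list of values with undef for missing points, and it reads "spread out" as the distance from the first to the last missing sample. The function name and option names are hypothetical.

```perl
use strict;
use warnings;

# Return 1 when a series looks "gappy", otherwise 0.
# $values is an array ref of samples, with undef marking a missing point.
# Options (both are assumptions about how to read the quoted rule):
#   min_fraction - minimum share of missing samples (default 0.20)
#   min_spread   - minimum number of sample positions covered from the first
#                  to the last missing sample, inclusive (default 10)
sub is_gappy {
    my ($values, %opt) = @_;
    my $min_fraction = $opt{min_fraction} // 0.20;
    my $min_spread   = $opt{min_spread}   // 10;
    return 0 unless @$values;

    # indexes of the missing samples
    my @missing = grep { !defined $values->[$_] } 0 .. $#$values;
    return 0 unless @missing;

    my $fraction = @missing / @$values;
    my $spread   = $missing[-1] - $missing[0] + 1;
    return ($fraction >= $min_fraction && $spread >= $min_spread) ? 1 : 0;
}
```

    Requiring a minimum spread keeps a single short outage from being reported as a gappy graph, while scattered dropouts such as the partial polls mentioned earlier in the thread would still be caught.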

     

  • Thanks, might have to look into implementing something like this for our environment.