Forum Discussion

David_Bond
Professor
21 days ago

Azure Autobalanced Collectors - 5 minute windows of ping failures

Here's a weird one.  We have a customer with three auto-balanced collectors in Azure.  They see the following pattern for ping loss (0 is good):

  • 12:34: 0
  • 12:35: 0
  • 12:36: 100
  • 12:37: 100
  • 12:38: 100
  • 12:39: 100
  • 12:40: 100
  • 12:41: 0
  • 12:42: 0
  • 12:43: 0

So regularly (6 or 7 times a day for ALL Resources), there are 5 minute windows where the ping loss is 100%.  PRTG (admittedly on a different Azure subnet) is showing no issues whatsoever.
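One way to confirm that the outages really are fixed-length blocks is to scan the exported ping-loss series for runs of consecutive 100% samples. The following is a minimal sketch, assuming the data is available as (timestamp, loss) pairs at one-minute resolution; the function name `find_loss_windows` is hypothetical, and the sample values are the ones quoted above.

```python
def find_loss_windows(samples, threshold=100.0):
    """Return (first_ts, last_ts, sample_count) for each run of
    consecutive samples whose loss is at or above the threshold."""
    windows = []
    run = []
    for ts, loss in samples:
        if loss >= threshold:
            run.append(ts)
        elif run:
            # A good sample ends the current run of failures.
            windows.append((run[0], run[-1], len(run)))
            run = []
    if run:
        # The series ended while still inside a failure run.
        windows.append((run[0], run[-1], len(run)))
    return windows


# The pattern quoted in the post (0 is good, 100 is total loss).
samples = [
    ("12:34", 0), ("12:35", 0), ("12:36", 100), ("12:37", 100),
    ("12:38", 100), ("12:39", 100), ("12:40", 100), ("12:41", 0),
    ("12:42", 0), ("12:43", 0),
]
print(find_loss_windows(samples))
# → [('12:36', '12:40', 5)]
```

Running this per resource and comparing the start times would show whether all resources fail in the same minutes or only those assigned to one collector at that moment.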

So... what's to blame?

9 Replies

  • Is it just Ping? Or other datasources as well?
    Have you identified whether it's a random collector or always the same collector?
    Is there a pattern with the times? 

    • David_Bond
      Professor

      Just ping, and no apparent pattern.  I find it difficult to believe that LogicMonitor is at fault, but PRTG (also running on the same Azure subnet) is not affected.  It's such a strange pattern.  Why would separate sets of ten pings, one minute apart, ALL be affected, and ALWAYS for 5 minutes?  If I didn't know better, I would suspect some weird ARP or routing issue in the Collector's Windows Server OS, but that doesn't seem right either.

      Utterly baffled.

  • Just Ping.  There is seemingly no pattern.  The weird thing is that it's always in 5 minute blocks, but the Ping DataSource polls 10x ICMP send/response every minute, so these are independent measurements.  I cannot believe that it's LogicMonitor's fault UNLESS it's related to the auto-balancing, but that doesn't seem right either.

    I wondered if it was something that anyone else had seen as being a problem in Azure networking environments, perhaps an oddity of routing?  I'm at a complete loss.

    • Dave_Lee
      Advisor

      Not specific to Azure, but we've occasionally seen an issue where a collector fails to interpret the results of the PING check.  Jumping onto the collector itself and running a ping works fine (as in, using the OS ping utility), but the collector software doesn't seem to be able to do it.  That situation is different though: the collector services have to be restarted to fix it.

  • We've moved PRTG to the same subnet and again, PRTG does not suffer from this issue. It seems to be isolated to LogicMonitor.

    • LMPatrickA
      Employee

      When you look at the Collector device that is monitoring the devices where the PING loss is happening, I would recommend checking the Collector Data Collecting Tasks datasource to see if there is some sort of cyclical overload of those tasks happening. Specifically I would look at the graphs for Unavailable Thread Scheduling and Queue datapoints before, during and after the Ping loss time periods. It may indicate that the collector ABCG is not keeping up with Ping specifically, which is something you may be able to tune around. 
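The before/during/after comparison suggested above can be sketched as follows. This is only an illustration, assuming the relevant collector datapoint has been exported as (minute, value) pairs; the function name `split_around_window` and the numbers in the example series are placeholders, not real measurements or a LogicMonitor API.

```python
from statistics import mean


def split_around_window(series, start, end, margin=5):
    """Average a datapoint before, during and after a ping-loss window.

    series: iterable of (minute, value) pairs, minute as an integer.
    start, end: first and last minute of the loss window (inclusive).
    margin: how many minutes on either side to compare against.
    """
    before = [v for t, v in series if start - margin <= t < start]
    during = [v for t, v in series if start <= t <= end]
    after = [v for t, v in series if end < t <= end + margin]
    return {
        name: (mean(vals) if vals else None)
        for name, vals in (("before", before), ("during", during), ("after", after))
    }


# Placeholder series: minutes 31-45 of the hour, with a made-up queue
# depth that rises only during the 36-40 loss window.
series = (
    [(t, 0) for t in range(31, 36)]
    + [(t, 4) for t in range(36, 41)]
    + [(t, 0) for t in range(41, 46)]
)
print(split_around_window(series, start=36, end=40))
```

If the "during" average is clearly higher than the other two across several loss windows, that would support the idea that the collector is falling behind on ping tasks; if all three are flat, the cause is more likely elsewhere.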

  • I can't find the discussion, but I recall an issue brought up before: the way that Java/LM does ping would not increment some packet ID number, which would eventually cause some routers/firewalls to block the packets, and restarting the Collector would clear it. There are some situations where I've had to restart a collector to fix a ping issue. But it has always been that ping stops working completely until the collector is restarted, not that it stops for a few minutes and then fixes itself.

    Is there anything special about the routing between the collector and a device that stops pinging? Does it happen even if you are pinging 127.0.0.1 (i.e. pinging the collector itself)? Are the affected devices all in the same region/network/subnet? Is there a special routing table or a 3rd-party router/firewall between them? It is perhaps worth running Wireshark or Network Watcher during the issue (if you can catch it live) to see what the traffic looks like. What do the LM wrapper/sbproxy logs show, if anything? etc.
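To test the loopback and OS-level ping questions above without relying on the collector software, a small independent probe can be left running on the collector host and its log compared with the loss windows. This is a rough sketch, assuming the OS `ping` utility is on the PATH and prints English output; the function names and the example target address are hypothetical.

```python
import platform
import re
import subprocess
import time
from datetime import datetime


def build_ping_command(host, count=10, windows=None):
    """Build the OS ping command: -n sets the count on Windows, -c elsewhere."""
    if windows is None:
        windows = platform.system() == "Windows"
    return ["ping", "-n" if windows else "-c", str(count), host]


def parse_loss_percent(output):
    """Extract the loss percentage from ping's summary line, or None."""
    match = re.search(r"(\d+(?:\.\d+)?)%\s*(?:packet )?loss", output)
    return float(match.group(1)) if match else None


def probe(host, count=10):
    """Run one batch of pings and return the reported loss percentage."""
    try:
        result = subprocess.run(
            build_ping_command(host, count),
            capture_output=True, text=True, timeout=60,
        )
    except subprocess.TimeoutExpired:
        return None
    return parse_loss_percent(result.stdout)


def run_probe(hosts, rounds, interval=60):
    """Ping every host once per interval and print a timestamped line."""
    for _ in range(rounds):
        for host in hosts:
            stamp = datetime.now().isoformat(timespec="seconds")
            print(stamp, host, probe(host))
        time.sleep(interval)


# Example call (not executed here): loopback plus one affected device.
# run_probe(["127.0.0.1", "10.0.0.4"], rounds=1440)
```

If this log shows 0% loss during a window in which LogicMonitor reports 100%, the problem sits in the collector's own ping handling rather than in Azure routing; if 127.0.0.1 also fails, the host's network stack is the more likely suspect.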