Forum Discussion

pperreault
6 years ago

collector fail count

One of our collectors appears to have connectivity issues. The common symptoms are that it loses communication with the LM cloud, and remote sessions to it or to monitored devices fail to complete. I've also noticed that the collector heartbeat fail datapoint keeps increasing over time; I've seen its value exceed 6000. Support hasn't been able to tell me what this value actually is beyond providing developer notes, which are unfortunately unhelpful. Can anyone provide some insight into what this failure count is actually counting? Has anyone seen and resolved this symptom?

We are planning on rebuilding the host server and recreating the collector.

  • I don't know anything about the heartbeat fail datapoint other than what the description says: "Number of failed attempts to execute the heartbeat task". What I've set up, though, is for all our collectors to ping LM (x.logicmonitor.com) and 8.8.8.8, and for each collector to ping all the other collectors. It helps us determine whether, for example, the internet is down vs. LM SaaS itself is down vs. the VPN is down vs. there are internal networking issues.

    It might even make sense to temporarily add the collector server as a resource a second time, but monitored by a different collector. But if you have the option to just rebuild the server and collector, that might be the simplest option.
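
For anyone who wants to try the ping-based triage described above outside of LM, here is a rough Python sketch. It is not how LogicMonitor does it internally; the function names and the result strings are made up for illustration, and the order in which failures are blamed is my own assumption about how to read the three kinds of checks (LM portal, 8.8.8.8, other collectors). It shells out to the system `ping`, so ICMP has to be allowed from the collector host.

```python
import platform
import subprocess


def ping(host: str, timeout_s: float = 3.0) -> bool:
    """Return True if a single ICMP echo to `host` succeeds (system ping)."""
    # Windows uses -n for the packet count; Linux and macOS use -c.
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timed out, or the ping binary is missing: treat as unreachable.
        return False
    return result.returncode == 0


def diagnose(lm_ok: bool, internet_ok: bool, peers_ok: dict) -> str:
    """Guess the failing layer from the three kinds of ping results.

    lm_ok       -- the LM portal answered
    internet_ok -- the public reference IP (8.8.8.8) answered
    peers_ok    -- {collector name: answered?} for the other collectors
    """
    down_peers = sorted(name for name, ok in peers_ok.items() if not ok)
    any_peer_up = any(peers_ok.values())

    if not lm_ok and not internet_ok:
        if peers_ok and not any_peer_up:
            return "local network problem: nothing is reachable"
        return "internet connection appears down"
    if not lm_ok:
        return "LM portal unreachable while internet is up"
    if down_peers:
        return "internal network or VPN problem toward: " + ", ".join(down_peers)
    if not internet_ok:
        return "LM reachable; reference IP not answering (ICMP may be filtered)"
    return "all checks passed"


def run_checks(lm_host: str, peers: dict) -> str:
    """Ping everything once and return the diagnosis string.

    `peers` maps a collector name to its address; both are placeholders
    you would fill in for your own environment.
    """
    peers_ok = {name: ping(addr) for name, addr in peers.items()}
    return diagnose(ping(lm_host), ping("8.8.8.8"), peers_ok)
```

You would call something like `run_checks("x.logicmonitor.com", {"collector-b": "10.0.0.12"})` on a schedule, with your own portal name in place of `x` and your own (here invented) collector addresses. Keeping `diagnose` separate from the actual pinging makes the decision logic easy to adjust if your network calls for a different reading of the results.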