Forum Discussion

Kelemvor
Expert
23 days ago

All my collectors are going down every 24 hours.

Hi,

Starting last Thursday, all my collectors, across two different LM portals, have been going down approximately every 24 hours. Apparently the Watchdog is telling the Agent to reset itself, which causes everything to die for 5-10 minutes.

It's happening on almost every collector we have. Some do it every day, some skip a day here and there. They ALL started doing this last Thursday. We didn't make any global changes and have no idea what the heck happened.

Is anyone else dealing with this?

Log entries like these show the Watchdog service telling the main Agent service to restart itself.

[2025-05-06 20:05:47.876 GMT] [MSG] [CRITICAL] [statusmonitor:::] [StatusListener$1.run:135] Peer request to shutdown, CONTEXT=CAUSE=shutdown cmd, ACTION=quit
[2025-05-06 20:05:47.876 GMT] [MSG] [CRITICAL] [statusmonitor:::] [StatusListener$1.run:151] Shutting self down by quit with 0, CONTEXT=MSG=all sockets closed, System.exit(0) now
[2025-05-06 20:05:47.882 GMT] [MSG] [INFO] [statusmonitor:::] [RestartUtil._reportEvent:236] Reported restart reason successfully, CONTEXT=type=ReceivedShutdown, reason=Collector receives shutdown command from watchdog. Agent will restart.
[2025-05-06 20:05:47.883 GMT] [MSG] [INFO] [statusmonitor:::] [RestartUtil._saveRestartReason:292] Save restart reason successfully, CONTEXT=file=C:\Program Files (x86)\LogicMonitor\Agent\conf\restart.conf

The tickets these restarts generated show that the time slips by a few minutes each day but otherwise runs almost like clockwork.

I opened a ticket via Chat but was told that something is overloading the agent and that we need to increase the collector sizes. That doesn't explain what happened last Thursday to start the problem, so I'm posting here to see if anyone else is having the same issue.

Thanks.

7 Replies

  • It is normal for the collector to restart itself every 24 hours. I also think it cycles its encryption key with the portal at the same time. But that should take just seconds, not long enough to cause any gaps or alerts. Having it take 5-10 minutes to finish the restart is very odd; I've never seen one do that before.

    I would suggest reviewing the agent logs on the collector from before Thursday and after, during the period when it normally restarts itself, to see if something changed. It's also worth checking the Event/system logs for that period to see if the OS is having problems restarting the service. I would also try manually restarting the LogicMonitor service to see if it takes minutes to come back up even when done by hand. Finally, I would check whether something else is running at the same time as the 24-hour restart. As you mentioned, the collector restarts itself ~24 hours after it was last started, so perhaps the restart has drifted into a period of high CPU load? Or perhaps a backup snapshot pauses the VM right at that moment?
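One way to do the before/after log comparison suggested above is to pull the watchdog-initiated shutdown lines out of the agent log and look at the gaps between them. The following is a minimal sketch, assuming the log lines keep the `[YYYY-MM-DD HH:MM:SS.mmm GMT]` prefix and the "Peer request to shutdown" text seen in the excerpt quoted earlier in this thread. The function names are made up for illustration, and the second sample line is a hypothetical entry one day later, not real log output.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the quoted agent log lines,
# e.g. "[2025-05-06 20:05:47.876 GMT]". Assumed to be stable across versions.
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) GMT\]")

# Text taken from the quoted log line that records the watchdog's request.
SHUTDOWN_MARKER = "Peer request to shutdown"


def shutdown_times(lines):
    """Return the timestamp of every line that records a watchdog shutdown request."""
    times = []
    for line in lines:
        if SHUTDOWN_MARKER not in line:
            continue
        match = TIMESTAMP.match(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f"))
    return times


def restart_intervals(times):
    """Return the gaps between consecutive shutdowns as timedelta objects."""
    ordered = sorted(times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


if __name__ == "__main__":
    sample = [
        # Real line from the thread.
        "[2025-05-06 20:05:47.876 GMT] [MSG] [CRITICAL] [statusmonitor:::] "
        "[StatusListener$1.run:135] Peer request to shutdown, "
        "CONTEXT=CAUSE=shutdown cmd, ACTION=quit",
        # Hypothetical line, added only to demonstrate the interval calculation.
        "[2025-05-07 20:08:47.876 GMT] [MSG] [CRITICAL] [statusmonitor:::] "
        "[StatusListener$1.run:135] Peer request to shutdown, "
        "CONTEXT=CAUSE=shutdown cmd, ACTION=quit",
    ]
    for gap in restart_intervals(shutdown_times(sample)):
        print(gap)
```

Feeding it the lines of a real agent log (opened with `open(path)`, where the path depends on your install) gives one interval per pair of restarts, which makes it easy to see whether the cadence changed around the day the alerts started.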

  • Kelemvor, did you ever get this resolved?

    Mike_Moniz's post is accurate. We do cycle the collector credential every 24 hours, but it should be quick.

    Have you installed or updated any new LogicModules?

    Have you checked the Health Check output on your collectors?

    Have you taken a look at the Collector DataSources to see if anything is overloaded? If support was alleging that, I assume they would have pointed to some indicator. It is true that an overloaded collector can cause long restart intervals, but it should be confirmable by looking at the above.

    • Kelemvor
      Expert

      We have not gotten this figured out. It all started on May 1. We were wondering if LM rolled out any changes on that day, since we started getting flooded with alerts from possibly every collector we have. We have an open ticket with Support but don't know what to do to figure out the problem. One collector had a higher BatchScript count, but it even happens on collectors that don't use any scripts at all. We're at a loss to explain it so far.

      • Mike_Rodrigues
        Product Manager

        My deployment calendar doesn't show any deployments on May 1. So unless the change coincided with a portal upgrade, that's likely not it.

        Any chance you updated modules and pulled in one that's running long or something?

        Did you check the output of the collector health check?

        Any chance you added Antivirus or something to the OS, outside of the collector?

  • This definitely appears to be related to the daily credential sync/restart thing that happens.  About half of our collectors work fine.  The other half have had at least one Collector Down alert over the last few weeks, when the service didn't start back up in time.

    Here are the Collector Events from one that fails almost every day. Note that it takes 13-14 minutes to get from the Restart entry to the Up entry.

    Here's one that works about half the time.

    So far no one has been able to find anything that might cause the delays.  They're all going to get rebooted on Monday for Patching so maybe that will resolve something...
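To compare collectors on the Restart-to-Up gap described above, the event timestamps can be paired up and the slow ones flagged. This is only a sketch: the `(timestamp, kind)` tuple format, the `"restart"`/`"up"` labels, and the 5-minute threshold are all assumptions for illustration, not the actual Collector Events export format or the actual Collector Down alert threshold.

```python
from datetime import datetime


def restart_durations(events):
    """Pair each 'restart' event with the next 'up' event; return gaps in minutes."""
    durations = []
    pending = None  # timestamp of a restart still waiting for its 'up' event
    for when, kind in sorted(events):
        if kind == "restart":
            pending = when
        elif kind == "up" and pending is not None:
            durations.append((when - pending).total_seconds() / 60)
            pending = None
    return durations


def slow_restarts(durations, threshold_minutes=5.0):
    """Return only the durations longer than the (assumed) threshold."""
    return [d for d in durations if d > threshold_minutes]


if __name__ == "__main__":
    # Hypothetical events: one slow restart and one healthy one.
    events = [
        (datetime(2025, 5, 6, 20, 5, 0), "restart"),
        (datetime(2025, 5, 6, 20, 19, 0), "up"),
        (datetime(2025, 5, 7, 20, 5, 0), "restart"),
        (datetime(2025, 5, 7, 20, 5, 30), "up"),
    ]
    durations = restart_durations(events)
    print(durations)
    print(slow_restarts(durations))
```

Running this per collector would show whether the failing half is consistently slow or only slow on certain days, which might help line the delays up with patching, backups, or other scheduled activity.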

  • What collector version are you running, and did a collector version upgrade coincide with this issue?

  • Just a few random thoughts..

    Do the agent start/stop times coincide with the Application log events? Are there patches pending installation or a reboot? That can kill machine performance. If you manually spin the LM service, does it take as long to reconnect? What happens in wrapper.log after a service spin? Is your firewall blocking one of the LM endpoints? That was a fun one to solve. Is the FW set up as company.lm.com or just lm.com?
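For the firewall question, a quick check from the collector host is to time a plain TCP connection to the portal endpoint. The sketch below uses only the Python standard library; the function name is made up, and port 443 is an assumption about how the collector reaches the portal. It only tests raw TCP reachability, so it will not catch problems with TLS inspection or a proxy.

```python
import socket
import time


def connect_time(host, port=443, timeout=5.0):
    """Return the seconds taken to open a TCP connection, or None if it fails."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return None
```

Calling `connect_time("your-portal-hostname")` with your own portal hostname, once during normal operation and once right after a restart, would show whether the endpoint is slow or unreachable at the time the collector is trying to reconnect.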