Forum Discussion

Eric_Singer
7 years ago

FYI: LM can trigger ESXi 6.5 hostd to crash

Hi,

I just finished working with VMware support on an issue where the hostd process on our ESXi 6.5 hosts would crash during a booting phase. We eventually traced it back to a bug in some vSAN code that LM monitoring polls. It doesn't matter whether you're running vSAN in your environment or not. Our workaround has been to disable host-level monitoring in LM for our ESXi hosts for now, and it's been stable ever since.

The expected fix is scheduled for release in Q3 2018 from VMware.

12 Replies

  • @Eric Singer - Any chance VMware provided you with a KB that documents this as a known issue / bug? I'd like to provide as much context as possible to our ESX admins.

    Thanks!

  • No KB that I'm aware of. Their RCA was:

    Good Morning!

    Here is the root cause our Engineering has identified,
    Looking at the threads in hostd, we see that there are lots of threads blocked on the lock of the host managed object.
    11 threads (threads 12, 14, 15, 16, 17, 18, 19, 20, 21, 26, 27) were blocked trying to read-lock the host.

    The thread that holds the read lock is thread 2. It is blocked in some vSAN code.

    Some code in the GetRuntime() property performed RPC operations and blocked waiting on a condition variable, which caused a deadlock.
    Whether it actually deadlocks depends on how the event that the vSAN stub was waiting for gets generated: if it comes from an I/O thread, the blocked thread is eventually unblocked; if it needs a worker thread, the result is a deadlock by thread starvation.

    Because the root cause of the bug is a piece of vSAN code that causes a deadlock, our Engineering is working with the vSAN team to get insight into the respective property.