Hi, I just finished working with VMware support on an issue where the hostd process on our ESXi 6.5 hosts would crash during boot. We eventually traced it back to a bug in some vSAN code that LM monitoring polls. It doesn't matter whether you're running vSAN in your environment or not. Our workaround has been to disable host-level monitoring in LM for our ESXi hosts for now, and they have been stable ever since. VMware expects to release a fix in Q3 2018.


FYI: LM can trigger ESXi 6.5 hostd to crash

12 Replies


Mike_Rodrigues
Product Manager
8 years ago
Hey @Eric Singer, thanks for bringing this to our attention. We've got our Collector Team looking into how to mitigate this now.

We're also working to identify customers monitoring ESXi 6.5 so we can notify them proactively.

I will update this thread as we learn more.
Ryan_B
8 years ago
Does this only affect ESXi hosts added directly, or also ESXi hosts monitored underneath a vCenter added to LogicMonitor?
Eric_Singer
8 years ago

On 5/21/2018 at 1:13 PM, Ryan B said:

Does this only affect ESXi hosts added directly, or also ESXi hosts monitored underneath a vCenter added to LogicMonitor?

Only hosts added directly.
PatrickATL
8 years ago
Do you mind posting a screenshot or a list of the DataSources you have applied to your hosts? Also, what build of 6.5 are you running?
PatrickATL
8 years ago
@Michael Rodrigues, we monitor a few 6.5 hosts and have been fortunate enough not to see this issue yet. Ping me if it would help to compare them.
Mike_Rodrigues
Product Manager
8 years ago
@Eric Singer, @Ryan B, @PatrickATL,

I've got an update on this. I appreciate your patience.

First off, I want to note that neither the collector nor any DataSources are explicitly calling vSAN methods, whether you have it installed and enabled or not. We ship the vSAN SDK with newer collector versions, but there aren't any core DataSources using it just yet.

The vSAN queries that brought down @Eric Singer's device appear to be triggered on the server when we call methods to get hardware and version information about a given host. This is based on seeing the same behavior with both the official VMware SDK and the open-source YAVIJAVA SDK. This means we can't avoid making at least some inadvertent calls to vSAN unless VMware changes this, but we do have a mitigation route.

Specifically, these three things can trigger the calls:

Auto Properties identifying your ESX host and updating version info - This runs infrequently and doesn't generate many calls, so it is likely not an issue.

VMware_vSphere_HostPerformance's AD script - This is the biggest offender, and kicks off about 30 vSAN calls in our test environment. A fix is in the works, but it won't be backwards compatible with the current version, as the instance names will change. The fix currently triggers only 5 vSAN calls for each AD run when applied directly to ESX.

VMware_vSphere_HardwareSensors AD script - Only triggers once per call, likely not an issue

The effect is larger when the modules are applied to ESX directly. When those modules are applied to vCenter, some vSAN calls are still made on the host, but not as many (1-4).

Based on the great info we got from @Eric Singer and VMware, we're confident that the changes to VMware_vSphere_HostPerformance will sufficiently mitigate this issue.

We haven't yet been able to reproduce the crash in our lab by rebooting and forcing AD repeatedly.

@PatrickATL, I appreciate the offer. You might check /var/log/hostd.log for floods of calls to vSAN. Luckily the conditions for a crash seem fairly difficult to come by.
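
For anyone who wants to run that check, here is a minimal sketch of one way to look for floods in a copy of hostd.log. It rests on two assumptions that are not confirmed anywhere in this thread: that each log line begins with an ISO-8601 timestamp, and that the relevant entries contain the text "vsan" in some letter case. The function names and the threshold of 20 lines per minute are hypothetical, so adjust them to whatever your own log shows.

```python
import re
from collections import Counter

# Assumption: hostd.log lines begin with an ISO-8601 timestamp such as
# 2018-05-21T13:13:05.123Z. We bucket by minute (date, hour, minute).
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2})")


def vsan_lines_per_minute(lines, keyword="vsan"):
    """Count log lines mentioning `keyword` (case-insensitive), per minute.

    Lines without a leading timestamp are skipped.
    """
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.match(line)
        if match and keyword in line.lower():
            counts[match.group(1)] += 1
    return counts


def busiest_minutes(counts, threshold=20):
    """Return (minute, count) pairs at or above `threshold`, busiest first."""
    return [(minute, n) for minute, n in counts.most_common() if n >= threshold]


def report(path="/var/log/hostd.log", threshold=20):
    """Print the minutes in the file at `path` that look like a flood."""
    with open(path, errors="replace") as handle:
        counts = vsan_lines_per_minute(handle)
    for minute, n in busiest_minutes(counts, threshold):
        print("{}  {} vSAN-related lines".format(minute, n))
```

Calling `report("hostd.log")` on a copy of the log prints one line per busy minute. A burst that lines up with Active Discovery runs or a reboot would be the pattern worth comparing against what is described above.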

I will update this ticket when the fix is released and make sure our Customer Success team gets this info to the other 6.5 users. In the meantime, you should consider disabling VMware_vSphere_HostPerformance on ESX hosts you expect to reboot; you can still safely monitor them through vCenter. Expect a fix early next week.

Thanks again for your help on this. Please reach out if you have additional questions or concerns.
Mike_Rodrigues
Product Manager
8 years ago
We've released an updated version of VMware_vSphere_HostPerformance. It breaks backwards compatibility with the version 1 series. It also only applies to vCenter by default, to further mitigate vSAN calls triggered directly on the host. When applied directly, AD now triggers 5 vSAN calls as opposed to 30.

If you want to keep the historical data before upgrading, you can rename version 1.x of the DataSource and then disable it.

The locator code to get version 2 is 99EKKN
Michael_Ouimet
8 years ago

On 6/5/2018 at 6:36 PM, Michael Rodrigues said:

We've released an updated version of VMware_vSphere_HostPerformance. It breaks backwards compatibility with the version 1 series. It also only applies to vCenter by default, to further mitigate vSAN calls triggered directly on the host. When applied directly, AD now triggers 5 vSAN calls as opposed to 30.

If you want to keep the historical data before upgrading, you can rename version 1.x of the DataSource and then disable it.

The locator code to get version 2 is 99EKKN

Michael_Ouimet
8 years ago
How do we disable version 1 of the DataSource?
Brandon
Neophyte
8 years ago
DO NOT comment out the Applies To field on the DataSource! This will remove all historical data, which I can only imagine most of us want to keep. You can disable the DataSource by creating a device group (if you don't have one already) and populating it with all of the ESX hosts. Then, at the group level, select the Alert Tuning tab and uncheck the box next to the DataSource. This disables polling and alerting, but allows you to keep historical data.
