@Eric Singer, @Ryan B, @PatrickATL,
I've got an update on this, and I appreciate your patience.
First off, I want to note that neither the collector nor any DataSources are explicitly calling vSAN methods, whether you have it installed and enabled or not. We ship the vSAN SDK with newer collector versions, but there aren't any core DataSources using it just yet.
The vSAN queries that brought down @Eric Singer's device appear to be triggered on the server when we call methods to get hardware and version information about a given host. This is based on seeing the same behavior with both the official VMware SDK and the open-source YAVIJAVA SDK. This means we can't avoid making at least some inadvertent calls to vSAN unless VMware changes this, but we do have a mitigation route.
Specifically, these three things can trigger the calls:
- Auto Properties identifying your ESX host and updating version info (this runs infrequently and doesn't generate many calls, so it's likely not an issue)
- VMware_vSphere_HostPerformance's AD script - This is the biggest offender, and kicks off about 30 vSAN calls in our test environment. A fix is in the works, but it won't be backwards compatible with the current version because the instance names will change. With the fix, each AD run currently triggers only 5 vSAN calls when applied directly to ESX.
- VMware_vSphere_HardwareSensors AD script - Only triggers once per call, so it's likely not an issue
The effect is larger when the modules are applied to ESX directly. When those modules are applied to vCenter, some vSAN calls are still made on the host, but not as many (1-4).
Based on the great info we got from @Eric Singer and VMware, we're confident that the changes to VMware_vSphere_HostPerformance will sufficiently mitigate this issue.
We haven't yet been able to reproduce the crash in our lab by rebooting and forcing AD repeatedly.
@PatrickATL, I appreciate the offer. You might check /var/log/hostd.log for floods of calls to vSAN. Luckily the conditions for a crash seem fairly difficult to come by.
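If you'd rather not eyeball the log, here is a rough Python sketch that counts vSAN-related lines per minute. It rests on a few assumptions of mine: that each hostd.log entry starts with an ISO-8601 timestamp (for example 2017-05-10T12:34:56.789Z), that the relevant entries contain the string "vsan" in some casing, and that 20 lines per minute is worth a look. That threshold is an arbitrary placeholder, not a documented limit, and the function names are just for illustration.

```python
import re
from collections import Counter

# Assumption: log entries begin with an ISO-8601 timestamp; we keep only
# the part up to the minute (YYYY-MM-DDTHH:MM) to use as a bucket key.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2})")


def vsan_calls_per_minute(lines):
    """Count lines mentioning vSAN, grouped by the minute they were logged."""
    counts = Counter()
    for line in lines:
        # Assumption: vSAN-related entries contain "vsan" in some casing.
        if "vsan" not in line.lower():
            continue
        match = TIMESTAMP.match(line)
        if match:  # skip continuation lines that carry no timestamp
            counts[match.group(1)] += 1
    return counts


def busiest_minutes(counts, threshold=20):
    """Return (minute, count) pairs at or above the threshold, oldest first."""
    return sorted((minute, n) for minute, n in counts.items() if n >= threshold)


def scan_log(path, threshold=20):
    """Read a log file and report the minutes with the most vSAN lines."""
    with open(path, errors="replace") as handle:
        return busiest_minutes(vsan_calls_per_minute(handle), threshold)
```

Calling scan_log on a copy of /var/log/hostd.log returns the minutes with a burst of vSAN lines, which you can line up against your reboot or AD times. Adjust the threshold to whatever looks abnormal for your environment.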
I will update this ticket when the fix is released and make sure our Customer Success team gets this info to the other 6.5 users. In the meantime, you should consider disabling VMware_vSphere_HostPerformance on ESX hosts you expect to reboot; you can still safely monitor them through vCenter. Expect a fix early next week.
Thanks again for your help on this. Please reach out if you have additional questions or concerns.