Many environments running the VMware hypervisor boot their servers from SD cards with vSphere 7. It is a great way to install the OS without using up space on the hard drives or SSDs, which you can then dedicate to VMFS or vSAN datastores. The same is true of a USB device, which can act as your vSphere boot disk. vSphere on SD card was a reliable way to operate your hypervisors, as most enterprise servers offer a RAID 1 implementation for it.
We don’t often cover technical problems and bug fixes on the blog, but this one felt appropriate as it impacts a large number of customers and was potentially harmful to production environments.
In this blog, we will cover the issue and what to do to avoid it.
The vSphere SD Card Issue
What happened?
vSphere 7u2 started rolling out in March 2021 and brought with it a bug that took many customers by surprise.
The situation is explained in KB83376.
In a nutshell, those who are running vSphere on an SD card may encounter cases where the boot device suddenly becomes inaccessible. The host goes into a Not Responding state in the vSphere Client and shows alerts such as “Alert: /bootbank not to be found at path ‘/bootbank’”.
/bootbank is the location where the vSphere image is stored.
When it happens, the virtual machines keep running, but no operation can be performed because the hypervisor has crashed. The only way to fix it is to restart the host and all the VMs running on it, as you cannot migrate them. Note that by doing that, you are merely resetting the timebomb timer: the host will eventually crash again.
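If you suspect a host has hit this, a quick way to confirm from an SSH session is to check whether the /bootbank symlink still resolves and to search the vmkernel log for bootbank-related errors. This is a minimal sketch; the exact log wording varies between builds:

```shell
# Check whether /bootbank still points to a mounted volume
ls -l /bootbank

# Look for recent bootbank/boot device errors in the vmkernel log
grep -i "bootbank" /var/log/vmkernel.log | tail -n 20
```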
What is the root cause?
The root cause of the problem appears to be a race condition in the ESXi storage stack where some operations might never reach the device, in which case they are queued in the stack and eventually time out.
From what we understand, this is related to the new vSphere 7 partition layout of the boot disk, which aims at improving I/O among other things. As a result, the load generated by certain actions, such as VMware Tools operations, overwhelms low-performance SD/USB storage devices.
“ESXi host with vSphere installed on an SD card becomes unresponsive”
Note that this kind of issue isn’t new, and it is surprising that it wasn’t picked up during testing.
VMware describes the root cause as follows:
“Device disconnection happened on USB hardware on ESXi. This is seen on xHCI controller that, when commands fail, and USB bus reset happens after one retry it will lead to all USB device (including USB SD card) reconnection. When USB boot device is reconnected, ESXi host may not be able to release path resource and will consider a new device is plugged in and give a new path. So from ESXi host, it shows boot device is lost.”
Up until now, the only way to prevent it from happening again was to revert to vSphere 7 Update 1 or apply a workaround that reduces the load on the vSphere SD card by using esxcli to move the VMware Tools repository from the SD card to a RAM disk.
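For reference, the workaround documented in KB83376 boils down to enabling the ToolsRamdisk advanced option so that the VMware Tools repository is served from a RAM disk instead of the SD card. A minimal sketch, assuming SSH access to the host; a reboot is required for the change to take effect:

```shell
# Enable the ToolsRamdisk advanced option (moves the VMware Tools repository to a RAM disk)
esxcli system settings advanced set -o /UserVars/ToolsRamdisk -i 1

# Verify the new value
esxcli system settings advanced list -o /UserVars/ToolsRamdisk

# Reboot the host for the RAM disk to be created
reboot
```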
Changes in vSphere partition layout
Before moving on with the vSphere SD card issue, I wanted to quickly touch on the changes to the partition layout of the boot disk that occurred in vSphere 7.0. This change was made mainly to improve the flexibility of the boot disk partitions.
vSphere 6.x partitions
In vSphere 6.x, the number of partitions and their sizes were fixed, which could restrict the support for large modules and limit debugging opportunities. As you can tell from the diagram below, some partitions were created according to the size of the media.
“vSphere 6 system storage layout with fixed partition sizes and number”
vSphere 7.x partitions
In order to improve the flexibility of the boot disk partitions and increase performance, VMware has consolidated the layout into fewer dynamically sized partitions.
“Only high-performance media should be used as vSphere SD cards”
The biggest change is the new ESX-OSData partition, which is divided into two categories:
- RAM: frequently written data such as logs, traces, vSAN EPD, live databases…
- ROM: infrequently written data such as the VMware Tools ISO, configurations, core dumps…
This new ESX-OSData partition is formatted with VMFS-L (VMFS-Local) and is aimed at storing modules, system configuration/state, and system virtual machines.
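If you are curious what this layout looks like on one of your hosts, you can list the partition table of the boot disk from an SSH session. A minimal sketch; the device identifier below is an example and will differ on your system:

```shell
# Identify the boot volumes (look for the BOOTBANK and OSDATA labels)
esxcli storage filesystem list

# Print the partition table of the boot disk (device name is an example)
partedUtil getptbl /vmfs/devices/disks/mpx.vmhba32:C0:T0:L0
```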
Updating to vSphere 7u2c
It took VMware a few months to finally release the much-awaited patch for this issue, as no sensible IT professional enjoys applying operations labeled as “workarounds”.
You can find the fix in the release notes of vSphere 7 Update 2c.
“vSphere 7u2c includes a fix for the vSphere sd card issue”
- If you use VMware vSphere installed on an SD card and you have been impacted, we recommend that you update your ESXi hosts using Lifecycle Manager as soon as possible (a command-line alternative is sketched after this list).
- If you haven’t updated to vSphere 7 Update 2 yet and you are still running vSphere 7 Update 1, make sure that Lifecycle Manager is up to date and that you upgrade to vSphere 7u2c at a minimum.
- If you don’t want to update just yet, note that VMware should bring some changes to permanently fix this in Update 3:
“Starting with ESXi Update 3, a VMware Tools partition is automatically created on the RAM disk and you see warnings to prevent you create partitions other than the boot bank partitions on flash media devices.”
“The vSphere 7u2c update is dated 24/08/2021”
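If you prefer patching from the command line rather than through Lifecycle Manager, hosts with internet access can be updated directly against VMware’s online depot. A minimal sketch; the image profile name below corresponds to the 7.0 U2c build (18426014) referenced in the release notes:

```shell
# Enter maintenance mode first (migrate or power off the VMs beforehand)
esxcli system maintenanceMode set --enable true

# Apply the 7.0 U2c image profile from VMware's online depot
esxcli software profile update \
  -p ESXi-7.0U2c-18426014-standard \
  -d https://hostupdate.vmware.com/software/VUM/PRODUCTION/main/vmw-depot-index.xml

# Reboot to finish the update
reboot
```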
Boot Disk Recommendations
The change in partition layout was, in part, driven by the fact that the cost of SSDs has gone down significantly in the last 10 years. Because of this, many hardware vendors are shifting away from SD cards in favor of persistent storage such as SSD or NVMe devices, often around the 120 GB mark. If that is the case for you, then you have nothing to worry about, especially if you avoid consumer-grade drives.
However, if you are still going for vSphere on an SD card anyway, VMware highly recommends that you use high-endurance and high-performance cards.
Review the following resources to ensure you have a stable system:
- You can review the Storage Requirements for ESXi 7.0 in the VMware documentation.
Note that the official recommended install options for vSphere 7.0 are the following:
- A local disk of 138 GB or larger.
- A device that supports a minimum of 128 Terabytes Written (TBW).
- A device that delivers at least 100 MB/s of sequential write speed.
- A RAID 1 mirrored device is recommended.
In a nutshell, the takeaway of all this boils down to the following:
“For new installations, we strongly recommend using high performance and high endurance devices like M.2, SSDs etc. which is 32GB or greater”
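If you are unsure what a given host currently boots from, you can resolve the /bootbank symlink and match the volume against the device list. A minimal sketch from an SSH session:

```shell
# Resolve the volume that backs /bootbank
ls -l /bootbank

# Map the volume UUID to its backing device
esxcli storage filesystem list

# Inspect the device details (model, size, USB/SD attachment)
esxcli storage core device list
```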
What Does This Bug Mean for the Future of vSphere 7?
VMware described vSphere 7 as one of the biggest releases since the launch of the hypervisor, as it brought things like Tanzu and cloud integration that drew much of the public’s attention. However, it feels like each vSphere update fixes an issue but also brings a new one. That was the case with vSphere 7u2a, which fixed an upgrade bug (Failed to load crypto64.efi) but introduced the vSphere SD card problem.
While it is always surprising to find serious bugs like these in core areas that reached maturity many years ago, such as the boot device, VMware was fairly quick to provide a workaround and release a patch.
Moving forward, configuring a vSphere ESXi host to boot from an SD card shouldn’t be your first choice. If you don’t want to deal with issues like this one, it is recommended to install your hypervisors on SSD or NVMe devices.