vSAN all hosts down scenario


The worst-case scenario in a VMware vSAN cluster is all hosts down: a situation no sysadmin wants to find themselves in. Panic and frustration quickly follow. Despite all the safety features built into vSAN, it is designed to tolerate failures within its failure domains, not an entire vSAN cluster outage.

Scenario

A client, who shall remain unnamed, was in the process of setting up a vSphere Distributed Switch (VDS) on an existing vSAN cluster. While the cluster was in operation, they mistakenly selected the vSAN vmkernel adapters on all hosts for migration to the VDS. Deploying this change instantly took down the entire 4-node, 14 TB vSAN cluster. All VMs were down and the vSAN datastore showed as 0 KB. To add to the mix, the customer's vCenter Server Appliance (VCSA) was also hosted on the vSAN, so it went down as well, which made it even more difficult to view the overall health of the environment.

  • vSphere 6.5 environment
  • vSAN total failure, non-stretched, single host failure domains
  • All vSAN VMs down including vCenter VCSA
  • 4-node cluster vSAN
  • Hybrid disk groups (1 flash, 2 HDD per host)
  • NumberOfFailuresToTolerate=1

Disaster Recovery

This was a total failure of the cluster network, resulting in a complete network partition of vSAN in which each host resides in its own partition. To each isolated host, it looks as though all the other hosts have failed. Since no quorum can be achieved for any object, no rebuilding takes place. Once the network issue is resolved, vSAN will try to re-establish the cluster and components will start to resync. Components are synchronized against the latest, most up-to-date copy of each component.
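The partition can be confirmed from each host's shell. A minimal sketch using the standard esxcli vsan namespace (a fully partitioned host typically reports itself as MASTER and only sees itself):

```shell
# Check this host's view of vSAN cluster membership; a partitioned
# host usually shows "Sub-Cluster Member Count: 1".
esxcli vsan cluster get

# List which vmkernel interface is (or is no longer) tagged for
# vSAN traffic on this host.
esxcli vsan network list

# Verify the vSAN vmkernel port still exists with the expected IP.
esxcli network ip interface ipv4 get
```

Running these on every host quickly shows whether you are dealing with one partition per host, as in this scenario.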

High Level Steps:

  1. Resolve the network issues on the ESXi hosts
  2. Get the hosts to un-partition and rejoin the vSAN cluster
  3. Wait for vSAN to start rebuilding the failed components
  4. Start up VMs
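The resync in step 3 can be watched from any host once the cluster re-forms. On vSAN 6.6 (shipped with vSphere 6.5.0d) and later, the esxcli debug namespace exposes this directly; the sub-commands below assume at least that version:

```shell
# Summary of objects currently resynchronizing (vSAN 6.6+).
esxcli vsan debug resync summary get

# Per-object resync detail, including bytes left to sync.
esxcli vsan debug resync list

# Overall object health; wait for objects to report healthy
# before powering VMs back on.
esxcli vsan debug object health summary get
```

On builds without the debug namespace, the Ruby vSphere Console (RVC) is the usual fallback for watching resync progress.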

If you find yourself in this scenario my first recommendation is to take a breath, slow down, and remain calm. It’s going to take 4+ hours to recover from this outage. The data should not be lost.

VM Deadlock & ESXi host

All VMs running on the vSAN cluster remained powered on but were unable to read or write to their disks. The ESXi hosts also became overburdened with I/O errors, causing them to essentially lock up; commands in an SSH/shell session took many seconds or even minutes to return a response. Eventually we force-rebooted some of the hosts. This allowed us to troubleshoot on the hosts without massive delays, and it also caused each host to reinitialize vSAN and clear its network partition. As the hosts rebooted, we fixed the network configuration and rejoined them to the vSAN cluster using esxcli vsan commands.
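A sketch of the re-tag-and-rejoin sequence we used; the interface name vmk1 and the cluster UUID below are placeholders for this environment:

```shell
# Re-enable vSAN traffic on the vmkernel interface that lost its
# tag during the botched VDS migration (vmk1 is a placeholder).
esxcli vsan network ipv4 add -i vmk1

# On a host that is already a member, read the vSAN sub-cluster UUID.
esxcli vsan cluster get

# On a host stuck in its own partition, rejoin using that UUID
# (the UUID shown here is a placeholder).
esxcli vsan cluster join -u 52xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

# Confirm the Sub-Cluster Member Count now reflects all 4 hosts.
esxcli vsan cluster get
```

Once the member count is correct on every host, vSAN begins resyncing components on its own; no manual rebuild command is needed.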

Dead vCenter

In this particular scenario the vCenter Server was also impacted by the vSAN outage. While vSAN does not rely on vCenter Server for normal operation, vCenter is required for configuration and management of the vSAN cluster and its storage policies. Without vCenter, it is very difficult to diagnose and view the health of the cluster; the esxcli vsan commands are essentially the only window in.
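A few of the per-host checks we leaned on while vCenter was down (all standard esxcli commands available on vSphere 6.5):

```shell
# Disk groups and capacity devices this host has claimed for vSAN;
# verify no disks dropped out of a disk group during the outage.
esxcli vsan storage list

# This host's view of cluster membership and its role
# (MASTER / BACKUP / AGENT).
esxcli vsan cluster get

# Check whether the vSAN datastore is mounted from this host's
# perspective.
esxcli storage filesystem list | grep -i vsan
```

Running the same checks across all four hosts gives a rough cluster-wide health picture without any vCenter involvement.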

We needed to deploy a new VCSA appliance to temporarily move the failed vSAN hosts into, since the old vSAN datastore was still completely inaccessible. This allowed us to reconfigure the vSAN cluster properly. The new vCenter was able to recover the vSAN configuration from the hosts, and the VMs were re-registered and brought back online. It took many hours for the data to rebuild, and the IOPS performance of the cluster suffered greatly in the meantime.
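Re-registering recovered VMs can also be done straight from an ESXi host once the datastore is healthy again. A sketch, assuming the default datastore name vsanDatastore and a hypothetical VM directory:

```shell
# VM directories on vSAN appear under friendly names once the
# datastore is accessible again.
ls /vmfs/volumes/vsanDatastore/

# Register a recovered VM directly on the host (the path is a
# placeholder); vim-cmd prints the new VM ID on success.
vim-cmd solo/registervm /vmfs/volumes/vsanDatastore/myvm/myvm.vmx

# Power it on using the returned VM ID.
vim-cmd vmsvc/power.on <vmid>
```

This is useful when the VCSA itself is one of the VMs you are trying to bring back, as it was here.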

Conclusion

vSAN certainly has many positives and the product continues to mature. This post is not meant to bash vSAN but to raise awareness of proper deployment configurations and the caveats of using vSAN. Careful consideration is required for all vSAN deployments regardless of size. Ensure your failure domains are set up correctly. Ensure you know what will happen when you exceed your FTT (NumberOfFailuresToTolerate). Have a plan! Have a backup! Test it.

We were able to recover 100% of the VMs and data after about 10 hours of patience, troubleshooting, and repairs. I had heard horror stories of vSAN data loss, and at certain points even I was starting to question whether we could actually recover this vSAN cluster. I was pleasantly surprised that the customer had no data loss, and this gives me new respect for vSAN's resiliency in such a failure scenario.

The customer has since added more change-control measures and no longer hosts their vCenter on the vSAN cluster. 🙂

Karl Nyen has been involved in the virtualization, server, web development and web hosting industry for over 10 years. In his current role at a managed service provider, he is focused on cloud-based solutions for enterprise clients. His diverse background in sales, management, and architectural/technical work brings a unique perspective to the virtualization practice.


1 Comment

  1. Have to agree that changes to production environments, as well as incorrect changes can bring any environment down.

    These were not particular to vSAN though.

    Good to see that they were able to recover their data.

    Cheers,
    Jase