vSAN all hosts down scenario

 

The worst-case scenario in a VMware vSAN cluster is all hosts down, a situation no sysadmin wants to find themselves in. Panic and frustration quickly follow. For all the safety features built into vSAN, it is designed to tolerate failures within its failure domains, not an outage of the entire vSAN cluster.

Scenario

An unnamed client was setting up a VDS (vSphere Distributed Switch) on an existing vSAN cluster and mistakenly selected the vSAN VMkernel adapters on all hosts for migration to the VDS while the cluster was in operation. Deploying this change instantly took down the entire 4-node, 14 TB vSAN cluster: all VMs were down and the vSAN datastore showed 0 KB. To make matters worse, the customer's vCenter Server Appliance (VCSA) was also down because it was hosted on the same vSAN datastore, which made it even harder to view the overall health of the environment.

  • vSphere 6.5 environment
  • Total vSAN failure; non-stretched cluster with single-host failure domains
  • All VMs on vSAN down, including the vCenter Server Appliance (VCSA)
  • 4-node vSAN cluster
  • Hybrid disk groups (1 flash, 2 HDD per host)
  • NumberOfFailuresToTolerate=1
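
To make the NumberOfFailuresToTolerate=1 setting concrete, here is a minimal Python sketch of the usual mirroring (RAID-1) arithmetic: tolerating n failures takes n+1 replicas and at least 2n+1 hosts. The function name `mirror_layout` is hypothetical, and the witness count assumes the simplest possible layout; vSAN may place components and votes differently in practice.

```python
def mirror_layout(ftt):
    """Simplified RAID-1 layout for a given failures-to-tolerate value.

    Assumption: one witness per tolerated failure, which is the simplest
    layout and not necessarily what vSAN places on a real cluster.
    """
    if ftt < 0:
        raise ValueError("ftt must be zero or greater")
    return {
        "replicas": ftt + 1,       # full copies of the data
        "witnesses": ftt,          # tie-breaker components (simplified)
        "min_hosts": 2 * ftt + 1,  # hosts needed to place all components
    }


layout = mirror_layout(1)
spare_hosts = 4 - layout["min_hosts"]  # the cluster in this scenario has 4 hosts

print(layout)       # {'replicas': 2, 'witnesses': 1, 'min_hosts': 3}
print(spare_hosts)  # 1
```

Under these assumptions the 4-node cluster has one host beyond the minimum, which leaves room to rebuild after a single host failure but does nothing for a failure that hits every host at once.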

Disaster Recovery

This is a total failure of the cluster's vSAN network. It results in a complete network partition of vSAN, with each host residing in its own partition. To each isolated host, it looks as if all the other hosts have failed. Since no quorum can be achieved for any object, no rebuilding takes place. Once the network issue is resolved, vSAN will try to re-form the cluster and components will start to resync. Components are synchronized against the latest, most up-to-date copy of each component.
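
The quorum reasoning can be illustrated with a small Python model. This is a sketch under simplifying assumptions, not vSAN's actual logic: each component carries exactly one vote, an object is accessible in a partition only if that partition holds more than half of the votes, and the host names (`esx1` to `esx4`) and function names are hypothetical.

```python
def has_quorum(votes_in_partition, total_votes):
    """A partition has quorum only with strictly more than half of the votes."""
    return 2 * votes_in_partition > total_votes


def accessible_partitions(placement, partitions):
    """Return, per partition, whether the object is accessible there.

    placement:  dict mapping component name -> host holding it
    partitions: list of sets of hosts that can still talk to each other
    """
    total_votes = len(placement)  # assumption: one vote per component
    result = []
    for partition in partitions:
        votes = sum(1 for host in placement.values() if host in partition)
        result.append(has_quorum(votes, total_votes))
    return result


# One object with FTT=1: two replicas plus a witness on three different hosts.
placement = {"replica-a": "esx1", "replica-b": "esx2", "witness": "esx3"}

healthy = [{"esx1", "esx2", "esx3", "esx4"}]
one_host_lost = [{"esx1", "esx3", "esx4"}]
fully_partitioned = [{"esx1"}, {"esx2"}, {"esx3"}, {"esx4"}]

print(accessible_partitions(placement, healthy))            # [True]
print(accessible_partitions(placement, one_host_lost))      # [True]
print(accessible_partitions(placement, fully_partitioned))  # [False, False, False, False]
```

In the fully partitioned case no host sees more than one of the three votes, so no partition can serve the object or start a rebuild, which matches the behavior described above.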
