Disaster strikes as NAS3 crashes

This past weekend we had a power brownout for about 4 hours. This caused my servers to fail over to battery power, and the batteries don’t last long with servers running. Something apparently went sour with the automatic shutdown of my NAS3, which is used only for my VMware virtual machines, and it shut down improperly. The RAID array has crashed.

I have no one to blame but myself, and I knew this day would eventually come. NAS3 was in RAID-0, which means striping with no redundancy, so a failed array typically means total data loss. I take nightly backups of the entire NAS, so I was aware of and prepared for the risk of striping. That doesn’t make recovering from it any more fun.

Adding redundancy for blackouts

One of the hardest things to recover from in my current home-lab environment is a total power blackout. Everything is planned and designed around losing individual components: one disk, one switch or network cable, and so on. However, when everything is off and I need to bring it all back online, it’s a painstaking and very manual process. Over time my environment has also become more and more complex. This latest outage has me scratching my head over how to recover faster and more simply from a power blackout.

Continue reading…

vSAN all hosts down scenario

The worst-case scenario in a VMware vSAN cluster is all hosts down: a situation no sysadmin wants to find themselves in, and panic and frustration quickly follow. Despite all the safety features built into vSAN, it is designed to tolerate failures within its failure domains, not an entire vSAN cluster outage.

Scenario

An unnamed client was in the process of setting up a VDS on an existing vSAN cluster and mistakenly selected the vSAN VMkernel adapters on all hosts for migration to the VDS while the cluster was in operation. Deploying this change instantly took down the entire 4-node, 14 TB vSAN cluster: all VMs down, vSAN datastore showing as 0 KB. To add to the mix, the customer’s vCenter VCSA was down as well because it was hosted on the same vSAN datastore, which made it even more difficult to view the overall health of the environment.

  • vSphere 6.5 environment
  • vSAN total failure, non-stretched, single host failure domains
  • All vSAN VMs down including vCenter VCSA
  • 4-node vSAN cluster
  • Hybrid disk groups (1 flash, 2 HDD per host)
  • NumberOfFailuresToTolerate=1

Disaster Recovery

This is a total cluster network failure, which results in a complete network partition of vSAN where each host resides in its own partition. To each isolated host, it looks as though all the other hosts have failed. Since no quorum can be achieved for any object, no rebuilding takes place. Once the network issue is resolved, vSAN will try to establish a new cluster and components will start to resync, each synchronized against the latest, most up-to-date copy.
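
Before repairing anything, it helps to confirm the partition from each host’s own point of view. A minimal set of checks over SSH might look like the following; the vmk2 interface and the target IP are placeholders for your environment, not values taken from this incident:

# Which VMkernel interfaces are tagged for vSAN traffic on this host?
esxcli vsan network list

# Cluster membership as seen from this host. During a full network
# partition each host typically reports a Sub-Cluster Member Count of 1.
esxcli vsan cluster get

# Test vSAN VMkernel connectivity to a neighbouring host.
vmkping -I vmk2 192.168.50.12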

Continue reading…

Migration from Cisco 1000v to VMware Virtual Distributed Switch (Part 2)

This is Part 2 of a series. Click here to see Part 1. I apologise for taking so long to get Part 2 posted. Sometimes I just don’t have the time or energy I would like to devote to the blog.

This portion of the guide focuses on the second half of the VSS to VDS migration. We needed to move the VMs to a VSS so that both the VMs and the hosts could be migrated to the new vCenter cleanly. Now we will be moving the VMs from their VSS configuration back to a VDS.
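
As a quick sanity check during each step (not part of the original guide), you can confirm from the ESXi shell which switches and port groups a host and its VMs are actually using:

# Standard vSwitches, their port groups and uplinks on this host
esxcfg-vswitch -l

# Networks each running VM is attached to
esxcli network vm list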

Keep in mind this migration is being done LIVE with production virtual machines running on the hosts. Obviously, it must be executed carefully or you will have a lot of explaining to do. Do not make these changes without understanding the full impact on your environment.

Continue reading…

vExpert 2016

I am very honoured to be selected as a vExpert 2016 by VMware. Getting recognition is awesome, but knowing that you are sharing content for the benefit of others is even better.

The annual VMware vExpert title is given to individuals who have significantly contributed to the community of VMware users over the past year. The title is awarded to individuals (not employers) for their commitment to sharing their knowledge and passion for VMware technology above and beyond their job requirements.

vExpert benefits and activities include:

  • vExpert certificate
  • Permission to use the vExpert logo on cards, website, etc. for one year
  • Access to a private directory for networking, etc.
  • Exclusive gifts from various VMware partners
  • Access to private betas (subject to admission by beta teams)
  • 365-day eval licenses for most products
  • Private pre-launch briefings
  • Private briefings from tier 1 alliance partners
  • Blogger early access program for vSphere and some other products
  • Featured in a public vExpert online directory
  • Access to vetted VMware & Virtualization content for your social channels.

My thanks go out to the other vExperts and to the VMware social media & community team for their hard work and dedication.

The full list of the 2016 vExperts can be found here.

Automatically reboot an ESXi host after PSOD

Anyone who has worked in a VMware environment for any length of time should be quite familiar with the purple diagnostic screen, or what we like to call the “purple screen of death”. Even VMware themselves internally reference this setting as “BlueScreenTimeout”, so there’s no mistaking where the name was fathered. This PSOD screen is what appears when the ESXi host crashes into an unresponsive state.

Note: The default and VMware-recommended setting is to leave the host unresponsive with the purple diagnostic screen displayed on the console to aid in troubleshooting.

There are some exceptions to VMware’s recommendation, mainly for environments or situations where we simply don’t care why the host had a PSOD; we just need it rebooted, back online and working as soon as possible. Especially if you are using remote syslog on the ESXi host (which you should), the PSOD screen is of trivial importance and just forces manual intervention to reboot the host from iLO/IPMI.

If it’s appropriate for your environment, let’s set an ESXi host to automatically reboot after 120 seconds at the PSOD screen. There are three ways to do this: via SSH, or using the “Advanced Settings” window in either the vSphere Client or the vSphere Web Client.

Using SSH:

  1. Connect to the ESXi host via SSH
  2. Run command:
    • esxcfg-advcfg -s 120 /Misc/BlueScreenTimeout

The value is the number of seconds before the reboot occurs; change it as desired.
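
To confirm the change took effect, the same tool can read the value back:

esxcfg-advcfg -g /Misc/BlueScreenTimeout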

Using vSphere Client:

  1. Select the host you wish to configure
  2. Go to the Configuration tab, select Advanced Settings
  3. From the Advanced Settings window select “Misc”.
  4. Find the “Misc.BlueScreenTimeout” value.
  5. Enter desired auto reboot time, in seconds.
  6. Click OK to save, and rinse and repeat for other hosts.

Using vSphere Web Client (5.x+):

  1. Select the host you wish to configure
  2. Select the Manage Tab. Select “Advanced System Settings”.
  3. Scroll down (or use the filter) to find “Misc.BlueScreenTimeout”.
  4. Click the Edit button. Enter the timeout value, in seconds.
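
On hosts where you prefer esxcli (or want to script this across many hosts), the same advanced option can be set and verified from the shell; this is an equivalent alternative, not an additional requirement:

# Set the timeout to 120 seconds
esxcli system settings advanced set -o /Misc/BlueScreenTimeout -i 120

# Verify the current value
esxcli system settings advanced list -o /Misc/BlueScreenTimeout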

Source: http://kb.vmware.com/kb/2042500

Installing open-vm-tools on CentOS

Just a quick post today regarding installing open-vm-tools on CentOS. There’s no need to download and install a separate epel-release RPM anymore, as it’s now in the CentOS extras repo directly.

To set up the EPEL repo, just use this command, then install open-vm-tools:

yum -y --enablerepo=extras install epel-release

– extras is enabled by default, but --enablerepo caters for those who have disabled it.
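
With the EPEL repo available, installing the tools is one more yum command; on a systemd-based release such as CentOS 7 you can then enable the open-vm-tools daemon (service name vmtoolsd, assuming the package defaults):

yum -y install open-vm-tools

# CentOS 7: start the tools daemon now and on every boot
systemctl enable vmtoolsd
systemctl start vmtoolsd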

Enjoy!