Disaster strikes as NAS3 crashes

This past weekend we had a power brownout for about 4 hours. This caused my servers to fail over to battery power. The batteries don’t last long with servers running. I guess something went sour with the automatic shutdown of my NAS3, which is used only for my VMware virtual machines, and it shut down improperly. The RAID has crashed.

I don’t have anyone to blame other than myself, and I knew eventually this day would come. NAS3 was in RAID-0, which means striping with no redundancy, so a failed array typically means total data loss. I take backups of this entire NAS nightly, so I am aware of and prepared for the risk of using striping. That does not mean it’s a fun time recovering from it.

Adding additional redundancy for blackouts

Currently, one of the hardest things to recover from in my home-lab environment is a total power blackout. Everything right now is planned & designed around losing certain components like 1 disk, 1 switch/network cable, etc. However, when everything is off and I need to bring things back online, it’s a painstaking and very manual process. Over time my environment has also become more and more complex. This latest outage has me scratching my head over how to recover faster & more simply from a power blackout.

Continue reading…

Home Lab Updates: AC Unit, Failed Drive on NAS1


I’ve been meaning to make a post about all the recent changes to my home lab, but I’ve been quite busy. I’ve also done some more work on the backend of the website to help speed things up, and I’m slowly working on a new design for vSkilled as well.

The biggest update I have right now is that I’ve finally ordered a portable air conditioning unit for my home lab. It’s starting to get warmer again since summer is around the corner, and I don’t want the house to be ridiculously warm. I ordered the Honeywell 12,000 BTU MN12CES. Once I have the unit installed I’ll try to put up another post with a write-up and pics!

Continue reading…

vSAN all hosts down scenario


The worst case scenario in a VMware vSAN cluster is all hosts down, a situation no sysadmin wants to find themselves in. Panic & frustration quickly follow. Despite all the safety features built into vSAN, it is designed to tolerate failures within its failure domains, not an entire vSAN cluster outage.

Scenario

An unnamed client was in the process of setting up a VDS on an existing vSAN cluster. They mistakenly selected the vSAN vmkernel adapters on all hosts for migration to the VDS while the cluster was in operation. Deploying this change instantly took down the entire 4-node, 14TB vSAN cluster: all VMs down, and the vSAN datastore showing as 0KB. To add to the mix, the customer’s vCenter VCSA was also down because it was hosted on the vSAN, which made it even more difficult to view the overall health of the environment.

  • vSphere 6.5 environment
  • vSAN total failure, non-stretched, single host failure domains
  • All vSAN VMs down including vCenter VCSA
  • 4-node vSAN cluster
  • Hybrid disk groups (1 flash, 2 HDD per host)
  • NumberOfFailuresToTolerate=1
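
Before touching vmkernel networking on a live vSAN cluster, it’s worth confirming exactly which vmknics are carrying vSAN traffic. Here is a minimal PowerCLI sketch of that check (the cluster and vCenter names are placeholders, and it assumes the VsanTrafficEnabled property that newer PowerCLI versions expose on vmkernel adapters):

```powershell
# List every vmkernel adapter in the cluster and flag the ones carrying vSAN traffic.
# "Production" and the vCenter name are placeholders for this environment.
Connect-VIServer vcenter.lab.local

Get-Cluster 'Production' | Get-VMHost | ForEach-Object {
    $vmhost = $_
    Get-VMHostNetworkAdapter -VMHost $vmhost -VMKernel |
        Select-Object @{N = 'Host'; E = { $vmhost.Name }},
                      Name, IP, PortGroupName, VsanTrafficEnabled
} | Format-Table -AutoSize
```

Anything that comes back with VsanTrafficEnabled set to true is an adapter you do not want caught up in a bulk VDS migration while the cluster is serving I/O.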

Disaster Recovery

This is a total cluster network failure. It results in a complete network partition of vSAN where each host resides in its own partition. To each isolated host, it looks like all the other hosts have failed. Since no quorum can be achieved for any object, no rebuilding takes place. Once the network issue is resolved, vSAN will try to establish a new cluster and components will start to resync. Components are synchronized against the most up-to-date copy of each component.
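
Since vCenter itself was down, the quickest way to see the partition state is to ask each host directly. A rough sketch of that check with PowerCLI and esxcli (host names and credentials are placeholders; it assumes direct root access to every ESXi host):

```powershell
# With vCenter down, query each ESXi host directly for its vSAN cluster membership.
# In a full partition every host reports a Sub-Cluster Member Count of 1.
# Host names and credentials are placeholders.
$vsanHosts = 'esx01.lab.local', 'esx02.lab.local', 'esx03.lab.local', 'esx04.lab.local'

foreach ($name in $vsanHosts) {
    $conn   = Connect-VIServer $name -User root -Password 'placeholder'
    $esxcli = Get-EsxCli -VMHost (Get-VMHost -Server $conn) -V2

    "--- $name ---"
    $esxcli.vsan.cluster.get.Invoke()    # membership / partition state
    $esxcli.vsan.network.list.Invoke()   # which vmknic vSAN expects to use

    Disconnect-VIServer $conn -Confirm:$false
}
```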

Continue reading…

Reducing Home Lab Power Usage

I have come to the conclusion that in 2017 I will need to scale down my home lab in order to reduce power & cooling usage. It has grown year over year, and unless I start making changes it’s only going to keep growing.

My plan is to beef up VMH02 with more RAM so that it can handle the full load of the VMs. Then I will have VMH01 powered off in stand-by mode. This way only one of the ESXi hosts is running at a time, but the other can still quickly spin up when needed using VMware power control with IPMI. This should reduce my power usage in the lab significantly, especially because both of my ESXi servers are dual CPU socket and they love to eat up power. Having only one of the servers running should make a huge difference. I have never used VMware power management before, so I am both curious and excited to make use of it. Continue reading…
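
From what I’ve read, the standby workflow can be driven from PowerCLI too. A rough sketch of what I have in mind (host and vCenter names are placeholders, and it assumes IPMI power management is already configured so vCenter can wake the host back up):

```powershell
# Put the idle host into standby once its VMs have been moved off,
# and wake it again later via the configured IPMI interface.
# Host and vCenter names are placeholders for my lab.
Connect-VIServer vcenter.lab.local

# Enter standby (the host should have no running VMs at this point)
Get-VMHost 'vmh01.lab.local' | Suspend-VMHost -Confirm:$false

# Later, when more capacity is needed: power the host back on
Get-VMHost 'vmh01.lab.local' | Start-VMHost -Confirm:$false
```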

Tuning Large Windows DHCP Servers

I’ve been involved in setting up some very large Windows DHCP deployments during my time working as a Consultant at Long View Systems. Along the way I’ve found some interesting challenges and caveats of using Windows DHCP, especially anytime you’re working with DHCP-enabled dynamic DNS updates. I wanted to have a quick post about this for my own reference, and hopefully it will come in handy for others as well.

  • DHCP Failover Scopes
  • Administration Overhead
  • DhcpLogFilesMaxSize
  • DynamicDNSQueueLength
  • DnsRegistrationMaxRetries
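
The last three items in that list are registry values under the DHCP Server service. As a rough sketch of how they can be set with PowerShell (the numbers shown are purely illustrative, not recommendations):

```powershell
# DHCP Server tuning values live under the DhcpServer service parameters key.
# The numbers below are illustrative placeholders only - test before production use.
$params = 'HKLM:\SYSTEM\CurrentControlSet\Services\DhcpServer\Parameters'

Set-ItemProperty -Path $params -Name 'DhcpLogFilesMaxSize'       -Value 128  -Type DWord
Set-ItemProperty -Path $params -Name 'DynamicDNSQueueLength'     -Value 1000 -Type DWord
Set-ItemProperty -Path $params -Name 'DnsRegistrationMaxRetries' -Value 3    -Type DWord

Restart-Service DhcpServer   # changes only take effect after a service restart
```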

DHCP Failover Scopes

I’ve covered this topic extensively in my Windows Server 2012 R2 – DHCP High Availability / Fail-over Setup Guide series. Basically, if you are deploying Windows DHCP on a 2012+ server, then you should be using DHCP Failover (not to be confused with split-scope or Microsoft clustering).
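
For context, creating a load-balanced failover relationship for an existing scope is a one-liner with the DHCP cmdlets. A quick sketch (server names, scope ID and shared secret are placeholders):

```powershell
# Create a 50/50 load-balance failover relationship between two DHCP servers
# for an existing scope. Server names, scope ID and shared secret are placeholders.
Add-DhcpServerv4Failover -ComputerName 'dhcp01.corp.local' `
    -Name 'dhcp01-dhcp02-failover' `
    -PartnerServer 'dhcp02.corp.local' `
    -ScopeId 10.0.10.0 `
    -LoadBalancePercent 50 `
    -MaxClientLeadTime 01:00:00 `
    -SharedSecret 'placeholder' `
    -AutoStateTransition $true
```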

Administration Overhead

If you’re working with more than 100 scopes using only the default DHCP MMC snap-in, you’re gonna have a bad time.

Almost 1,000 DHCP scopes, 150k+ IP addresses

Performing administration tasks in the console with a large number of scopes becomes very repetitive and time consuming, as each task normally requires many clicks, and making mass changes is difficult or next to impossible. You may find yourself turning to PowerShell scripting to solve this problem. The DHCP Server cmdlets in Windows PowerShell are very easy to use, and Microsoft has great documentation for them. I found myself writing PowerShell scripts to make mass changes much easier and less vulnerable to human error, given how repetitive the default GUI is. Continue reading…
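
As an example of the kind of mass change I mean, here is a quick sketch that updates the DNS servers handed out by every scope on a server (the server name and addresses are placeholders):

```powershell
# Point every scope on a DHCP server at a new pair of DNS servers.
# Server name and DNS addresses are illustrative placeholders.
$dhcpServer = 'dhcp01.corp.local'
$newDns     = '10.0.0.10', '10.0.0.11'

Get-DhcpServerv4Scope -ComputerName $dhcpServer | ForEach-Object {
    Set-DhcpServerv4OptionValue -ComputerName $dhcpServer -ScopeId $_.ScopeId -DnsServer $newDns
    Write-Host "Updated DNS on scope $($_.ScopeId) ($($_.Name))"
}
```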

Migration from Cisco 1000v to VMware Virtual Distributed Switch (Part 2)


This is part 2 of a series. Click here to see Part 1. I apologise for taking so long to get Part 2 posted. Sometimes I just don’t have the time or energy I’d like to put into the blog.


This portion of the guide focuses on the second half of the VSS to VDS migration. We needed to move the VMs to a VSS so that both the VMs and the hosts could be migrated to the new vCenter cleanly. Now we will be moving the VMs back to a VDS from their VSS configuration.
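
The VM side of that move boils down to re-pointing each VM’s network adapter at the equivalent VDS port group. A rough PowerCLI sketch of the idea (cluster, switch and port group names are placeholders for the ones in this environment):

```powershell
# Re-point every VM network adapter on the old VSS port group to the matching
# VDS port group. Names are placeholders; this assumes the VSS and VDS port
# groups were created with the same name, as they were in this migration.
Connect-VIServer vcenter.lab.local

$vdPortgroup = Get-VDSwitch 'dvSwitch01' | Get-VDPortgroup 'VM-Network'

Get-Cluster 'Production' | Get-VM |
    Get-NetworkAdapter |
    Where-Object { $_.NetworkName -eq 'VM-Network' } |
    Set-NetworkAdapter -Portgroup $vdPortgroup -Confirm:$false
```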

Keep in mind this migration is being done LIVE with production virtual machines running on the hosts. Obviously, this must be executed carefully or you will have a lot of explaining to do. Do not make these changes without understanding the full impact on your environment. Continue reading…