Disaster strikes as NAS3 crashes

This past weekend we had a power brownout for about 4 hours, which caused my servers to fail over to battery power. The batteries don’t last long with servers running. I guess something went sour with the automatic shutdown of NAS3, which I use only for my VMware virtual machines, and it shut down improperly. The RAID array crashed.

I don’t have anyone to blame but myself, and I knew this day would eventually come. NAS3 was in RAID-0, which means striping with no redundancy, so a failed array typically means total data loss. I take backups of this entire NAS nightly, so I was aware of and prepared for the risk of using striping. That doesn’t mean recovering from it is a fun time.

Adding redundancy for blackouts

One of the hardest things to recover from in my home-lab environment is a total power blackout. Everything right now is planned and designed around losing individual components: one disk, one switch or network cable, etc. However, when everything is off and I need to bring it all back online, it’s a painstaking and very manual process. Over time my environment has also become more and more complex. This latest outage has me scratching my head over how to recover from a power blackout faster and more simply.

Planned changes:

  1. Back to standard switching. The VDS (vSphere Distributed Switch) did not want to recover due to a checksum error, so even after I got some VMs back online I couldn’t get their vNICs working. I had to run a cable and set up a temporary VSS (vSphere Standard Switch) network. I assume this was because of the improper shutdown of the host and the lack of vCenter to coordinate recovery of the VDS data. I only run a single host, so the VDS is overkill anyway. Moving back to VSS will make networking much simpler and easier to recover in a blackout scenario (a rough pyVmomi sketch of the VSS setup follows this list).
  2. vCenter on local host storage. I made the mistake of hosting my vCenter Server Appliance (VCSA) on NAS3. While I still had backups, that alone greatly increased recovery difficulty.
  3. DNS on local host storage. The loss of my local DNS also caused various issues and increased recovery difficulty.
  4. RAID-5 on NAS3. Adding RAID redundancy to NAS3 on top of the nightly backups will make me less worried when blackouts or improper shutdowns occur; RAID-0 was just too fragile. RAID-5 stripes data with distributed parity, so the array survives a single disk failure at the cost of one disk’s worth of capacity and some write performance. Trading performance for redundancy should work better long term, because RAID-0 is always a bad time when things go wrong.
  5. Auto-startup/shutdown testing. I need to reconfigure and test the automatic startup and shutdown procedures for power loss (see the autostart sketch after this list).
  6. Hardware Firewall. Moving the firewall to a hardware appliance instead of a virtual machine will also make recovery easier.
  7. Add another UPS. I didn’t have enough battery time to figure out what was going on before systems were going down. Splitting the load across another UPS will help.
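
For anyone who wants to script the VSS rebuild from item 1 instead of clicking through the host client, here is a minimal pyVmomi sketch. It connects straight to the ESXi host (no vCenter required, which is the whole point) and creates a standard switch plus a port group. The host name, credentials, uplink NIC, and switch/port-group names are placeholders, not my actual configuration.

```python
# Minimal pyVmomi sketch: create a vSphere Standard Switch and a port group
# directly on the ESXi host, with no vCenter involved. Host, credentials,
# uplink NIC, and names below are placeholders.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab host with a self-signed cert
si = SmartConnect(host="esxi01.lab.local", user="root", pwd="password", sslContext=ctx)

try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
    net = view.view[0].configManager.networkSystem  # single-host lab: first (only) host

    # New standard switch bonded to a physical uplink
    vss_spec = vim.host.VirtualSwitch.Specification()
    vss_spec.numPorts = 128
    vss_spec.bridge = vim.host.VirtualSwitch.BondBridge(nicDevice=["vmnic1"])
    net.AddVirtualSwitch(vswitchName="vSwitch1", spec=vss_spec)

    # Port group for the VMs that used to live on the VDS
    pg_spec = vim.host.PortGroup.Specification()
    pg_spec.name = "VM Network - Lab"
    pg_spec.vlanId = 0
    pg_spec.vswitchName = "vSwitch1"
    pg_spec.policy = vim.host.NetworkPolicy()
    net.AddPortGroup(portgrp=pg_spec)
finally:
    Disconnect(si)
```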

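The autostart work from item 5 can be scripted the same way. The sketch below enables autostart on the host and registers a start order so DNS comes up before vCenter after a power loss; the host name, credentials, and VM names ("dns01", "vcsa01") are again placeholders.

```python
# Minimal pyVmomi sketch: enable host autostart and set a startup order so the
# DNS VM powers on before the VCSA. Host, credentials, and VM names are placeholders.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="esxi01.lab.local", user="root", pwd="password", sslContext=ctx)

try:
    content = si.RetrieveContent()
    host = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True).view[0]
    vms = {vm.name: vm for vm in content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True).view}

    spec = vim.host.AutoStartManager.Config()
    spec.defaults = vim.host.AutoStartManager.SystemDefaults(
        enabled=True, startDelay=120, stopAction="guestShutdown")

    # Lowest startOrder powers on first: DNS, then vCenter.
    spec.powerInfo = []
    for order, name in enumerate(["dns01", "vcsa01"], start=1):
        spec.powerInfo.append(vim.host.AutoStartManager.AutoPowerInfo(
            key=vms[name], startOrder=order, startAction="powerOn",
            startDelay=-1, stopAction="systemDefault", stopDelay=-1,
            waitForHeartbeat="systemDefault"))

    host.configManager.autoStartManager.ReconfigureAutostart(spec)
finally:
    Disconnect(si)
```
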
On one of my days off I will need to take the lab down to reconfigure most of these items. I think it will help a great deal. During this latest failure I didn’t have the patience to spend three hours recovering a pile of infrastructure just to get the internet working.

Future Planned Upgrades

I am planning to add a Synology DS1817+ to the home-lab in the next year. This will give me much greater raw capacity, as NAS1 is currently almost full. Eight bays instead of five means more options for RAID, higher capacity, and better performance. The option to add 10GbE to the DS1817+ will also future-proof this NAS for years to come.


As always, thanks for reading. Leave any comments or questions below.

Karl has been involved in the virtualization, server, web development, and web hosting industry for over 15 years. In his current role at a managed service provider, he is focused on cloud-based solutions for enterprise clients. His diverse background in sales, management, and architectural/technical expertise brings a unique perspective to the virtualization practice.