Disaster strikes as NAS3 crashes

This past weekend we had a power brownout for about 4 hours, which caused my servers to fail over to battery power. The batteries don’t last long with servers running. I guess something went sour with the automatic shutdown of NAS3, which I use only for my VMware virtual machines, and it shut down improperly. The RAID array crashed.

I don’t have anyone to blame but myself, and I knew this day would eventually come. NAS3 was in RAID-0, which means striping with no redundancy, so a failed array typically means total data loss. I take backups of this entire NAS nightly, so I was aware of and prepared for the risk of using striping. That doesn’t mean recovering from it is a fun time.

Adding redundancy for blackouts

One of the hardest things to recover from in my home-lab environment is a total power blackout. Everything right now is planned and designed around losing individual components: one disk, one switch or network cable, etc. However, when everything is off and I need to bring it all back online, it’s a painstaking and very manual process. Over time my environment has also become more and more complex. This latest outage has me scratching my head over how to recover from a power blackout faster and more simply.

Planned changes:

  1. Back to standard switching. The VDS (vSphere Distributed Switch) did not want to recover due to a checksum error, so even after I got some VMs back online I couldn’t get their vNICs working. I had to run a cable and set up a temporary VSS (vSphere Standard Switch) network. I assume this was because of the improper shutdown of the host and the lack of vCenter to coordinate recovery of the VDS data. I only run a single host, so the VDS is overkill anyway. Moving back to VSS will make networking much simpler and easier to recover in a blackout scenario (a rough pyVmomi sketch of the VSS setup follows this list).
  2. vCenter on local host storage. I made the mistake of hosting my vCenter Server Appliance (VCSA) on NAS3. While I still had backups, that alone greatly increased recovery difficulty.
  3. DNS on local host storage. The loss of my local DNS also caused various issues and increased recovery difficulty.
  4. RAID-5 on NAS3. Adding RAID redundancy to NAS3 on top of the nightly backups will make me less worried when blackouts or improper shutdowns occur; RAID-0 was just too fragile. RAID-5 stripes data with distributed parity, so the array survives a single disk failure at the cost of one disk’s worth of capacity and some write performance. Trading performance for redundancy should work better long term, because RAID-0 is always a bad time when things go wrong.
  5. Auto-startup/shutdown testing. I need to reconfigure and test the automatic startup and shutdown procedures for power loss (see the autostart sketch after this list).
  6. Hardware Firewall. Moving the firewall to a hardware appliance instead of a virtual machine will also make recovery easier.
  7. Add another UPS. I didn’t have enough battery time to figure out what was going on before systems were going down. Splitting the load across another UPS will help.
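
For anyone who wants to script the VSS rebuild from item 1 instead of clicking through the host client, here is a minimal pyVmomi sketch. It connects straight to the ESXi host (no vCenter required, which is the whole point) and creates a standard switch plus a port group. The host name, credentials, uplink NIC, and switch/port-group names are placeholders, not my actual configuration.

```python
# Minimal pyVmomi sketch: create a vSphere Standard Switch and a port group
# directly on the ESXi host, with no vCenter involved. Host, credentials,
# uplink NIC, and names below are placeholders.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab host with a self-signed cert
si = SmartConnect(host="esxi01.lab.local", user="root", pwd="password", sslContext=ctx)

try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
    net = view.view[0].configManager.networkSystem  # single-host lab: first (only) host

    # New standard switch bonded to a physical uplink
    vss_spec = vim.host.VirtualSwitch.Specification()
    vss_spec.numPorts = 128
    vss_spec.bridge = vim.host.VirtualSwitch.BondBridge(nicDevice=["vmnic1"])
    net.AddVirtualSwitch(vswitchName="vSwitch1", spec=vss_spec)

    # Port group for the VMs that used to live on the VDS
    pg_spec = vim.host.PortGroup.Specification()
    pg_spec.name = "VM Network - Lab"
    pg_spec.vlanId = 0
    pg_spec.vswitchName = "vSwitch1"
    pg_spec.policy = vim.host.NetworkPolicy()
    net.AddPortGroup(portgrp=pg_spec)
finally:
    Disconnect(si)
```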

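The autostart work from item 5 can be scripted the same way. The sketch below enables autostart on the host and registers a start order so DNS comes up before vCenter after a power loss; the host name, credentials, and VM names ("dns01", "vcsa01") are again placeholders.

```python
# Minimal pyVmomi sketch: enable host autostart and set a startup order so the
# DNS VM powers on before the VCSA. Host, credentials, and VM names are placeholders.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="esxi01.lab.local", user="root", pwd="password", sslContext=ctx)

try:
    content = si.RetrieveContent()
    host = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True).view[0]
    vms = {vm.name: vm for vm in content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True).view}

    spec = vim.host.AutoStartManager.Config()
    spec.defaults = vim.host.AutoStartManager.SystemDefaults(
        enabled=True, startDelay=120, stopAction="guestShutdown")

    # Lowest startOrder powers on first: DNS, then vCenter.
    spec.powerInfo = []
    for order, name in enumerate(["dns01", "vcsa01"], start=1):
        spec.powerInfo.append(vim.host.AutoStartManager.AutoPowerInfo(
            key=vms[name], startOrder=order, startAction="powerOn",
            startDelay=-1, stopAction="systemDefault", stopDelay=-1,
            waitForHeartbeat="systemDefault"))

    host.configManager.autoStartManager.ReconfigureAutostart(spec)
finally:
    Disconnect(si)
```
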
On one of my days off I will need to take the lab down to reconfigure most of these items. I think it will help a great deal. During this latest failure I didn’t have the patience to spend three hours recovering a pile of infrastructure just to get the internet working.

Future Planned Upgrades

I am planning to add a Synology DS1817+ to the home-lab in the next year. This will give me much greater raw capacity, as NAS1 is currently almost full. Eight bays instead of five means more options for RAID, higher capacity, and better performance. The option to add 10GbE to the DS1817+ will also future-proof this NAS for years to come.


As always, thanks for reading. Leave any comments or questions below.

Karl has been involved in the virtualization, server, web development, and web hosting industry for over 15 years. In his current role at a managed service provider, he is focused on cloud-based solutions for enterprise clients. His diverse background in sales, management, and architectural/technical expertise brings a unique perspective to the virtualization practice.