I'm a huge fan of server virtualization for mixed-purpose hosting. It's not a perfect fit for every situation, but it's very versatile. The portability of guest operating systems improves availability and recovery over bare metal right out of the gate, and with a little work you can boost the robustness of your hosting environment considerably.
There are two topics I want to cover that you should be considering for your virtualized environment: host node clustering and shared storage. In Windows terminology these are referred to as a High Availability Cluster and Cluster Shared Volumes.
High Availability Cluster
A High Availability Cluster is a group of two or more bare metal servers used to host virtual machines. The server nodes (physical machines) work together to provide redundancy and failover for your virtual machines with little to no downtime on the VMs. They can also be used to get the most out of your server hardware by allocating VMs to the node with the lowest current workload.
A Hyper-V cluster is established by installing the Failover Clustering feature on each server node in the group. You then use the Failover Cluster Manager tool to create your cluster and join server nodes to it.
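If you prefer to script it, the same steps map onto the Failover Clustering PowerShell cmdlets. Here's a minimal sketch; the node names, cluster name, and IP address are placeholders for your own environment.

```powershell
# Install the Failover Clustering feature (and management tools) on each node.
Install-WindowsFeature -Name Failover-Clustering -IncludeManagementTools

# Validate the proposed configuration before committing to it.
Test-Cluster -Node "HV-NODE1", "HV-NODE2"

# Create the cluster and join both nodes to it.
New-Cluster -Name "HV-CLUSTER" -Node "HV-NODE1", "HV-NODE2" -StaticAddress "10.0.10.50"
```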
It's basically as easy as it sounds, but there are a couple of key requirements and decisions to make before you establish your cluster. First, you need a whole bunch of NICs in each server node; the recommended minimum is four:
- #1 - WAN connection
- #2 - Cluster Heartbeat
- #3 - Live Migration
- #4 - Shared Storage Network
You may want even more so that you can enable MPIO on your storage network and potentially dedicate a management NIC to your bare metal server. In my case I went with four, but I used two for the SAN with MPIO and combined the cluster heartbeat and live migration traffic onto a single NIC, which has worked without issue. The WAN, SAN, and cluster NICs should each be on different networks/subnets.
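If you take the MPIO route for the storage network, the per-node setup looks roughly like the sketch below, which assumes the in-box Multipath I/O feature and the Microsoft DSM. Once the cluster exists, you can also sanity-check that each traffic type landed on its own cluster network.

```powershell
# Add the Multipath I/O feature and let the Microsoft DSM claim iSCSI paths.
Install-WindowsFeature -Name Multipath-IO
Enable-MSDSMAutomaticClaim -BusType iSCSI

# After the cluster is built, confirm that WAN, cluster, and storage traffic
# ended up on separate cluster networks.
Get-ClusterNetwork | Format-Table Name, Role, Address, AddressMask
```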
Second, you need to figure out your storage solution, which is the next topic of discussion. For a cluster to be effective, each node needs to be able to access the same storage location(s) simultaneously. This is achieved using a Cluster Shared Volume, or CSV.
Cluster Shared Volume
A CSV is a disk or pool of disks which is accessible by each node as if it were a logical disk on the system. There are a variety of configurations to accomplish this and it's an absolutely critical piece of the puzzle. The shared storage system is the foundation of a good virtualized environment - and it must be rock solid.
When establishing a CSV, the two most common configurations are an iSCSI LUN and the newer SMB 3.0 storage protocol. There is a lot of old information out on the web regarding VM storage that no longer applies today, which makes finding the right recommendations tricky. If you're using Windows Server 2012 or later, though, the options to consider are SMB 3.0 or an iSCSI setup with a single LUN (plus maybe an extra LUN for the quorum witness). There are some compelling reasons to choose SMB 3.0, especially if you need flexible scale-out storage capability. The latest advances in the protocol have brought performance to nearly the same level as direct attached storage, which is crazy.
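For the iSCSI route, each node connects to the same target using the in-box iSCSI initiator cmdlets. A sketch, assuming that setup; the portal address and target IQN are placeholders:

```powershell
# Point the initiator at the SAN and list the targets it exposes.
New-IscsiTargetPortal -TargetPortalAddress "10.0.20.10"
Get-IscsiTarget

# Connect persistently, with multipath enabled to pair with the MPIO setup above.
Connect-IscsiTarget -NodeAddress "iqn.2000-01.com.example:vm-storage" `
    -IsPersistent $true -IsMultipathEnabled $true
```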
Regardless of which route you choose, the functional requirement is the same: each node in the cluster must be able to connect to the storage volume simultaneously. This gives you a common storage location for the VM disks and machine configuration, which can be handed to another node in the event of a node failure without manually mounting a volume or copying files. Normally, allowing simultaneous connections to a volume like this would result in data conflicts and corruption, but in an HA cluster this is accounted for by way of a coordinator node and a quorum disk.
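Once the shared disk is visible to every node and added to the cluster, turning it into a CSV and pointing the quorum at a small witness disk is only a couple of cmdlets. The disk names below are whatever names the cluster assigned on your system.

```powershell
# Convert an available cluster disk into a Cluster Shared Volume.
Add-ClusterSharedVolume -Name "Cluster Disk 1"

# Use a second, small LUN as the disk witness for quorum.
Set-ClusterQuorum -DiskWitness "Cluster Disk 2"
```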
Failover and High Availability
Once you have your shared storage in place and your nodes joined to a cluster, you're ready to migrate your virtual machines into the cluster and make them highly available. You migrate a VM to a cluster the same way you migrate one to any Hyper-V host; just choose a host that's part of the cluster.
With a VM running on your cluster and its disk resources hosted on your CSV, you can now add the VM to the cluster under the Virtual Machine role. Doing so adds failover capability for that VM.
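In PowerShell terms this is one cmdlet per VM; the VM name here is just an example.

```powershell
# Register an existing VM (already stored on the CSV) as a clustered role,
# making it eligible for failover.
Add-ClusterVirtualMachineRole -VMName "WEB01"
```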
In a failover scenario, one node will lose the heartbeat signal from another node that has gone offline. The coordinator node then transfers ownership of the VMs that were running on the offline node to another node that is still online, and that node takes over hosting them. The process can take a minute, but there is no need to copy the VM disks anywhere since all nodes are connected to the same storage volume. Usually an end user will notice little more than a brief interruption to the VM being failed over.
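You can also rehearse this on your own terms by draining a node before planned maintenance. A sketch, with example node names:

```powershell
# Live-migrate everything off the node and pause it (a planned "failover").
Suspend-ClusterNode -Name "HV-NODE1" -Drain

# Watch which node currently owns each clustered role.
Get-ClusterGroup | Format-Table Name, OwnerNode, State

# Bring the node back and pull its roles back onto it.
Resume-ClusterNode -Name "HV-NODE1" -Failback Immediate
```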
Another useful capability that HA clusters provide is something called Cluster-Aware Updating. With this feature enabled, each node takes its turn running Windows updates and rebooting to complete the process, while VMs are automatically migrated around the cluster to keep everything online. It's a pretty nice feature, but one I've been too scared to enable so far.
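For what it's worth, if you do decide to turn it on, it's another couple of cmdlets; a sketch, reusing the example cluster name from earlier:

```powershell
# Add the Cluster-Aware Updating clustered role in self-updating mode.
Add-CauClusterRole -ClusterName "HV-CLUSTER" -EnableFirewallRules -Force

# Or kick off a one-time updating run on demand instead.
Invoke-CauRun -ClusterName "HV-CLUSTER" -Force
```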
Weak Points
A High Availability Cluster is a good start to adding some failover to your virtual environment, but of course there are many points of failure still remaining. The biggest one is the shared storage solution: if that goes offline, all the cluster nodes in the world can't help you. That also means everything between the nodes and the storage volume is a point of failure as well: the switch, the network cables, and the NICs themselves. The only real way to protect against these things is to have two of everything, but the complexity increases greatly. One step at a time, though, unless you have deep pockets. With our shallow ones, we keep cold spares of key components so that we can at least minimize the impact of a critical hardware failure.
This story, "What is a Windows Hyper-V High Availability Cluster?" was originally published by ITworld.