For more than 20 years we have been using Layer 3 connectivity powered by dynamic routing protocols to route traffic between data centers, but adoption of virtualization and geo-clustering technologies is forcing us to re-examine our data center interconnect (DCI) models.
New data center: The virtual blind spot
Harnessing the power of virtualization allows organizations to view and treat their compute resources as a global resource pool unconstrained by individual data center boundaries. Resources can span multiple buildings, metro areas or theoretically even countries. This basically means you can increase collective compute power in a data center that needs it by "borrowing" the resources from a data center that has spare capacity at the moment. This task is achieved by moving virtual machines between the data centers.
All major virtualization vendors, such as VMware, Xen and Microsoft, support the concept of virtual machine live migration, where you can move live VMs from one physical host (server) to another without powering them down or breaking application connectivity (there is a short pause, but not long enough for TCP sessions to be torn down).
Now the question is, what happens to the network settings -- specifically IP address/Subnet Mask/Default Gateway -- of the VM when it moves from Data Center A to Data Center B?
The answer is they remain the same. Well, to be precise, they remain the same when live migration is performed. If, however, the VM is powered down in Data Center A, copied in the down state to Data Center B and then powered up, the server administrator will have to change the IP address on the operating system running inside the VM to match the settings required at the destination Data Center B.
This, however, is not a very elegant solution, because it requires all connections to be re-established, not to mention the application mess an IP address change can create, since we all know how developers like to use static IP addresses rather than DNS names. So for the sake of our discussion, we will assume the IP address stays the same while live VMs move between data centers.
At the network level, both the source and destination data centers now need to accommodate the same IP subnets where the VMs are located. Traditionally, having the same IP subnet appear in both data centers would be considered a misconfiguration or a really bad design. It also means that Layer 2, aka VLANs, needs to be extended between these data centers, and this constitutes a major change in the way data center interconnection has traditionally been done.
The other development that is forcing us to re-examine our data center interconnect models is geo-clustering, which uses existing application clustering technologies while placing the servers that make up the cluster in separate data centers. The biggest rationale for doing this is to achieve very quick Disaster Recovery (DR). After all, it only takes a cluster failover to resume service out of the DR data center.
Failover usually means the standby cluster member, the one in the DR site, takes over the IP address of the previously active cluster member so that TCP sessions survive. At the network layer, this again means that both data centers hosting cluster members need to accommodate the same IP subnets and extend the VLANs to which those clustered servers are connected.
Some clustering solutions, for example Microsoft's Windows Server 2008 failover clustering, actually allow cluster members to be optionally separated by Layer 3, meaning that they do not have to belong to the same IP subnet and reside in the same VLAN. These solutions, however, rely on Dynamic DNS updates and the DNS infrastructure to propagate the new name-to-IP address mapping across the network once cluster failover occurs and the application's IP address changes. This introduces another layer of complexity and punches a hole in the concept of using geo-clustering for quick DR.
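To make the DNS dependency concrete, here is a minimal sketch of the kind of dynamic DNS update a cluster would push after failover, assuming the dnspython library and purely hypothetical zone, key and address values:

```python
# Minimal sketch of a dynamic DNS update after cluster failover (dnspython).
# The zone, TSIG key and IP addresses below are hypothetical.
import dns.query
import dns.tsigkeyring
import dns.update

keyring = dns.tsigkeyring.from_text({"failover-key.": "bWFkZSB1cCBzZWNyZXQ="})

# Repoint the application's name at the now-active cluster member in the DR site.
update = dns.update.Update("example.com", keyring=keyring)
update.replace("app", 300, "A", "10.2.0.25")        # new IP in Data Center B

response = dns.query.tcp(update, "10.0.0.53")       # authoritative DNS server
print(response.rcode())                              # 0 (NOERROR) if accepted
```

Until that record change propagates and client caches expire, clients may still try the old address, which is exactly the extra complexity and delay mentioned above.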
What we can do
Now that we understand why Data Center Interconnect is morphing, let's look at the technologies in the network designer's toolkit that can help solve the puzzle of extending IP subnets and VLANs across multiple data centers.
* Port Channel: Data centers can be interconnected using port channeling technology, which can run on top of dark fiber (due to distances, copper cabling is highly unlikely), xWDM wavelengths or Ethernet over MPLS (Layer 2) pseudo-wires. Port Channels can be either statically established or dynamically negotiated using the LACP (IEEE 802.3ad) protocol.
* Multi-Chassis Port Channel: Multi-Chassis Port Channel is a special case where port channel member links are distributed across multiple switches. The added value is, of course, that the port channel can survive an entire switch failure. One popular implementation of this technology is from Cisco, using Virtual Switching System (VSS) on Catalyst 6500 series switches or a virtual Port Channel (vPC) on the Nexus 7000 and 5000 series switches. Nortel has a similar implementation called Split Multi-Link Trunking (SMLT). Multi-Chassis port channel can run over the same set of technologies as the traditional port channel and can also be either statically established or dynamically negotiated.
With both Port Channels and Multi-Chassis Port Channels, VLAN extension is achieved by forwarding (also referred to as trunking) the VLANs between data centers across the port channel. Without special configuration, the Spanning Tree Protocol (STP) is extended by default, merging the STP domains of both data centers.
This is most often an unwanted outcome, because issues in one data center, such as the infamous Layer 2 loops, can propagate across and impact the other data center. Methods of filtering Bridge Protocol Data Units (BPDUs) are usually employed to isolate the STP domains, and thus the fault domains, between the data centers. Media access control (MAC) reachability information is derived by flooding unknown unicast traffic across the port-channeled links, which is a common but not particularly efficient way of learning MAC addresses in a transparent bridging environment. The use of Port Channels as DCI is relatively simple and intuitive; however, it scales poorly beyond two data centers and as such does not fit well in larger DCI deployments.
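As a toy illustration (not any vendor's implementation), the sketch below shows why transparent bridging falls back on flooding: a frame destined to a MAC address that has not yet been learned is sent out every other port, including the DCI links.

```python
# Toy model of flood-based MAC learning in a transparent bridge -- the default
# behavior an extended-VLAN DCI inherits. Ports and MACs are made up.
mac_table = {}  # MAC address -> port it was last seen on

def handle_frame(src_mac, dst_mac, in_port, all_ports):
    mac_table[src_mac] = in_port                 # learn where the source lives
    if dst_mac in mac_table:                     # known destination: one port
        return [mac_table[dst_mac]]
    # Unknown destination: flood out every other port, DCI links included,
    # which is what consumes interconnect bandwidth.
    return [p for p in all_ports if p != in_port]

ports = ["local-1", "local-2", "dci-link"]
print(handle_frame("aa:aa", "bb:bb", "local-1", ports))   # flooded to local-2 and dci-link
print(handle_frame("bb:bb", "aa:aa", "dci-link", ports))  # now known: local-1 only
```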
* Ethernet over MPLS: This is the oldest of the pseudo-wire technologies in which an MPLS backbone is used to establish a logical conduit, called a pseudo-wire, to tunnel Layer 2 -- in this case Ethernet -- frames across. EoMPLS is also sometimes referred to as a Layer 2 VPN.
Layer 2 Ethernet frames are picked up on one side, encapsulated, label switched across the MPLS backbone, decapsulated on the other side and then forwarded as native Layer 2 Ethernet frames to their destination. Frames can also carry an 802.1Q trunking tag if you want to transport multiple VLANs across the same pseudo-wire.
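For illustration only, here is a rough scapy sketch of that encapsulation order; the labels, VLAN and addresses are invented, and real PE routers of course do this in hardware:

```python
# Rough sketch of EoMPLS encapsulation layering with scapy (values invented).
from scapy.all import Ether, Dot1Q, IP
from scapy.contrib.mpls import MPLS

# The customer's frame, optionally 802.1Q tagged to carry the VLAN ID.
customer_frame = (Ether(src="00:00:00:aa:00:01", dst="00:00:00:bb:00:02") /
                  Dot1Q(vlan=100) /
                  IP(src="10.1.100.10", dst="10.1.100.20"))

# On the backbone: outer Ethernet header, a transport label to cross the MPLS
# core, and a pseudo-wire label (bottom of stack, s=1) identifying the circuit.
backbone_frame = (Ether(src="00:00:00:01:00:01", dst="00:00:00:01:00:02") /
                  MPLS(label=16, s=0) /
                  MPLS(label=100, s=1) /
                  customer_frame)

print(len(bytes(backbone_frame)) - len(bytes(customer_frame)))  # 22 bytes added
```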
EoMPLS pseudo-wires make both sides appear as if they were connected with a long-reach physical cable. If you're thinking "that sounds similar to Port Channels", you are right. Setting aside the MPLS backbone in between, EoMPLS pseudo-wires share similar characteristics with the Port Channels-based DCI solution we talked about earlier.
Just like with Port Channels, BPDUs are forwarded by default across the pseudo-wires, merging the STP domains, and BPDU filtering can be employed to prevent that. MAC learning is still done through flooding, so EoMPLS does not change that inefficient concept.
In fact, the two technologies can even be layered on top of each other; as briefly mentioned before, Port Channels are at times built across EoMPLS pseudo-wires to deliver DCI connectivity and VLAN extension.
If an MPLS backbone is too much for you to handle, EoMPLS can run on top of a regular IP backbone using GRE tunneling. Keep in mind that MPLS label exchange still occurs across the GRE tunnels, so with EoMPLSoGRE we now have one more protocol layer to troubleshoot and account for, but the upside is that there is no MPLS backbone to maintain.
The use of GRE tunneling also has implications for the Maximum Transmission Unit (MTU) size that must be supported across the IP backbone, since GRE adds 24 bytes of overhead (20 bytes of outer IP header + 4 bytes of GRE header) per packet, on top of the encapsulated MPLS label stack.
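As a quick back-of-the-envelope check, assuming a 1,500-byte customer payload and a two-label MPLS stack (your label stack depth may differ), the MTU the IP backbone must support works out roughly as follows:

```python
# Rough core MTU calculation for EoMPLSoGRE; the label stack depth and optional
# fields are assumptions, so treat the result as a planning estimate.
customer_payload = 1500        # customer IP packet, bytes
inner_ethernet   = 14          # encapsulated Ethernet header
dot1q_tag        = 4           # optional 802.1Q tag, if VLANs are trunked
mpls_labels      = 2 * 4       # e.g. pseudo-wire + transport label, 4 bytes each
gre_header       = 4           # GRE header
outer_ip         = 20          # outer IP header carrying the GRE tunnel

required_core_mtu = (customer_payload + inner_ethernet + dot1q_tag +
                     mpls_labels + gre_header + outer_ip)
print(required_core_mtu)       # 1550
```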
* VPLS: Virtual Private LAN Services extends EoMPLS by allowing multipoint connectivity, which is achieved through a set of pseudo-wires running between VPLS Provider Edge (PE) routers. Pseudo-wire endpoints can either be statically defined or automatically discovered using Multi-Protocol BGP (MP-BGP).
VPLS provides STP domain isolation by default, which is an improvement over EoMPLS and Port Channel DCI; however, achieving edge redundancy with VPLS is no easy task, and network designers need to be crafty to make sure that inter-data center loops are broken.
VPLS brings no good news about MAC address learning, which is still achieved by flooding unknown unicast traffic across the pseudo-wires throughout the network. However, once properly tuned, VPLS provides quite an effective data center interconnect.
Just like EoMPLS, VPLS has a VPLSoGRE variant for non-MPLS environments, and just like EoMPLSoGRE, it adds 24 bytes of GRE overhead compared with traditional VPLS, so the interface MTU needs to be properly planned across the backbone.
One more interesting point: when MP-BGP is used to automatically discover pseudo-wire endpoints, the GRE tunnels still need to be created manually, which undermines the advantages of using MP-BGP in a VPLSoGRE deployment.
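Assuming the usual full mesh of pseudo-wires between participating PEs, a quick calculation shows why manual pseudo-wire (and, for VPLSoGRE, GRE tunnel) configuration becomes tedious as the number of data centers grows:

```python
# Number of pseudo-wires (or GRE tunnels) in a full mesh of VPLS PE routers.
def full_mesh_count(pe_count: int) -> int:
    return pe_count * (pe_count - 1) // 2

for n in (2, 3, 5, 8):
    print(n, "PEs ->", full_mesh_count(n), "pseudo-wires")
# 2 PEs -> 1, 3 PEs -> 3, 5 PEs -> 10, 8 PEs -> 28
```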
Two proprietary approaches
All Layer 2 Data Center Interconnect technologies discussed so far are industry standards. Let us now look into two innovative proprietary technologies from Cisco.
* Advanced VPLS: A-VPLS is another variant of VPLS technology, but it introduces several new properties that make it stand out. First, it mitigates the difficulties of providing DCI edge redundancy without resorting to any fancy scripting mechanisms. To achieve that, A-VPLS builds on top of Cisco's Virtual Switching System (VSS-1440) available in Cisco Catalyst 6500 switches.
Second, it utilizes port channel hashing techniques, which take Layer 2, Layer 3 and Layer 4 information into account to determine the DCI edge switch's outbound interface. This allows excellent traffic load-sharing between data centers over multiple DCI links (a simple sketch of this hashing idea appears after this list of enhancements).
Third, as packets traverse DCI links, A-VPLS introduces optional MPLS flow labels to further improve traffic load-balancing through the label switched core.
Fourth, it significantly simplifies user configuration, which intuitively resembles configuring a trunk interface on Cisco switches, with the addition of a few MPLS-related commands. There is no longer a need for per-VLAN VFI configuration, which is a huge time saver and reduces the chance of human error.
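As mentioned in the second point above, per-flow hashing is what spreads traffic across multiple DCI links. Here is an illustrative sketch (not Cisco's actual hashing algorithm) of how mixing Layer 2, Layer 3 and Layer 4 fields into a flow hash achieves load-sharing while keeping any single flow on one link:

```python
# Illustrative per-flow link selection; the hash and link names are invented
# and do not represent Cisco's implementation.
import zlib

dci_links = ["Te1/1", "Te1/2", "Te2/1", "Te2/2"]   # hypothetical member links

def pick_link(src_mac, dst_mac, src_ip, dst_ip, src_port, dst_port):
    flow_key = f"{src_mac}{dst_mac}{src_ip}{dst_ip}{src_port}{dst_port}"
    return dci_links[zlib.crc32(flow_key.encode()) % len(dci_links)]

# The same flow always hashes to the same link; a different flow may use another.
print(pick_link("aa:aa", "bb:bb", "10.1.1.10", "10.2.2.20", 49152, 443))
print(pick_link("aa:aa", "bb:bb", "10.1.1.10", "10.2.2.20", 49153, 443))
```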
As far as underlying network connectivity prerequisites go, A-VPLS can run over diverse transports, with three clearly identifiable options: a) a Layer 2 Core or no Core at all, b) a Layer 3 Label Switched Core and c) a Layer 3 non-Label Switched Core.
In the case of a Layer 2 Core, A-VPLS can run on top of technologies such as EoMPLS and even VPLS (yes, A-VPLSoVPLS). If the network is simplified to have no DCI Core at all, then dark fiber or an xWDM wavelength can be used for back-to-back connectivity between A-VPLS PEs. The Layer 3 Label Switched Core option can make use of a traditional MPLS VPN service interconnecting the A-VPLS sites, in which case some label exchange will happen between the MPLS VPN PEs and the A-VPLS PEs (for the Loopback interfaces of the A-VPLS PE routers).
Finally, a Layer 3 non-Label Switched Core makes use of GRE tunnels created between all participating A-VPLS PEs, while label switching occurs inside the GRE encapsulation. Mimicking VPLS behavior, Spanning Tree environments are automatically isolated between the data centers to limit the Layer 2 fault domain and prevent one data center from being impacted by STP issues in another.
MAC address learning is again done through unknown unicast flooding; after all, A-VPLS is a VPLS variant, and that behavior does not change. Flooded traffic does consume some DCI bandwidth; however, MAC address tables are populated very quickly, before this traffic causes any concern, so this is normally a non-issue. Even with the current requirement of Catalyst 6500 switches (Sup720) and SIP-400 line cards to make use of A-VPLS technology (this will change with time), it is an excellent choice for efficient Layer 2 DCI.
* Overlay Transport Virtualization: OTV is an emerging technology that is unlike any other data center interconnect solution we have discussed so far. You could call it an evolutionary protocol, one that takes the lessons learned from past DCI solutions and integrates them into its inherent operation. It is transport agnostic and can be equally supported over an IP or MPLS backbone, which gives it great versatility.
Cisco calls the underlying concept of OTV traffic forwarding "MAC routing", since it behaves as if you are routing Ethernet frames over the DCI transport. OTV uses a control plane protocol to proactively propagate MAC address reachability before traffic is allowed to pass, which eliminates the dependency on flooding either to learn MAC addresses or to forward unknown unicast traffic.
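Conceptually (this is an illustration of the idea only, not OTV's actual control protocol), MAC routing replaces flood-and-learn with a table populated from control-plane advertisements, so frames toward MAC addresses that were never advertised are simply not flooded across the interconnect:

```python
# Conceptual sketch of control-plane MAC reachability ("MAC routing").
# Device names and MAC addresses are made up.
mac_routes = {}   # MAC address -> remote edge device that advertised it

def receive_advertisement(edge_device, mac_addresses):
    # Reachability is learned proactively from the control plane,
    # not by flooding data traffic.
    for mac in mac_addresses:
        mac_routes[mac] = edge_device

def forward(dst_mac):
    # No flood-on-miss: unadvertised destinations are not sent across the DCI.
    return mac_routes.get(dst_mac, "drop (unknown, not flooded)")

receive_advertisement("DC-B-edge", ["00:11:22:33:44:55"])
print(forward("00:11:22:33:44:55"))   # DC-B-edge
print(forward("de:ad:be:ef:00:01"))   # drop (unknown, not flooded)
```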