Everything You Ever Wanted to Know About Disaster Recovery Networking
“The replication of data and applications is meaningless without a proven Disaster Recovery strategy. Network and security are at the heart of every successful BC/DR plan, and implementing a Disaster Recovery strategy without ensuring end-users can consume applications in the same manner, and with the same security precautions as production, can put an organization at serious risk. ‘Everything You Ever Wanted to Know About Disaster Recovery Networking’ provides a needed and comprehensive roadmap for organizations looking to ensure their Disaster Recovery environment is fully consumable during a DR event within their current network and security frameworks without needing to make major changes.”
• Danny Allan, CTO and SVP of Product Strategy at Veeam®
Introduction
Most strategies, service offerings, and software tools related to Disaster Recovery focus on the process by which critical server images and unstructured data are continuously replicated from the production site to a DR location. Ensuring that virtual and physical servers, as well as file, block, and object storage, can be copied to meet required RPOs is no easy task, and it often requires multiple tools working together. One can feel quite accomplished when able to attest that all servers boot up properly at the DR site with all configurations and data intact.
For many organizations, the quest to achieve some sort of Disaster Recovery capability stops here. In reality, many administrators don’t really believe the DR site will ever be used. Having recent data and bootable servers is often considered “good enough.” In the unlikely event that failover is ever required, stakeholders, customers, and staff would presumably tolerate a few more hours of manual configuration changes and tweaking to make the applications usable, even if in a degraded state.
However, this logic is flawed. Many administrators don’t appreciate how deeply their application delivery depends upon complex network and security configurations and policies. What good is having copies of your critical applications and data at the DR site if your users can't access them the same way they can from the production sites?
Replicating data is the easy part; tools like Veeam make it very straightforward to replicate virtual and physical servers, and even NAS. The true make-or-break for your DR strategy lies within the network.
Enough with the high-level context, let’s dive into some network plumbing.
Failover & Testing Scenarios
Before any DR strategy can be conjured up, you must first identify the type of failure events you’re trying to protect against:
Is it a full failure of the production site?
The ability to failover specific applications to the DR site while the rest runs from production?
Booting a single server at the DR site and integrating it back with production?
Failing over to the DR site to keep your applications running so you can apply weekly patches?
Or maybe you’d like to get fancy and run some sort of A/B configuration, switching back and forth between production and DR every 2 weeks.
My guess is you answered “all of the above”, which is typical.
The thing is, depending on the complexity and maturity of your existing network configuration, each of those “wants” can be increasingly challenging. We typically recommend a phased approach, starting with the full site failover capability, then the partial capability, and so on.
Let’s take a deeper look at each scenario and its unique network requirements.
Full Site Failover
The full site failover scenario is the easiest to protect against because it allows us to perform the network failover/failback at a macro-level. We don’t have to get into the weeds of the various internal networks or subnetting; it’s just a matter of failing “everything” over and “everything” back. The firewall and router configurations at the DR site can easily match production in almost every sense. Change-control is straightforward or automated, and we typically don't have to force any changes like IP renumbering at the production site to provide this capability. Thus, the full-site failover capability should be your easiest achievable goal when it comes to DR networking.
Additionally, in the context of internal network integration (which we’ll dive into deeper below), if the production site speaks to other sites via VPN or dynamic routing, we simply advertise the production IP space from the DR site, which is relatively easy and straightforward. Likewise, if you happen to be advertising your own publicly routable address space, you can easily advertise it from the DR site as part of your DR strategy.
Partial Site Failover
Here’s where things start to get interesting. Partial failover means the ability to failover a specific “application” to the DR site, while not touching the others running at production.
In the context of a DR conversation, we talk more about “applications” and not so much about servers. Application refers to a bundle of servers and unstructured data required for a single application to function properly. There is an assumption that shared dependency services (i.e. all of the networking, authentication, and security services required for the applications to function properly) are already present at the DR site, or will be booted prior to an application failover.
Most modern data movers, such as Veeam, will replicate servers in these “groupings” so that all of the data and servers within each bundle are consistent between themselves. Veeam refers to them as “Failover Plans”. Other tools also have similar constructs.
Here’s an example:
From a network perspective, if we wanted to failover one “application” and not the others, the network resources per application would need to be clearly segmented so that we could easily migrate them between the sites. For example, if we have an MPLS network between our locations running dynamic routing, we could advertise the more specific subnet for one application from the DR site while leaving everything else at production:
Under normal circumstances, production will advertise 192.168.1.0/24. If we need to failover “Accounts_Receivable_SW” to DR, we can advertise the more specific 192.168.1.0/25 from the DR site and leave production intact. Pretty clean and straightforward.
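If you want to sanity-check this route logic, Python's standard ipaddress module makes the longest-prefix-match behavior easy to demonstrate. The subnets below come from the example above; the host IP is purely illustrative:

```python
import ipaddress

# Subnets from the example: production advertises the /24 supernet,
# while DR advertises the application-specific /25 during a partial failover.
production_route = ipaddress.ip_network("192.168.1.0/24")
dr_route = ipaddress.ip_network("192.168.1.0/25")

# An illustrative host inside the failed-over application's range.
host = ipaddress.ip_address("192.168.1.42")

# Routers pick the matching route with the longest prefix, so once DR
# advertises the /25, traffic for this host shifts to the DR site.
routes = {production_route: "production", dr_route: "DR"}
best = max((net for net in routes if host in net), key=lambda n: n.prefixlen)
print(f"{host} -> {routes[best]} via {best}")  # 192.168.1.42 -> DR via 192.168.1.0/25
```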
If this is how your production network is already configured, then hats off to you. You will have an easy time providing per-application partial failover.
However, if you’re like many of the customers I’ve helped, this is simply not the case. Most enterprises have a good amount of technical debt and legacy environments, including large subnets housing tens or hundreds of servers running multiple applications. This requires a different strategy, which we’ll get into below.
Single Server Failover
When it comes to DR, the goal should always be to minimize collateral damage and failover only the impacted services. Would you initiate a full site failover in response to ransomware on one server? Most folks are not only looking to failover specific applications, but they want the ability to failover single servers to DR. This makes sense. The idea is to be as minimally disruptive as possible. However, the single server failover scenario is also the hardest to support and manage, which is why most people will settle for the middle ground of per-application failover/failback, and for good reason, too. With a cleanly subnetted network, dynamic routing, and data movers or DRaaS services, this is automated and very straightforward.
However, if you still want to support a single server failover, it is doable, but as you’d expect, it all comes down to the network.
The challenge with this requirement is that other servers within the same “application” bundle at the production site may sit within the same subnet as the server you wish to failover. Those servers won’t know their peer is now somewhere else. They’ll send out ARP requests on the local LAN to look for that server, and will never receive a reply -- there’s no routing to help us here. So, from a network standpoint, we need to trick production into thinking the server is still local. We’ll dive deeper into how we can accomplish that in the next section.
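To make the problem concrete, here's a tiny sketch using the scapy packet library (a third-party install, and it needs root privileges; the server IP is hypothetical) showing what the production-side peers experience:

```python
# Requires scapy (pip install scapy). 192.168.1.42 is a hypothetical
# server that has been failed over to the DR site.
from scapy.all import ARP, Ether, srp

# Peers on the production subnet resolve the server with a broadcast ARP
# "who-has". Broadcasts never cross a router, so no route can answer.
probe = Ether(dst="ff:ff:ff:ff:ff:ff") / ARP(pdst="192.168.1.42")
answered, unanswered = srp(probe, timeout=2, verbose=False)

if not answered:
    # No reply means the MAC is gone from the LAN. Something local must
    # answer on the server's behalf: proxy ARP, NAT, or a layer 2 stretch.
    print("No ARP reply -- local peers cannot reach the failed-over server")
```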
Testing Scenarios
The last “scenario” I want to mention related to Disaster Recovery is testing. Your DR network strategy needs to be flexible enough to allow application owners to test apps at the DR site in a “meaningful” way.
Some organizations will designate a dedicated testing network. Servers at the DR site will come up with the same local IPs as they had in production, and a default gateway will exist there to route. NAT rules will exist between the testing network and the internal/production server IPs. Then, users can test the applications on the test IPs. This may involve DNS entries pointing to the test IP (e.g. instead of app.site.com, use app-drtest.site.com). Users can also set static host entries on their desktops to point the actual DNS name (app.site.com) to the test IPs. This works very well and is non-disruptive.
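As a sanity check before handing the environment to testers, a small script along these lines can confirm the test DNS entries are wired up correctly. The hostnames come from the example above; both IPs are assumed placeholders:

```python
import socket

# Assumed placeholder addresses: the production VIP and the DR test NAT IP.
PROD_IP = "203.0.113.10"
TEST_IP = "198.51.100.10"

# The test name should land on the testing network, while the production
# name should be untouched by the DR exercise.
assert socket.gethostbyname("app-drtest.site.com") == TEST_IP
assert socket.gethostbyname("app.site.com") == PROD_IP
print("DNS test entries look correct -- safe to hand over for testing")
```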
We’ve also helped customers set up dedicated site-to-site or site-to-client VPN accounts so that application owners can simply connect to the DR site and test against the replica as needed without any other changes. Some customers even have designated laptops or offices only set to connect to the DR VPN for dedicated testing.
Another item to consider here is dependencies. Does your application speak to other internal or external systems in order for it to function properly? For example, your app may make an API call to pull or push data to an online SaaS tool or internal app inside another part of your environment. How should this be dealt with in DR?
This is something we need to be incredibly cautious about. If your app has the potential to push data to other systems, the simple process of DR testing could cause data loss or corruption if the DR systems are allowed unchecked access to those 3rd parties. It’s easy enough to fence off the testing network, but what about the application? Is it actually testable without access to the 3rd party system? One strategy is to present or mimic the 3rd party tool or endpoint at the DR site so that the application functions. Even better, you can ask the 3rd party platform or application owner to provide a 2nd endpoint for DR testing only, possibly with a recent data dump from production. At that point, you just need to route traffic destined for the 3rd party production site to the sandbox or DR environment.
To get more specific, if the 3rd party system is an asset on the local LAN at production, you may need to place a firewall at the DR site and perform some type of double-NAT configuration to mimic its local IP, then send the traffic off to a sandbox environment, wherever that may live.
If the 3rd party system is a publicly-routable address such as a SaaS tool, there may be multiple options:
The SaaS may provide you with sandbox credentials, which use the same hostname/IP but with alternative API keys or passwords. This will require you to perform the redirect to the sandbox environment within your application, which may or may not be possible.
The SaaS may provide the sandbox on an alternative hostname/IP, which you could again redirect in the application’s configuration.
Relying on someone to remember to change a configuration file to prevent data corruption as part of DR testing can be dangerous. It may be possible to automate the process, too, but again it's not foolproof.
As an alternative, you could statically route traffic destined for the SaaS’s main hostname/IP at the DR site to a local proxy server, which would in turn utilize the alternative sandbox address or credentials on the backend. Although slightly complex, this method is great because it accomplishes multiple goals (see the sketch after this list):
It ensures no traffic from the DR site to the 3rd party SaaS during testing, preventing data loss,
It allows the application to function properly so that it can be tested,
It utilizes the sandbox environment with no modifications to the application itself
And it provides a straightforward way to disable this during a live DR event by simply removing the static route
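Here's a minimal sketch of such a proxy in Python, assuming a hypothetical SaaS sandbox endpoint and API key; TLS termination and error handling are omitted for brevity:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import urllib.request

SANDBOX_URL = "https://sandbox.saas.example"  # assumed sandbox endpoint
SANDBOX_KEY = "sandbox-api-key-here"          # assumed test credentials

class SandboxProxy(BaseHTTPRequestHandler):
    """Receives requests the app meant for the production SaaS and
    re-issues them against the sandbox with test credentials, so the
    replica application needs no configuration changes at all."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        req = urllib.request.Request(SANDBOX_URL + self.path, data=body,
                                     method="POST")
        req.add_header("Authorization", "Bearer " + SANDBOX_KEY)
        with urllib.request.urlopen(req) as resp:
            self.send_response(resp.status)
            self.end_headers()
            self.wfile.write(resp.read())

# A static route (or DNS override) at the DR site points the SaaS's
# address here. During a live DR event, remove the route and the app
# reaches the real SaaS directly; nothing else changes.
HTTPServer(("0.0.0.0", 8080), SandboxProxy).serve_forever()
```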
We’ll dive into a more detailed example of a complex testing scenario later on.
Internal Network Strategies
If your users consume applications via an internal or corporate network and you’re looking to protect against the full site failover scenario, integrations can be straightforward:
Full Site Failover
If the corporate network spans multiple locations, connected via a WAN, such as an MPLS, E-LINE, VPLS, VPN mesh, SD-WAN, or point-to-point links, and you’re running a dynamic routing protocol such as RIP, BGP, OSPF, EIGRP, and so on, simply advertise the production subnets from the DR site during failover events.
For static configurations, such as site-to-site VPNs, there are a few options:
Configure the DR network with the same internal IPs as the production site, then connect to the DR site via VPN on-demand for actual failover or testing (site-to-site and site-to-client)
Leave the VPN connected at all times and statically route the production networks over the VPN during failover events
Partial Site Failover
For partial site failover scenarios across an internal WAN, we can employ the same strategies as above if each of our applications lives within its own specific subnet.
Dynamic routing: we advertise the application’s more specific subnet from the DR site, knowing that the most specific (longest-prefix) route wins
Static routing: we statically route the more specific subnet to the DR site. This can be accomplished in multiple ways depending upon your exact configuration. However, if your network contains multiple layers of layer 3 devices, specific endpoints will need to know how to reach the DR site (and back), meaning you may require multiple static routes in different places.
Partial Site & Single Server Failover
In the scenario where an application does not sit within its own distinct subnet, or if you want to failover a smaller sub-set of servers, you need to look at alternative configurations. This brings us to the dreaded layer 2 stretch. A layer 2 stretch means extending a local LAN from one physical site to another. This extended layer 2 network allows devices to utilize the same local IP addresses on both sides. While this is a conceptually simple configuration, it brings a tremendous number of pitfalls and challenges related to the proper management of traditional spanning-tree based layer 2 networks.
As such, utilizing layer 3 site-to-site and multi-site connectivity, as mentioned in the above Full and Partial site failover scenarios, is always preferred. However, when those are simply not available, we can look at the following options.
Pure Layer 2 Stretch
The first option is an actual layer 2 stretch. This works with a 2-site configuration between production and disaster recovery, typically over a point-to-point Ethernet circuit such as an E-Line or Ethernet Virtual Private Line (EVPL), or over a multi-point extended LAN service such as an E-LAN or VPLS.
One may ask, “what good is it to stretch the network between production and DR in a 2-site configuration? If the production site goes down, there’s no way to consume the applications at DR.” While this is true, many organizations consider this configuration to protect against less dramatic events such as ransomware, single server failures, application problems, and so on. Their users may be physically at the production site or connecting to production via VPN, and want to protect against ALL other failures besides the production site & networking going down. They may then also have backup/direct VPN connectivity available at the DR site (see figure).
Again, as long as the actual network is up, the organization can recover from all sorts of disasters, except the full site failover. And for some, changing how users consume applications is not an easy thing, so this does provide many levels of assurance and makes sense for a simple phase 1 deployment.
However, simply “stretching” your LAN is a horrible idea. In fact, doing so without a ton of very specific security controls might actually expose your production environment to more of the downtime you were hoping to avoid with such an exercise.
There are better ways to “stretch” a LAN these days, including VXLAN, VMware NSX Edge, Layer 2 VPN, Double-NAT, and SD-WAN.
VXLAN
VXLAN, or Virtual Extensible LAN, is a tunneling protocol that allows us to bridge layer 2 LANs (the overlay) by running them on top of a layer 3 network (the underlay). This configuration eliminates the risks associated with L2 domains spanning multiple logical switches, along with the accompanying spanning-tree problems.
With VXLANs, we can interconnect production and DR sites, and intermingle the same IPs at both ends without having to worry about the associated risks. VXLANs will be the best way to support the partial site and single server failover scenarios. However, it requires both the production and disaster recovery sites to be running network devices that support this technology.
Alternatively, the VXLAN tunneling can take place at a higher level, specifically in software. For example, VMware’s NSX supports VXLAN and will allow you to bridge virtual networks between 2 vCenter clusters in different locations. In a scenario where you’re looking to failover a single VM or multiple VMs from production to DR, and assuming your production network, compute, and storage are all available, VXLAN tunneling via VMware NSX is a great option. You can configure it entirely in the virtual layer and completely avoid dealing with your physical network gear.
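For illustration, here's roughly what a point-to-point VXLAN bridge looks like on a Linux host using iproute2 (wrapped in Python here; the VTEP addresses, VNI, and interface names are all assumptions):

```python
import subprocess

# Assumed underlay addresses: each site's routable VTEP endpoint.
LOCAL_VTEP, REMOTE_VTEP = "10.0.0.1", "10.0.1.1"

def sh(cmd):
    subprocess.run(cmd.split(), check=True)

# Create the VXLAN overlay (VNI 100) riding the routed layer 3 underlay.
sh(f"ip link add vxlan100 type vxlan id 100 local {LOCAL_VTEP} "
   f"remote {REMOTE_VTEP} dstport 4789")

# Bridge the overlay onto the local server-facing NIC (name assumed) so
# replicas keep their production IPs; no spanning tree crosses the WAN.
sh("ip link add br-dr type bridge")
sh("ip link set vxlan100 master br-dr")
sh("ip link set eth1 master br-dr")
for link in ("vxlan100", "eth1", "br-dr"):
    sh(f"ip link set {link} up")
```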
VMware NSX Edge
If you happen to be running VMware and NSX is not running at production, there is still hope. VMware’s NSX Edge appliance will allow you to establish a layer 2 bridge over an IPSEC tunnel. As long as the DR site or your service provider is running NSX on their side, it will work, and it is included with existing VMware licenses. This is an extremely powerful and safe way to provide a partial failover capability for applications, multi-server, and single server failover/failback without having to touch your physical network layer at all.
Layer 2 VPN
The NSX Edge appliance is not the only software routing device that supports L2 over IPSEC. In fact, this configuration is possible with other network appliance and firewall vendors. Specifically, RFC 1701 (GRE Tunnels) has been a thing for a very long time, though it comes with no native encryption and is often relegated to a few very specific use-cases.
However, if your network vendor supports GRE on top of tunneling technologies that provide encryption (e.g. L2TP/IPsec), you’re in luck. One such implementation is MikroTik’s Ethernet-over-IP (EoIP), which utilizes the GRE standard and allows you to bridge LANs on top of encrypted tunnels.
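On Linux, the comparable construct is a gretap interface, which carries Ethernet frames inside GRE. A rough sketch follows; the tunnel endpoints are assumed, and since GRE itself provides no encryption, in practice this would ride inside an already-established IPsec tunnel:

```python
import subprocess

# Assumed tunnel endpoints; in practice these would be the inner addresses
# of an existing IPsec tunnel, since GRE carries no encryption of its own.
LOCAL, REMOTE = "172.16.0.1", "172.16.0.2"

def sh(cmd):
    subprocess.run(cmd.split(), check=True)

# gretap encapsulates full Ethernet frames inside GRE (the same idea
# MikroTik's EoIP builds on), giving you a bridgeable layer 2 interface.
sh(f"ip link add gretap-dr type gretap local {LOCAL} remote {REMOTE}")
sh("ip link set gretap-dr up")
# From here, add gretap-dr to a bridge exactly as in the VXLAN example.
```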
Double-NAT
The last option to bridge LANs on both ends is to utilize a double-NAT configuration, where a firewall device sits in both LANs (production and DR). Assuming a single server failover event, the firewall at the production site will mimic the failed server’s IP and translate it to an address on a transfer network that connects to the firewall at the DR site. The firewall at the DR site will do the inverse, translating it back to its production IP. This configuration comes with the burden of being very hard to maintain long term. The amount of configuration changes required on both the production and DR firewalls for every single IP is onerous, and troubleshooting issues is similarly complicated and time-consuming.
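To give a flavor of why this gets onerous, here's roughly what one server's worth of rules might look like on Linux firewalls using iptables. All addresses are assumed, and the proxy ARP setup and return-path SNAT rules are omitted:

```python
import subprocess

# Assumed addressing: the replica's production IP and its twin on the
# transfer network that links the two firewalls.
PROD_IP = "192.168.1.42"     # address the production LAN still expects
TRANSIT_IP = "10.200.0.42"   # matching address on the transfer network

def sh(cmd):
    subprocess.run(cmd.split(), check=True)

# Production-side firewall: claim the server's traffic locally and push
# it toward DR on the transfer network (first half of the double NAT).
sh(f"iptables -t nat -A PREROUTING -d {PROD_IP} "
   f"-j DNAT --to-destination {TRANSIT_IP}")

# DR-side firewall: translate back so the replica sees its original IP
# (second half). This pair covers a single server -- multiply it by every
# failed-over IP to see why the approach is hard to maintain long term.
sh(f"iptables -t nat -A PREROUTING -d {TRANSIT_IP} "
   f"-j DNAT --to-destination {PROD_IP}")
```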
Veeam NEA
As discussed, configuring any sort of DR network connectivity can be time-consuming and challenging, even for the simplest deployments. Fortunately, some replication software tools have taken it upon themselves to build basic networking capabilities into their products. One example is Veeam’s NEA (Network Extension Appliance). The NEA is a small, Linux-based VM which is deployed automatically by Veeam on the production and DR infrastructure, and which is kept shut down during normal operations. When a failover occurs, Veeam turns on the appliances as part of its failover orchestration. The NEA then automates the cumbersome double-NAT configurations to extend layer 2 between the production and DR networks in a fully automated fashion. In addition to the layer 2 stretch for partial failover events, it also supports full site failovers, mimicking the default gateway at the DR site to provide basic north/south network connectivity. For organizations that use Veeam as their data mover and have simple DR networking requirements, this may be a very simple and fast option to implement. Unfortunately, the NEA does not provide more advanced network integration capabilities, so it may not be a good fit for more advanced use-cases.
Read more about the Veeam NEA here: https://helpcenter.veeam.com/docs/backup/cloud/cloud_network_extension_appliance.html?ver=100
SD-WAN
Similar to the above references to VPN configurations, SD-WAN deployments such as Cisco Meraki MX, VeloCloud, Viptela, Silver Peak, and others, provide capabilities to route corporate WAN traffic between physical locations in a straightforward manner. This is typically accomplished via intuitive user interfaces that handle the complex routing behind the scenes. The advantage of these systems is that they can provide a method to build policies or rules related to full site failovers, application-specific failovers, and even per-server failovers. Administrators can pre-define and pre-build these rules, knowing which they would enable during a failover event or for testing, without the need to copy/paste a long list of rules into terminal windows.
In conclusion, most organizations will deploy a mix of full site & partial site failover configurations so that they can protect against multiple types of failure. This adds a new dimension to the overall strategy, ensuring that your full site and partial site configurations don’t conflict or make each other overly complicated. This doesn’t mean you must map out your entire architecture before attempting to achieve a phase 1 deployment. However, understanding which strategies may or may not work long term will be helpful as you start implementing your multiple phases.
Public Internet Strategies
In addition to routing traffic across corporate WANs, many administrators must consider how their public internet-facing applications will function when failed over to a DR site. Below are strategies to consider to reduce the RTO related to these applications.
DNS
DNS is the easiest option to consider when re-routing traffic from a production site to a DR site. Some application protocols are designed to allow multiple DNS entries to be pre-defined for production/backup systems and will automatically route users to the DR site if the primary is unavailable. This is the case for SMTP (e-mail services) via the use of the MX record’s preference number, where you can define which server to use first, second, and so on. Likewise, SRV records, used for Voice-over-IP (VoIP) applications, support a similar configuration. In both cases, traffic will route to the DR site easily without any changes, and route back to production once it’s online.
Unfortunately, for almost all other types of DNS records, such as those used for website addresses (A and AAAA records), this type of primary/backup DNS configuration is not supported. However, one common strategy is to lower the DNS cache time (TTL) for a record in advance, from a typical default of several hours to just a few minutes. During a DR event, the DNS record can be changed to point to the DR site, and the new address will be utilized once caches expire, typically after a few minutes. Many enterprise DNS hosting services (such as NS1, DnsMadeEasy, and many others) build this capability into their service offerings and can automatically change the DNS response to the failover IP based upon a multitude of metrics that the site administrator can define.
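As a quick illustration, here's how you might inspect both behaviors with the dnspython library (a third-party install; the domain and record values are placeholders):

```python
# Requires dnspython (pip install dnspython).
import dns.resolver

# MX records carry a preference value, so mail automatically falls back
# to the DR server (higher number) when production is unreachable.
for mx in sorted(dns.resolver.resolve("example.com", "MX"),
                 key=lambda r: r.preference):
    print(mx.preference, mx.exchange)
# e.g.  10 mail.example.com.     <- production, tried first
#       20 mail-dr.example.com.  <- DR, used only on failure

# A/AAAA records have no built-in fallback, so the TTL matters: it bounds
# how long clients keep the stale answer after you repoint the record.
answer = dns.resolver.resolve("example.com", "A")
print("TTL:", answer.rrset.ttl, "seconds until caches expire")
```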
Proxy Services
Another strategy is to route all production traffic through a 3rd party proxy service such as Cloudflare, Fastly, or Akamai. These platforms are typically used to protect websites and applications from security vulnerabilities, DDoS attacks, and other malicious activity (WAF capabilities). They also improve application performance via static and dynamic content caching (CDN) and edge compute delivery. Similar to the enterprise DNS providers above, proxy services allow users to define primary and backup sites, automatically rerouting users to the DR site and back based on granular metrics. Keep in mind, you must utilize these services at all times, making them an “in-line” solution. There are many pros and cons to such proxy services; however, an in-depth discussion is out of scope for this document.
GSLB
Another strategy is to utilize load balancer or gateway appliances (virtual or physical) that support a global server load balancing (GSLB) configuration, such as F5 or Citrix Netscaler. These devices are placed at both the production and DR sites and can automatically reroute users to the DR site if production is down. Internally, these appliances make use of some of the same DNS techniques mentioned above, but in an automated fashion that is transparent to the administrator. If you already need enterprise-level load balancing in front of your applications anyway, this may be a viable option.
Public IP Announcements
Some organizations own their own public internet address space, assigned via a Regional Internet Registry (RIR), such as ARIN. These organizations typically advertise their IP space via BGP to their upstream providers. In the case of a DR event, the same IPs may be advertised from the DR site easily, without the need for an IP re-number, DNS changes, or anything else. While this is a great option for the full site failover event, it provides no options for application-specific or partial failovers. This is because the smallest block you can failover is a /24 network (the old class C, 256 addresses), as most providers will not accept more specific announcements. Typically, organizations will have many applications contained within such a network, so this type of capability is generally not as useful as people think and only provides protection for worst-case scenarios. It is generally advisable to pre-configure and pre-plan for this type of configuration if, in fact, the organization does own its own space. However, these organizations should still plan to use other methods for more granular failover control.
3rd Party Connectivity Strategies
One aspect of DR networking not covered thus far is what to do with private connectivity to 3rd party networks. Many enterprise organizations maintain private connectivity between their production sites to multiple 3rd parties. For example, healthcare providers typically connect privately to their EMR vendors, banks to their payment processors, and financial institutions to market data feeds. Additionally, private connectivity to public clouds is also becoming more common. In fact, today, private connectivity is beginning to trump public internet for critical, secure, and low-latency enterprise interconnection.
The typical strategy for dealing with these private connections in the context of DR is simple: duplicate all required private connectivity at the DR site. This is a straightforward strategy, and up until now, there really haven’t been any good alternatives. However, you’re paying for double the circuits, you have double the number of routers and links to maintain, and these backup circuits sit unused until they’re needed -- so unless you’re testing often, you may not realize they no longer function until it’s too late.
Typical 3rd party connectivity strategy:
Many administrators think this configuration is simply a necessary evil of DR, however, there is hope. We previously learned in this document how SD-WAN can help simplify internal WAN failover traffic. Software-defined networking can also provide a better capability for 3rd party networking as well. Specifically, we can utilize Network-as-a-Service (NaaS) platforms, such as Equinix ECX, PacketFabric, Megaport, or Console to land 3rd party connections on. Once these circuits are terminated onto the NaaS platforms, they can be treated as virtual, or software-defined objects, which can be easily adjusted or modified in real-time. With the use of NaaS for DR, the diagram above can look like this instead:
In the example above, instead of terminating the 3rd party connections at the production site, they’ll connect to the NaaS platform. The production and DR sites will both connect to the NaaS platform as well; its on-ramps are typically located in popular Meet-Me-Rooms or reachable via last-mile connectivity. During a DR event, the single/active connections to a 3rd party can be re-routed from the production site to the DR site via a simple GUI or API call. In some instances, you may prefer a separate physical link directly to production and DR; however, for most others, this method reduces complexity and cost without sacrificing reliability.
Replication Traffic Network
Another area where NaaS platforms can be helpful is the network connection that carries replication traffic between the production and DR sites. This is because we can easily dial the size of the virtual private connections flowing over a NaaS platform up and down. For example, you can connect production and DR at 10Gbps to ensure your initial data copy is fast, then dial it down to 1Gbps to handle change data only. Suffer a failure where DR gets used and then need to replicate back to production fast? Dial it back up to 10Gbps for the week! This provides amazing flexibility and substantial cost savings.
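In API terms, the workflow might look something like the sketch below. This uses a purely hypothetical endpoint -- each NaaS provider has its own API and authentication -- but it captures the idea of resizing the circuit on demand:

```python
# Hypothetical NaaS API sketch: real platforms (Equinix, PacketFabric,
# Megaport, etc.) each have their own endpoints; the URL and payload
# here are invented for illustration only.
import json
import urllib.request

API = "https://api.naas.example/v1/circuits/prod-to-dr"  # assumed URL

def set_replication_bandwidth(mbps):
    req = urllib.request.Request(
        API,
        data=json.dumps({"bandwidth_mbps": mbps}).encode(),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

set_replication_bandwidth(10000)  # initial seed or post-DR resync: 10 Gbps
set_replication_bandwidth(1000)   # steady state, change data only: 1 Gbps
```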
This may not seem terribly important, however, most administrators only think about these replication links in the context of how much capacity they utilize during “normal” times when they’re only being used to replicate change data. After a DR event occurs and a great deal of data must be synchronized back from the DR site to production, the link may be maxed for days. This will prevent you from switching back to production even once it’s actually ready, which could have a massive impact on the business. The false sense of security provided by the low utilization of the replication network is dangerous, but using a NaaS platform for the replication traffic can provide long term assurances in this context.
DIY or Managed?
As you can see, DR networking can be straightforward in some cases and very complex in others. Unfortunately, we don’t see it simplifying any time soon. As the definition of an organization’s “production” environment shifts to a true Hybrid IT model, with infrastructure spread across physical data centers, hosted private clouds, public clouds, and SaaS platforms, organizations’ networks and security requirements will only increase in complexity. Because of this, many organizations are shifting from self-managed DR infrastructure to consuming their Disaster Recovery infrastructure “as-a-Service”.
In the case of fully managed DRaaS, customers are not purchasing infrastructure (compute, storage, network) so much as they are contracting for an SLA on the RTO and RPO of their critical applications. Not only is all of the DR infrastructure included, but often overlooked is the inclusion of DR networking strategy planning, design, implementation, and management. This is inclusive of all of the various strategies set forth in this blog, as well as others that are more application and use-case specific. Many of our customers work with us for this expertise, and because they simply don’t have the appetite or experience to design and build it, or the will to be accountable for it long term. Having worked with so many customers across multiple industries, with varying applications and use-cases, has also given us a wide base of context to draw upon in future implementations.