Ensuring business continuity: Disaster recovery plans with Azure

Trying to come up with a DR plan for your organization? Here's how you can use Azure to ensure the continuation of your business if disaster strikes.

By Neil Hitchins

Jun 20, 2024 • 8 Minute Read



Having a well-planned and executed disaster recovery plan is critical to your business. But like any solid plan, there’s more to creating it than filling out an online template.

In this article, we’ll dive into disaster recovery strategies that use Microsoft Azure. Keep in mind, this isn’t a definitive guide to disaster recovery. There are multiple ways to ensure business continuity, but using Azure is a good place to start.

Setting the stage: A disaster recovery scenario

You’re sitting at your desk working away when all of a sudden, voices get louder and more frantic. Phones start ringing and people begin to gather. The IT manager walks in looking concerned and pulls a few of you together to announce that you’ve just lost your entire production datacenter.

Losing an entire datacenter sounds a little extreme, but it does happen.

Having a well-planned and executed disaster recovery plan is critical to the continuation—and success—of your business.

So where do you start?

How to create a disaster recovery plan

Questions to ask before you create your disaster recovery plan

Before you fill out one of the many disaster recovery plan templates online, take a step back and ask yourself two key questions:

What do you need to protect?
How critical is it to business operations?

Once you’ve answered these questions, you can dig a little deeper.

Disaster recovery plan example

Now we’ll walk through an example disaster recovery plan using Azure.

Let's say you identify a key application the business needs to succeed. The application is built around a multi-tier approach. If we focus on virtual machines (VMs), this could look similar to the diagram below.

This example application runs on Microsoft Windows and has various components we need to protect—and yes, you can spot the single point of failure (SPOF) in this design.

The web tier is currently made up of three virtual machines with Internet Information Services (IIS) installed.

The application tier is running a customized piece of code that runs queries against the Microsoft SQL Server database and formats the results to present back to the web tier.

The database tier is running on a single Microsoft SQL Server database.

The identity service is running on two virtual machines running Active Directory Domain Services (AD DS).

The diagram does not show anything below the virtual machine layer and load balancers, but the virtual machines run on several bare metal servers running VMware hypervisors.

To simplify things for this example, the application communicates only among its own tiers and with the identity service to serve customers on the internet. Apart from the database, which is a single entity, everything has been designed with high availability in mind.

Information to include with your disaster recovery plan

Now that you’ve identified a key application and how it works, you should document this information along with the answers to these questions:

Which teams use the application and from where, and what are the inbound/outbound data flows?
Who is the business owner?
How long can the business tolerate application downtime?
How much data can the business realistically lose?
What risks could cause an outage?
What’s the reputational impact if the application is offline?

Which teams use the application and from where, and what are the inbound/outbound data flows?

This can be a mixture of both internal and external users and affects things such as DNS configuration, identity, and networking configuration. Make sure you understand the mix so you can also understand and document the path of communication.

Who is the business owner?

This is really important. You’ll need to know who to contact if there’s a problem with the application, and who must authorize or be notified of the decision to invoke disaster recovery.

How long can the business tolerate application downtime?

The answer sets the recovery time objective (RTO) and drives the type of disaster recovery strategy you need in place for the workload.

How much data can the business realistically lose?

The answer to this question is another key input when planning which disaster recovery strategy to put in place for the workload. It sets your recovery point objective (RPO), which in turn dictates how often data must be backed up or replicated and how consistent those copies must be.

Why do RTO and RPO matter? Without them, it’s hard to understand how much value the business truly puts on the application, and how much money it’s willing to spend to ensure recoverability. Not every approach can deliver the same RTO and RPO, so these targets will rule out certain disaster recovery strategies.
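To make the two objectives concrete, here is a minimal Python sketch that measures one outage against them. The timestamps, the 30-minute RPO, and the 4-hour RTO are made-up values for illustration, and the function names are my own, not part of any Azure tooling.

```python
from datetime import datetime, timedelta


def measure_recovery(last_recovery_point, outage_start, service_restored):
    """Return (data_loss_window, downtime) for a single outage."""
    # Anything written after the last recovery point is lost.
    data_loss_window = outage_start - last_recovery_point
    # Downtime runs from the start of the outage until service is back.
    downtime = service_restored - outage_start
    return data_loss_window, downtime


def meets_objectives(data_loss_window, downtime, rpo, rto):
    """True when both the RPO and the RTO were met."""
    return data_loss_window <= rpo and downtime <= rto


# Hypothetical timeline for one outage.
last_point = datetime(2024, 6, 20, 9, 45)
outage = datetime(2024, 6, 20, 10, 0)
restored = datetime(2024, 6, 20, 13, 30)

loss, down = measure_recovery(last_point, outage, restored)
ok = meets_objectives(loss, down, rpo=timedelta(minutes=30), rto=timedelta(hours=4))
print(f"data loss window: {loss}, downtime: {down}, objectives met: {ok}")
```

Tighten the RPO to 10 minutes and the same outage fails the check, which is the kind of result that rules a strategy out.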

What risks could cause an outage?

This can be as simple as a rogue database query that prevents the application layer from running its own query against the database to send to the web tier. Or it can be as complex and crippling as your datacenter engulfed in flames. There are various methods for understanding application risks as part of your DR planning, but the two main approaches are quantitative risk analysis, which puts numbers on likelihood and cost, and qualitative risk analysis, which ranks risks by judged likelihood and severity.
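One common way to do the quantitative side is annualized loss expectancy (ALE): single loss expectancy (SLE) is asset value times exposure factor, and ALE is SLE times the annual rate of occurrence (ARO). The sketch below applies it to the two risks mentioned above. Every figure in it is invented for illustration, so treat it as a worked example of the arithmetic rather than real estimates.

```python
def single_loss_expectancy(asset_value, exposure_factor):
    """SLE: expected loss from one occurrence of the risk."""
    return asset_value * exposure_factor


def annualized_loss_expectancy(sle, annual_rate_of_occurrence):
    """ALE: expected loss per year from this risk."""
    return sle * annual_rate_of_occurrence


# Hypothetical inputs: (asset value, exposure factor, occurrences per year).
RISKS = {
    "rogue database query": (500_000, 0.02, 4.0),
    "datacenter fire": (500_000, 1.0, 0.05),
}

ale_by_risk = {
    name: annualized_loss_expectancy(single_loss_expectancy(value, ef), aro)
    for name, (value, ef, aro) in RISKS.items()
}

# Print the risks from highest to lowest expected annual loss.
for name, ale in sorted(ale_by_risk.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: expected annual loss of about {ale:,.0f}")
```

With these made-up numbers, the frequent minor outage carries a higher expected annual loss than the rare catastrophic one, which is why it helps to run the numbers instead of planning only for the dramatic scenario.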

What’s the reputational impact if the application is offline?

Depending on the application and its usage pattern, the impact could be anywhere from mild to severe. For example, an application may be critical to the business only one week every quarter, or it could be critical to day-to-day operations.

It’s important to understand the triggers for invoking disaster recovery, and the tradeoff between invoking it (with an eventual failback) and spending extra time troubleshooting the outage in place. This information will help decision-makers choose the best option.
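If you want to keep these answers somewhere more structured than a document, a small record per application is enough. The sketch below is one possible shape, assuming a Python dataclass. The field names and the sample values are hypothetical, and the helper simply lists which questions still have no answer.

```python
from dataclasses import dataclass, field, fields
from datetime import timedelta
from typing import Optional


@dataclass
class ApplicationRecord:
    """Answers to the planning questions for one application."""
    name: str
    business_owner: Optional[str] = None
    users_and_data_flows: Optional[str] = None
    rto: Optional[timedelta] = None
    rpo: Optional[timedelta] = None
    outage_risks: list = field(default_factory=list)
    reputational_impact: Optional[str] = None


def unanswered(record):
    """Return the names of fields that are still empty."""
    missing = []
    for f in fields(record):
        value = getattr(record, f.name)
        if value is None or value == []:
            missing.append(f.name)
    return missing


# Hypothetical, partially completed record.
record = ApplicationRecord(
    name="order portal",
    business_owner="Head of Sales",
    rto=timedelta(hours=4),
)
print(unanswered(record))
```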

Identifying the application recovery location

The next step in creating your disaster recovery strategy is to identify the application recovery location. For this application, the business has the option to recover it to another datacenter they rent or own or to an office computer room. This could be quite costly if space isn’t already available.

Another option is to use the cloud as the recovery location. I’m not going to say this is the best approach since it’s a decision driven by other factors (e.g., the company’s business technology strategy). But it is one that can work, and it opens up opportunities for hosting application workloads in the cloud going forward and for building new applications that use managed cloud services, such as platform as a service (PaaS) or software as a service (SaaS).

How could we do this for our application? Let’s assume the team looking to put the recovery plan in place has decided not to use a company-owned or leased datacenter or computer room and instead will use Microsoft Azure because it’s part of their wider business technology strategy.

Disaster recovery in Microsoft Azure

There’s a fairly comprehensive on-ramp to using cloud services, which each cloud provider covers in their cloud adoption frameworks, so I won’t cover this here. Instead, let's look at the options for recovery in Microsoft Azure.

Microsoft Azure disaster recovery options

The business can choose from three main options if the application has failed in their production environment:

Back up data to Azure, or replicate it to Azure, so it’s ready for recovery. On invocation, build virtual machines and restore the data.
Replicate all data, including virtual machine disks. On invocation, virtual machines are created and failover configuration is completed automatically.
Create a complete replica in Azure and replicate all data from production to recovery. Use DNS to reroute traffic for recovery.

As you’d expect, the quicker the recovery time, the more the solution costs to keep on standby.

In this example, the application is key to the business, but they don’t want to effectively double the cost for running the solution, so they’ve opted for the second approach.
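The reasoning behind that choice can be sketched as "pick the cheapest option that still meets the RTO and RPO." In the Python below, the three strategy names mirror the options above, but the RTO, RPO, and standby cost figures are placeholders I made up. Real values depend entirely on your workload and pricing.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Strategy:
    name: str
    rto: timedelta               # how quickly it can restore service
    rpo: timedelta               # how much data it may lose
    monthly_standby_cost: float  # cost while nothing has failed


# Placeholder figures for the three options described above.
STRATEGIES = [
    Strategy("backup and restore", timedelta(hours=24), timedelta(hours=12), 200.0),
    Strategy("replicate and fail over", timedelta(hours=2), timedelta(minutes=15), 900.0),
    Strategy("full replica", timedelta(minutes=10), timedelta(minutes=1), 5000.0),
]


def cheapest_meeting(strategies, rto, rpo):
    """Return the lowest-cost strategy that meets both objectives, or None."""
    candidates = [s for s in strategies if s.rto <= rto and s.rpo <= rpo]
    return min(candidates, key=lambda s: s.monthly_standby_cost, default=None)


choice = cheapest_meeting(STRATEGIES, rto=timedelta(hours=4), rpo=timedelta(hours=1))
print(choice.name if choice else "no strategy meets the objectives")
```

With a 4-hour RTO and a 1-hour RPO, these placeholder figures land on the second option, matching the decision in this example.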

Azure Site Recovery

So how can we achieve this? Azure provides the bulk of this capability in the form of Azure Site Recovery. Azure Site Recovery is a service specifically designed to help keep applications and workloads running in the event of a disaster. It can replicate physical and virtual machines from a source to a destination. In particular, it can replicate:

Azure VMs between regions
On-premises virtual machines
Azure Stack virtual machines
Physical servers

For our example, we’ll look at the option to replicate VMware virtual machines to Azure.

How to configure an application for disaster recovery

Looking at our diagram, we need to complete the following to configure this application for disaster recovery:

Complete the necessary steps from the Microsoft Cloud Adoption Framework for Azure
Set up a configuration server in the VMware environment
Create a Recovery Services vault in Azure
Configure virtual machines for replication

Azure Site Recovery plans

Plans are a great feature of Azure Site Recovery. They let you configure the recovery order of servers and run pre- and post-actions against them as they’re being recovered. These actions can be automated through Azure Automation or manual, where someone follows a series of steps.
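To show the idea of ordered groups with pre- and post-actions, here is a small Python model of a plan for our example application. It is a simulation only and does not call Azure Site Recovery. The machine names, the group order (identity, then database, application, and web), and the actions are my assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RecoveryGroup:
    name: str
    machines: list
    pre_actions: list = field(default_factory=list)
    post_actions: list = field(default_factory=list)


def run_plan(groups):
    """Walk the groups in order and return a log of what would happen."""
    log = []
    for group in groups:
        for action in group.pre_actions:
            log.append(f"pre: {action}")
        for machine in group.machines:
            log.append(f"start: {machine}")
        for action in group.post_actions:
            log.append(f"post: {action}")
    return log


# Hypothetical plan: each tier starts only after the tier it depends on.
plan = [
    RecoveryGroup("identity", ["ad-01", "ad-02"]),
    RecoveryGroup("database", ["sql-01"],
                  post_actions=["manual: confirm database consistency"]),
    RecoveryGroup("application", ["app-01"]),
    RecoveryGroup("web", ["web-01", "web-02", "web-03"],
                  post_actions=["automated: attach load balancer"]),
]

for step in run_plan(plan):
    print(step)
```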

Once this is complete, our picture should look something like this:

All that remains is the addition of load balancers and a method to reroute traffic to your recovery environment. You can use an Azure Resource Manager (ARM) template to create and configure the necessary load balancers. You can then deploy the template and attach the load balancers to the recovered virtual machines through an automated step in the Azure Site Recovery plan.

The last step is to reroute traffic between the two sites using Azure Traffic Manager, and then, voilà! You’re done.
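Azure Traffic Manager offers a priority routing method, where traffic goes to the healthy endpoint with the lowest priority number. The Python below simulates that decision so you can see the failover behavior. It is not the Traffic Manager API, and the endpoint names and health values are hypothetical.

```python
def pick_endpoint(endpoints, health):
    """Return the healthy endpoint with the lowest priority number, or None.

    endpoints: list of (name, priority) pairs, where 1 is most preferred.
    health: dict mapping endpoint name to True (healthy) or False.
    """
    healthy = [
        (priority, name)
        for name, priority in endpoints
        if health.get(name, False)
    ]
    if not healthy:
        return None
    return min(healthy)[1]


# Hypothetical endpoints: production is preferred, Azure is the fallback.
ENDPOINTS = [("production-datacenter", 1), ("azure-recovery", 2)]

normal = pick_endpoint(
    ENDPOINTS, {"production-datacenter": True, "azure-recovery": True}
)
disaster = pick_endpoint(
    ENDPOINTS, {"production-datacenter": False, "azure-recovery": True}
)
print(f"normal day: {normal}, after losing production: {disaster}")
```

In practice the recovery endpoint only reports healthy once the failover has brought the virtual machines up, so traffic follows the recovery plan rather than racing ahead of it.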

Further learning

Want to learn more about monitoring, backup, and recovery in Azure? Check out my latest course as part of the AZ-104 certification path, Microsoft Certified Azure Administrator Associate: Monitor and Maintain Azure Resources.

Neil H.

Neil is an IT professional working as a Cloud Author at Pluralsight. He’s been in IT for over 20 years, starting off as a Windows engineer and moving into virtualization and storage before going into architecture as an infrastructure architect. He finally moved into cloud architecture, helping companies adopt the cloud. Neil has a keen interest in cloud infrastructure and security and covers AWS and Azure cloud environments.
