Small-to-medium-sized enterprises that leverage the cloud often overlook the critical requirement of having a disaster recovery plan (DRP) in place for their data and production environments. A DRP is a collection of processes for quickly migrating production application traffic from one cloud region to another when a major catastrophe strikes the primary region. DRPs are vital for business continuity because when enterprise resources go offline, revenue can be lost and business reputation can suffer immensely.
Many enterprises mistakenly believe that because their resources are in the cloud, they will always be available. Not so. The cloud does not come with built-in disaster recovery (DR), because it’s always possible for an entire region’s datacenters to go offline simultaneously.
Furthermore, while cloud providers shoulder the responsibility for storing and protecting clients’ data and running their mission-critical applications, it’s up to the enterprise to minimize downtime should any datacenter or even a subset of cloud services go offline. Though rare, this does occur, as shown in reports by Amazon, Microsoft and Google.
Even for enterprises with a significant on-prem presence, the cloud can serve as a DR strategy in which the Domain Name System (DNS) redirects production traffic to a cloud-based DR site. In these cases, the cloud significantly lowers costs because the enterprise does not have to own and maintain two on-prem datacenters in different regions.
Designing and maintaining an enterprise DRP can be time-consuming and cumbersome, but its value to your business will be immeasurable if it prevents true disaster from happening. Here are six steps to consider to help get you started:
1. Create a Contingency Statement
Begin by formalizing a set of rules or guidelines that authorizes a DRP to be developed and implemented in your enterprise. This is the mission statement that defines the boundaries and requirements of the DRP. It can be a reflection of your enterprise service-level agreement (SLA) that states that within a certain amount of time, mission-critical components will be redundant to a certain level.
2. Conduct a Detailed Business Impact Analysis (BIA)
The BIA identifies and prioritizes your mission-critical IT applications and components. It should be a collaborative effort between the infrastructure, web, and product management teams to document these components in a tiered manner.
Determine the order of application and data store importance with the following classifications:
- Absolutely mission-critical: The major revenue generators, which can tolerate only minimal downtime, measured in minutes or hours.
- Semi-important applications or components: Minor revenue generators with larger acceptable downtimes.
- Low-tier applications or components: Little to no revenue-generating impact. These might have a downtime of several hours to days with little or no impact on the mission-critical applications.
Each tier should have its own SLA and detail on potential downtime losses and how the risks will affect business operations and growth. Emphasis should be placed on two key elements:
- Recovery Time Objective (RTO): The maximum acceptable time that your application can be offline.
- Recovery Point Objective (RPO): The maximum period, counted back from a major incident, over which data from an IT service might be lost; i.e., how much recent data an application or data store can afford to lose, expressed as time.
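To make the two objectives concrete, here is a minimal Python sketch of how per-tier targets might be recorded and checked. The tier names follow the classifications above, but the `TierObjective` class, the helper functions, and every number are hypothetical placeholders; real values come out of the BIA discussions.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class TierObjective:
    """Recovery targets for one BIA tier (all values are illustrative)."""
    name: str
    rto: timedelta  # maximum acceptable time offline
    rpo: timedelta  # maximum acceptable window of lost data


# Hypothetical targets, not recommendations.
TIERS = [
    TierObjective("mission-critical", rto=timedelta(minutes=30), rpo=timedelta(minutes=5)),
    TierObjective("semi-important", rto=timedelta(hours=8), rpo=timedelta(hours=1)),
    TierObjective("low-tier", rto=timedelta(days=2), rpo=timedelta(hours=24)),
]


def meets_rpo(tier: TierObjective, backup_interval: timedelta) -> bool:
    """Worst-case data loss is the time since the last backup or replication,
    so the interval between backups must not exceed the RPO."""
    return backup_interval <= tier.rpo


def meets_rto(tier: TierObjective, measured_recovery: timedelta) -> bool:
    """A recovery time measured in a drill must not exceed the RTO."""
    return measured_recovery <= tier.rto
```

Under these assumed numbers, an hourly backup would satisfy the semi-important tier's RPO but not the mission-critical tier's, which would need replication at least every five minutes.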
Personal data needs to be included in this discussion as well, in terms of the value your enterprise places on maintaining and protecting sensitive customer data.
The collaborative effort on the BIA comes from business owners articulating the biggest revenue losses, application owners illustrating how applications would behave during a shutdown, and operations and infrastructure team members who would be responsible for enacting the DRP. In the end, you should have a solid outline of how to implement the DR strategy for each tier and, when a region is lost, of the steps needed to bring a site back online with as little business loss as possible. All of these conclusions will be part of the DRP documentation.
3. Draft the Contingency Plan
The contingency plan identifies “who does what,” distinctly naming those responsible for enacting the various DR procedures.
4. Outline the Control Measures
This is the step-by-step process of the DR procedures, consisting of three types of control measures:
- Preventive measures: To identify and reduce risk should a disaster occur (having a current backup and restore model, for example).
- Detective measures: To uncover unwanted events within the IT infrastructure (via antivirus and network monitoring software, for example) that could stand in the way of corrective measures.
- Corrective measures: To restore the system in a secondary environment following a disaster event (a precise series of steps ensuring systems are up and running within the RTO and RPO constraints). The secondary production environment can be a reduced or exact replica of the primary production environment.
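One way to check that a corrective procedure is realistic is to write its steps down with time estimates and compare the total against the RTO. The Python sketch below does that; the step names and durations in `RUNBOOK` are invented for illustration, and the sum assumes the steps run one after another rather than in parallel.

```python
from datetime import timedelta

# Hypothetical corrective steps with estimated durations (not measured values).
RUNBOOK = [
    ("Provision secondary environment from templates", timedelta(minutes=20)),
    ("Restore data store from latest backup", timedelta(minutes=45)),
    ("Run smoke tests against the secondary site", timedelta(minutes=10)),
    ("Point DNS at the secondary region", timedelta(minutes=5)),
]


def estimated_recovery_time(runbook) -> timedelta:
    """Sum the step estimates, assuming strictly sequential execution."""
    return sum((duration for _, duration in runbook), timedelta())


def fits_rto(runbook, rto: timedelta) -> bool:
    """True when the estimated end-to-end recovery fits inside the RTO."""
    return estimated_recovery_time(runbook) <= rto
```

With these placeholder estimates the runbook totals 80 minutes, so it would fit a two-hour RTO but not a one-hour one. That kind of gap is a signal to automate steps or move to a larger standby environment.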
For a DRP to be effective and attainable, the secondary production environment should be built with Infrastructure as Code (IaC) practices covering the network infrastructure and the application and data tiers. Amazon Web Services (AWS) uses CloudFormation templates and Microsoft Azure uses ARM (Azure Resource Manager) templates, both of which turn infrastructure into software that can be version controlled and backed up. Google Cloud offers comparable declarative tooling of its own.
The extent of the DRP appropriate for your enterprise will depend on your BIA. It might be one of the following:
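As a rough illustration of infrastructure becoming software, the Python sketch below builds a very small AWS CloudFormation template as a dictionary and renders it as JSON that could be committed to version control. The logical ID `DrBackupBucket` and the choice of a single versioned S3 bucket are assumptions made for the example; a real DR template would describe the network, application, and data tiers.

```python
import json


def build_template(bucket_logical_id: str = "DrBackupBucket") -> dict:
    """Return a minimal CloudFormation template as a plain dictionary.

    The logical ID is a hypothetical name chosen for this example.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative DR building block: a versioned S3 bucket for backups",
        "Resources": {
            bucket_logical_id: {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    # Versioning keeps earlier copies of overwritten objects.
                    "VersioningConfiguration": {"Status": "Enabled"},
                },
            }
        },
    }


def render(template: dict) -> str:
    """Serialize with stable key order so diffs in version control stay small."""
    return json.dumps(template, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(render(build_template()))
```

Because the template is plain text, the same file can be reviewed, versioned, and deployed to a second region, which is what makes the secondary environment reproducible.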
- Pilot light: A small implementation in another region that can be easily spun up to take full production traffic.
- Warm site: A larger implementation that is already running in the second region and is kept current through frequent replication between regions.
- Multi-site implementation: Both regions serve the same amount of traffic, and each region has sufficient resources that if one region goes offline, the other can take all the traffic seamlessly.
As organizations review the options for a given application, along with the cost and budget impacts, it is common for the DRP to be updated and changed over time.
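The choice among these options is usually driven by how quickly a tier must recover. The Python sketch below maps an RTO to one of the three options; the threshold values are purely hypothetical and would need to be set from your own BIA and budget, not taken from this example.

```python
from datetime import timedelta

# Hypothetical cut-offs; tune these to your own BIA and cost constraints.
MULTI_SITE_MAX_RTO = timedelta(minutes=15)
WARM_SITE_MAX_RTO = timedelta(hours=4)


def suggest_strategy(rto: timedelta) -> str:
    """Suggest a DR option from the RTO alone (cost is deliberately ignored)."""
    if rto <= MULTI_SITE_MAX_RTO:
        return "multi-site"
    if rto <= WARM_SITE_MAX_RTO:
        return "warm site"
    return "pilot light"
```

A fuller version would weigh cost as well, since the options that recover fastest are also the ones that keep the most infrastructure running in the second region.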
5. Implement Testing and Training
A DRP is wasted effort if it has never been exercised. Review, and whenever possible test, all the steps in the DRP quarterly or twice a year to confirm that the failover process works as documented. Senior management and every employee must be trained in their part of the DR procedures to ensure they explicitly understand how to execute the steps.
If your development environment is close in scope to your production environment, you can run the tests there. Keep in mind that the needs of a particular department in your organization may change over time, and regular testing can help to identify those changing needs. These changes should be taken into consideration after each testing process.
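A drill is most useful when it produces a number that can be compared with the RTO. Here is a minimal Python sketch of a timing harness; `run_drill` and its `failover` argument are hypothetical names, and `failover` stands in for whatever callable triggers your actual failover procedure in the test environment.

```python
import time
from datetime import timedelta
from typing import Callable


def run_drill(
    failover: Callable[[], None],
    rto: timedelta,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Time one failover attempt and report whether it stayed within the RTO.

    `clock` is injectable so the harness itself can be tested without waiting.
    """
    start = clock()
    failover()  # hypothetical hook: runs the documented failover steps
    elapsed = timedelta(seconds=clock() - start)
    return {"elapsed": elapsed, "within_rto": elapsed <= rto}
```

Recording the result of each drill alongside its date gives the maintenance step a history to work from, and shows whether recovery times are drifting as the system changes.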
6. Plan for Maintenance
The maintenance plan should be a living document, updated on a regular basis to stay in step with system enhancements. This document should also be updated any time regular testing is performed. Keep in mind that AWS and Azure regularly release new features that might impact your DRP and that might help automate steps that currently require hands-on attention.
In addition to the DRP, enterprises need to have a formal backup and restore model as part of their DR strategy. It’s surprising how many firms actually do not have a reliable model in place. The model should be documented with detail regarding the data that is backed up, the process to restore it when needed, and how often the process is tested.
A smart enterprise backs up its data at least every month. It’s wise to have regular backup and restore exercises because if you have a problem, it’s best to encounter it in one of these regular exercises rather than during a moment of disaster.
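The documented model described above can be as simple as one record per data set. The Python sketch below shows one possible shape, with a check that flags restore exercises that are overdue; the `BackupRecord` fields, the example data set, and the 90-day cadence are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class BackupRecord:
    """One entry in a hypothetical backup and restore catalog."""
    data_set: str                   # what is backed up
    restore_procedure: str          # where the restore steps are documented
    backup_interval: timedelta      # how often a backup is taken
    last_restore_test: date         # when a restore was last exercised
    restore_test_every: timedelta   # agreed cadence for restore exercises


def restore_test_overdue(record: BackupRecord, today: date) -> bool:
    """True when the last restore exercise is older than the agreed cadence."""
    return today - record.last_restore_test > record.restore_test_every


# Example entry; every value here is invented.
orders_db = BackupRecord(
    data_set="orders database",
    restore_procedure="runbook section 4.2",
    backup_interval=timedelta(days=1),
    last_restore_test=date(2024, 1, 1),
    restore_test_every=timedelta(days=90),
)
```

Running the check on a schedule turns "how often the process is tested" from a statement in a document into something that raises a flag when an exercise is skipped.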
An Essential Insurance Policy
Think of a DRP as a critical risk-mitigation insurance policy for your business, and as an essential part of your business continuity planning. If you don’t have a DRP in place, start by having conversations with stakeholders around the BIA and control measures. These conversations will start to build a roadmap of the requirements and assumptions for an attainable, sustainable and cost-effective DR strategy.
If you need DR planning assistance, AIM can help by performing a comprehensive DR assessment and recommending the best strategy for your organization.