Planning a Successful Disaster Recovery Strategy
Here at Rubrik, we strive to provide a strong suite of tools that make backup and disaster recovery (DR) as easy as possible. We’ve developed technologies like Live Mount, CloudOut, and CloudOn to help businesses recover quickly from a disaster and minimize downtime. In my experience, technology—no matter how advanced—is only a piece of the DR equation. A fully developed DR plan requires people, process, and technology in order to truly be successful. Here’s a look at some of the top considerations when developing your DR strategy:
First and foremost, you need the right stakeholders and an executive sponsor to sign off on your strategy so that they can go to bat for you when people and resources are allocated. Successful DR planning requires time and input from many people across your organization, and the larger the company, the more people are involved. Your team should represent all key areas of your company, with smaller groups concentrated on logistics and operations where necessary. Here are some examples of roles to include on your team:
- Business Unit Stakeholders: Identify critical applications, success criteria, and potential roadblocks. Provide input to other teams as needed.
- Application Owners/Analysts: Own dependency mapping and success criteria for application tests.
- IT Infrastructure Engineers (Compute/Networking/Storage/Voice/Databases): Perform the bulk of the work during tests and events. Specify DR site resource sizing, connectivity, and automation tooling.
Everyone must be clear about their roles and responsibilities, and you may need to account for anyone who is a “single point of failure” on the team. I’ve seen more than one effort fall apart because a critical team member did not complete an assigned task.
Start by completing a responsibility assignment matrix, or RACI (Responsible, Accountable, Consulted and Informed), and project plan so everyone understands what they are expected to do, and when they are expected to do it. A RACI matrix is used to determine the tasks each individual or group owns. Typically, every task in the matrix will have a responsible and accountable party assigned, and the other values are assigned as needed. Having effective project management makes this process much easier to execute.
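To make the RACI idea concrete, here is a minimal sketch of a matrix held as a Python dictionary, with a check that flags tasks missing an owner. The task names, the role assignments, and the `find_gaps` helper are all made up for illustration, and the "exactly one Accountable party per task" rule is a common RACI convention that I am assuming here, not a requirement of any particular tool.

```python
from collections import Counter

# Hypothetical RACI matrix: task -> {party: role letter (R, A, C, or I)}.
raci = {
    "Identify critical applications": {
        "Business Unit Stakeholders": "A",
        "Application Owners": "R",
        "IT Infrastructure": "C",
    },
    "Map application dependencies": {
        "Application Owners": "R",
        "IT Infrastructure": "C",
    },
    "Size DR site resources": {
        "IT Infrastructure": "A",
        "Business Unit Stakeholders": "I",
    },
}


def find_gaps(matrix):
    """Return {task: [problems]} for tasks without an R or without exactly one A."""
    gaps = {}
    for task, assignments in matrix.items():
        counts = Counter(assignments.values())
        problems = []
        if counts["R"] == 0:
            problems.append("no Responsible party")
        if counts["A"] != 1:
            problems.append(f"expected 1 Accountable party, found {counts['A']}")
        if problems:
            gaps[task] = problems
    return gaps


gaps = find_gaps(raci)
for task, problems in gaps.items():
    print(f"{task}: {', '.join(problems)}")
```

In this made-up matrix, two of the three tasks are flagged, which is exactly the kind of gap you want to catch in a planning meeting instead of during a test.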
With your team in place and basic responsibilities assigned, it’s time to dive into the details. First, determine which systems are in scope, along with your target Recovery Point Objective (RPO) and Recovery Time Objective (RTO).
- An RPO is a decision about how much data you are willing to lose in the case of a disaster, and it is typically associated with the frequency at which backups are run.
- An RTO is the target timeframe for returning to normal operations after a disaster, which is roughly the amount of time it would take to execute a DR plan.
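A quick back-of-the-envelope check can show whether a plan is even in the neighborhood of its targets. The sketch below is only an illustration: the targets, the backup interval, the replication lag, and the runbook steps are invented numbers, and it assumes recovery steps run one after another.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval, replication_lag=timedelta(0)):
    # A failure just before the next backup loses roughly one full interval,
    # plus any delay in getting the last backup to the DR site.
    return backup_interval + replication_lag


def estimated_recovery_time(step_durations):
    # Assumes the steps run sequentially; parallel steps would shorten this.
    return sum(step_durations.values(), timedelta(0))


# Hypothetical targets and plan details.
rpo_target = timedelta(hours=4)
rto_target = timedelta(hours=8)

backup_interval = timedelta(hours=1)
replication_lag = timedelta(minutes=30)
runbook = {
    "declare disaster and assemble team": timedelta(minutes=45),
    "restore databases": timedelta(hours=3),
    "restore application servers": timedelta(hours=2),
    "redirect network traffic": timedelta(hours=1),
    "validate applications": timedelta(hours=1, minutes=30),
}

data_loss = worst_case_data_loss(backup_interval, replication_lag)
recovery_time = estimated_recovery_time(runbook)
meets_rpo = data_loss <= rpo_target
meets_rto = recovery_time <= rto_target

print(f"Worst-case data loss: {data_loss} (meets RPO: {meets_rpo})")
print(f"Estimated recovery time: {recovery_time} (meets RTO: {meets_rto})")
```

With these example numbers the plan comfortably meets its RPO but misses its RTO by 15 minutes, so the team would need to shorten or parallelize a step, or revisit the target.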
Almost all other decisions tie back to your RPO and RTO. Developing a plan is a difficult process that can take months or even years to complete. Think about each distinct failure condition, not just the worst-case scenario. A critical application failing can have the same impact as your data center being hit by a meteor.
Regular testing is the most crucial, and most time-consuming, piece of your DR strategy. Success criteria must be well-defined and measurable, but it is important to remember that a failed test doesn’t mean your team has failed. You’ve actually uncovered a flaw in your plan that can be fixed before a real disaster happens. Everyone strives for a successful test on the first try, but this rarely occurs. Make sure the team has realistic expectations, especially for the first few tests.
Don’t forget your failback plan. You may test your DR process many times, but often the DR environment is torn down once the test is complete. In the event of a true disaster, you need to be able to migrate your workloads back to your data center. Pay special attention to this if you are using a cloud provider as a DR site. There is a reason why people refer to the cloud as “Hotel California”: it is often much easier to get workloads into the cloud than it is to get them out.
Your existing technology stack has major implications for how difficult DR will be. Are you 100% virtualized? Congratulations, you will almost certainly have fewer headaches than those with lots of bare metal workloads! IT shops still running mainframes or non-x86 workloads are in a particularly tough spot since they are limited to DR/cloud providers that support these systems. Couple this with the fact that mainframes are typically older systems, and it becomes clear that enterprises in this situation must have a well-thought-out DR strategy. While it’s not usually a part of the DR discussion, modernizing your technology stack is an important consideration.
Networking is a common area of concern and a frequent stumbling block for every business. Make sure you understand the implications of moving public IP addresses if your business serves resources on the internet. Often, this requires submitting a Letter of Agency to the ISP at your DR site, authorizing them to advertise your IP address range. In some cases, public IPs cannot be moved, and businesses must orchestrate a DNS change or implement some flavor of Global Server Load Balancing (GSLB). Depending on the complexity of your network, consider how to keep the network at your DR site up to date as you add VLANs or subnets in production. Having fast, redundant connectivity to your DR site is crucial if you rely on storage replication for some part of your DR. Storage arrays may not gracefully handle degraded or disrupted connectivity between sites.
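One way to keep an eye on network drift is to periodically compare the subnets defined in production with those defined at the DR site. The sketch below uses Python's standard `ipaddress` module. The subnet lists are invented, the `missing_at_dr` helper is hypothetical, and it assumes your DR site is meant to mirror production addressing, which will not be true for every design.

```python
import ipaddress


def missing_at_dr(production_subnets, dr_subnets):
    """Return production subnets that have no matching subnet at the DR site."""
    prod = {ipaddress.ip_network(s) for s in production_subnets}
    dr = {ipaddress.ip_network(s) for s in dr_subnets}
    # Sort on plain values so mixed IPv4/IPv6 input does not raise a TypeError.
    return sorted(
        prod - dr,
        key=lambda n: (n.version, int(n.network_address), n.prefixlen),
    )


# Hypothetical inventories, e.g. exported from your IPAM or switch configs.
production = ["10.10.0.0/24", "10.10.1.0/24", "10.10.2.0/24"]
dr_site = ["10.10.0.0/24", "10.10.1.0/24"]

missing = missing_at_dr(production, dr_site)
for net in missing:
    print(f"Subnet {net} exists in production but not at the DR site")
```

Where the inventory comes from is up to you. The point is simply that the comparison runs on a schedule and is not left to memory.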
Properly sizing your DR site can be more of an art than a science. Many businesses assume that compute, networking, and storage resources must be sized exactly the same as production. This can lead to sticker shock, and in some cases, a warm DR site full of equipment that is only used a fraction of the time. Try to strike a balance by remembering that in a true disaster, keeping applications available usually matters more than matching production performance.
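As a rough illustration of that trade-off, the sketch below estimates a DR footprint that covers only critical workloads and accepts reduced compute. Every number and name here is hypothetical, including the `performance_factor` of 0.5 and the choice to keep memory and storage at full size while scaling only vCPUs. Treat it as a starting point for a conversation, not a sizing formula.

```python
import math
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    vcpus: int
    memory_gb: int
    storage_tb: float
    critical: bool


def dr_footprint(workloads, performance_factor=0.5):
    """Size DR for critical workloads only, scaling vCPUs by performance_factor.

    Memory and storage stay at full size in this sketch, on the assumption
    that the applications and their data still have to fit.
    """
    in_scope = [w for w in workloads if w.critical]
    return {
        "vcpus": math.ceil(sum(w.vcpus for w in in_scope) * performance_factor),
        "memory_gb": sum(w.memory_gb for w in in_scope),
        "storage_tb": sum(w.storage_tb for w in in_scope),
    }


# Hypothetical production inventory.
production = [
    Workload("order-db", vcpus=16, memory_gb=128, storage_tb=4.0, critical=True),
    Workload("web-frontend", vcpus=8, memory_gb=32, storage_tb=0.5, critical=True),
    Workload("reporting", vcpus=12, memory_gb=64, storage_tb=2.0, critical=False),
]

footprint = dr_footprint(production)
print(footprint)
```

In this example, production runs 36 vCPUs while the DR estimate comes to 12, which shows how much scoping and a tolerated performance hit can shrink the bill.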
If you’re starting to have an anxiety attack from reading this, I have good news! Rubrik Cloud Data Management makes the backup, replication, archiving and recovery of your infrastructure dead simple. Disaster Recovery is a daunting task, but with the right team, procedures and technology, it is possible.
Want another in-depth look at how you can up-level your DR strategy? Check out this blog post on evaluating options for DR in the cloud.