In the summer of 2021, Rubrik officially released its first SaaS-based automated Disaster Recovery (DR) solution, Orchestrated Application Recovery. Orchestrated Application Recovery is incredibly easy to use: there are no new binaries to install and no integration work between different vendors’ products. Simply subscribe to the service, define your Blueprints, and everything is at your fingertips: it takes just two clicks to trigger a local (in-place) recovery, a test failover, or a production failover.
In this article, we are taking a deep dive into some demanding DR features: how does Orchestrated Application Recovery achieve near-zero RPO & RTO at the same time?
Understanding RPO and RTO
Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are two of the most critical metrics of a data protection plan and disaster recovery strategy. These measurements are critical for businesses to achieve their service level agreement (SLA) of application and data availability. Despite their naming similarities, RPO and RTO serve completely different purposes.
Recovery Point Objective = Data loss.
RPO refers to the maximum acceptable amount of data loss an application can undergo before causing measurable harm to the business. In other words, it is the point in time to which you can recover an application in the event of a disaster.
Recovery Time Objective = Downtime.
RTO refers to the maximum acceptable downtime an application can experience before there is a measurable business loss. In other words, it is the time allowed to recover data and applications in the event of a disaster.
Both RPO and RTO have a direct impact on the business continuity of an enterprise or organization.
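To make the two metrics concrete, here is a minimal sketch that measures the data-loss window and the downtime of a single incident and compares them against the objectives. The function names, timestamps, and one-hour objectives are illustrative assumptions, not part of any Rubrik API.

```python
from datetime import datetime, timedelta


def measure_incident(last_recovery_point, failure_time, service_restored):
    """Return (data_loss_window, downtime) for one incident.

    data_loss_window: changes made after the last recovery point are lost.
    downtime: how long the application was unavailable.
    """
    data_loss_window = failure_time - last_recovery_point
    downtime = service_restored - failure_time
    return data_loss_window, downtime


def meets_objectives(data_loss_window, downtime, rpo, rto):
    """Return (rpo_met, rto_met) as booleans."""
    return data_loss_window <= rpo, downtime <= rto


# Hypothetical incident: last snapshot at 08:00, failure at 12:00,
# service restored at 12:45.
data_loss, downtime = measure_incident(
    last_recovery_point=datetime(2021, 8, 1, 8, 0),
    failure_time=datetime(2021, 8, 1, 12, 0),
    service_restored=datetime(2021, 8, 1, 12, 45),
)
rpo_met, rto_met = meets_objectives(
    data_loss, downtime, rpo=timedelta(hours=1), rto=timedelta(hours=1)
)
print(data_loss, downtime, rpo_met, rto_met)
```

In this made-up incident the RTO is met (45 minutes of downtime) while the RPO is missed (4 hours of lost changes), which shows that the two objectives are independent of each other.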
Continuous Data Protection
Now we have a good understanding of what RPO and RTO mean for data protection and disaster recovery. The primary goal of a DR administrator is to minimize data loss and downtime as much as possible. Snapshot technology is mature and effective at protecting data at discrete points in time (typically every few hours), but for mission-critical applications even a small amount of data loss can have a significant impact on the business. To ensure business continuity, these applications have a much more aggressive RPO: their data needs to be protected every time it changes and must be restorable to the last point of modification.
The Rubrik CDM 5.1 release delivered VMware-certified and natively integrated continuous data protection (CDP), providing near-zero RPOs for VMware environments as a seamless option in the SLA Domains by which customers define their data protection policies. By leveraging a journal-based approach, Rubrik offers a continuous stream of recovery points so that enterprises can minimize data loss after a failure or ransomware attack. One click within the SLA policy engine enables continuous data protection for the most critical VMs, eliminating the complexity of installing and managing yet another point solution.
After creating an initial full backup of your data, CDP operates in the background, making note of every subsequent disk change within a specified time frame and storing it in a journal file. By recording all the changes up to a failure, you’ll be able to review the log and easily roll your system back to the point you desire. The automatic, continual recording of changes gives you the flexibility of recovering data to a much more granular degree than other backup methods that restore to a previous point in time.
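The journal idea described above can be illustrated with a small sketch. This is a simplified model, not Rubrik’s implementation: the `Journal` class, its block-level layout, and the integer timestamps are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Journal:
    """Toy journal: an initial full backup plus an ordered log of writes."""

    base: dict  # block number -> data, captured by the initial full backup
    entries: list = field(default_factory=list)  # (timestamp, block, data)

    def record(self, timestamp, block, data):
        # Assumes writes are recorded in non-decreasing timestamp order.
        self.entries.append((timestamp, block, data))

    def restore(self, point_in_time):
        """Rebuild the disk image as it was at point_in_time."""
        image = dict(self.base)
        for timestamp, block, data in self.entries:
            if timestamp > point_in_time:
                break  # later writes are not part of this recovery point
            image[block] = data
        return image


journal = Journal(base={0: b"A", 1: b"B"})
journal.record(10, 0, b"A1")
journal.record(20, 1, b"B1")
journal.record(30, 0, b"corrupted")  # e.g. a write made by ransomware

# Roll back to just before the bad write at t=30.
print(journal.restore(25))
```

Because every write is kept with its timestamp, any moment covered by the journal is a valid recovery point, not just the moments at which a snapshot happened to be taken.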
Rubrik CDM 5.3 and the Summer 2021 release further enhanced the CDP service to support:
1. Automatic recovery (a.k.a. “CDP Resync”) after interruption without requiring a snapshot. With this groundbreaking technology, CDP is much more resilient in recovering a VM from an out-of-sync state with as little downtime as possible.
2. A retention window extended from 4 hours to 24 hours. A longer retention window not only provides a longer history of continuous recovery points but also allows less frequent snapshots, which reduces backup workload and VM stun time.
From day one, Orchestrated Application Recovery has integrated with CDP seamlessly. When creating an Orchestrated Application Recovery Blueprint, we can add VMs protected by different SLA Domains to the same Blueprint:
1. For regular snapshot-based SLA, the available recovery points will be discrete points in time of the snapshots.
2. For CDP-enabled SLA, the available recovery points will be a continuous recovery window in the past 24 hours.
By default, we can choose the latest recovery point at the target to recover a Blueprint to the latest available point in time and minimize data loss, but Orchestrated Application Recovery also lets you choose an older recovery point for each VM. Without CDP, we can recover to the latest available snapshot, which was typically captured a few hours earlier. With CDP, we can recover to the latest CDP recovery point, which is typically only a few seconds old.
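The difference between discrete and continuous recovery points can be sketched as follows. The dictionary layout, the `resolve_recovery_point` helper, and the timestamps (seconds) are hypothetical and only model the behavior described above; they do not reflect the product’s actual data model or API.

```python
import bisect


def resolve_recovery_point(vm, requested=None):
    """Pick the recovery point to use for one VM in a Blueprint.

    requested=None means "latest available". Returns None when no
    recovery point exists at or before the requested time.
    """
    if vm["kind"] == "cdp":
        start, end = vm["window"]  # continuous window of recovery points
        if requested is None or requested >= end:
            return end
        return requested if requested >= start else None

    snapshots = vm["snapshots"]  # discrete points, sorted ascending
    if not snapshots:
        return None
    if requested is None:
        return snapshots[-1]
    # Latest snapshot taken at or before the requested time.
    index = bisect.bisect_right(snapshots, requested)
    return snapshots[index - 1] if index else None


snapshot_vm = {"kind": "snapshot", "snapshots": [0, 14400, 28800]}
cdp_vm = {"kind": "cdp", "window": (0, 43195)}

print(resolve_recovery_point(snapshot_vm))         # latest snapshot
print(resolve_recovery_point(cdp_vm))              # end of the CDP window
print(resolve_recovery_point(snapshot_vm, 20000))  # falls back to a snapshot
print(resolve_recovery_point(cdp_vm, 20000))       # exact point in time
```

For the snapshot-based VM, a request for an arbitrary time falls back to the nearest earlier snapshot, whereas the CDP-enabled VM can honor the exact time as long as it lies inside the retention window.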
In the DR context, we are familiar with terms such as backup and recovery, replication and export, etc., but what is hydration? The term “hydration” has different meanings in different contexts. Generally speaking, hydration refers to the process of filling an object with data. The object could be an XML file, a DB instance, or a DR-protected entity such as a virtual machine. An object that is not hydrated has to be fully populated with data when it is needed, which might incur a huge delay. Hydration is usually done for performance reasons.
For Rubrik Orchestrated Application Recovery, hydration can be looked at as a reverse process of backup:
1. Backup is the process of ingesting data from the source object which is being protected.
2. Hydration is the process of exporting data to the target object which is being recovered.
The key difference between Orchestrated Application Recovery hydration and traditional export is that the former supports incremental data transfer. In other words, if hydration is enabled for a Blueprint, a full copy is transferred to the target site only the first time. All subsequent data transfers are incremental: only the data that has changed since the last snapshot (or CDP recovery point) is transferred. Incremental hydration significantly reduces the recovery time of large Blueprints with terabytes of data.
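A minimal sketch of the incremental behavior is shown below. It assumes that the set of blocks changed since the previous hydration is already known (for example, from change tracking on the source side); the `HydratedTarget` class and its block-level model are illustrative assumptions rather than the actual implementation.

```python
class HydratedTarget:
    """Toy model of a target-site copy that is kept hydrated."""

    def __init__(self):
        self.blocks = {}
        self.hydrated = False

    def hydrate(self, source_blocks, changed_since_last=()):
        """Transfer data to the target; return the number of blocks sent."""
        if not self.hydrated:
            to_send = set(source_blocks)  # first run: full copy
            self.hydrated = True
        else:
            to_send = set(changed_since_last)  # later runs: churn only
        for block in to_send:
            self.blocks[block] = source_blocks[block]
        return len(to_send)


source = {block: b"v1" for block in range(1000)}
target = HydratedTarget()

first = target.hydrate(source)  # full transfer of all 1000 blocks

source[3] = b"v2"
source[7] = b"v2"
second = target.hydrate(source, changed_since_last={3, 7})

print(first, second)  # prints "1000 2"
```

After the first run, the cost of each hydration depends on how many blocks changed, not on how many blocks the object contains.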
If we use Big-O notation to compare Orchestrated Application Recovery hydration with traditional export, the “time complexity” is like this:
Orchestrated Application Recovery Hydration: O(C), where C is the data churn (i.e., the changes relative to the prior snapshot). Because C is typically a small fraction of the object size, the transfer time is almost constant regardless of the size of the objects to be recovered.
Traditional Export: O(N), linear-time, where N is the size of the objects to be recovered.
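A quick back-of-the-envelope calculation shows what this difference means in practice. The object size, churn, and link speed below are made-up numbers chosen for illustration, and the estimate ignores protocol overhead, compression, and deduplication.

```python
GB = 10**9
TB = 10**12


def transfer_seconds(num_bytes, link_bits_per_second):
    """Idealized time to move num_bytes over a link of the given speed."""
    return num_bytes * 8 / link_bits_per_second


link = 10**9               # assumed 1 Gbit/s link to the recovery site
object_size = 10 * TB      # N: assumed total size of the Blueprint's VMs
churn = 100 * GB           # C: assumed changes since the prior snapshot

full_export = transfer_seconds(object_size, link)  # O(N)
hydration = transfer_seconds(churn, link)          # O(C)

print(f"traditional export: {full_export / 3600:.1f} hours")
print(f"incremental hydration: {hydration / 60:.1f} minutes")
```

Under these assumptions a traditional export moves 10 TB and takes roughly 22 hours, while incremental hydration moves 100 GB and takes roughly 13 minutes. Doubling the object size doubles the export time but leaves the hydration time unchanged as long as the churn stays the same.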
Orchestrated Application Recovery hydration happens automatically without user intervention. All the protected objects will be incrementally transferred to the target site, and ready for recovery in case any unexpected disaster happens. Below is a high-level architecture that shows the data flow from the source production site, all the way to the target recovery site.
Smart businesses know the value of protecting their data, but they have a lot of options when it comes to data protection methods and frequency. The bottom-line goal with any data protection solution is to ensure that you can restore data and get operations back up and running as quickly as possible after a disruption, such as software failure or data corruption. Two questions you might ask when deploying your DR solution:
Question #1: Is it technically possible to get your RPO and RTO to near-zero? With Rubrik Orchestrated Application Recovery, the answer is YES!
Question #2: Do you want to? It depends.
CDP enables an added layer of protection for mission-critical VMware workloads. The primary benefit of CDP is a near-zero recovery point objective (RPO): you suffer little to no data loss if you need to restore. The always-on nature of continuous data protection means your backup copy is continually up to date, so if you experience data loss, CDP can recover that data virtually in real time.
CDP isn’t the end-all, be-all of backup solutions, however. It brings several challenges to consider. CDP requires physical disk storage that offers fast performance, which can raise costs. CDP also puts increased pressure on your memory and network resources. Because every change or bit of new data is saved to backup in real time, your data throughput is essentially doubled, and an extra memory buffer is needed to handle sudden bursts.
Similarly, Orchestrated Application Recovery hydration has trade-offs. Its main advantage is an almost constant recovery time, which makes an aggressive recovery time objective (RTO) achievable. It also avoids sudden network congestion during failover, because the incremental data transfer is spread out over time rather than concentrated at recovery. On the other hand, hydration requires extra disk space on the target site all the time, even though a recovery (Test Failover or Failover) may happen only once every few weeks or months.
With Rubrik Orchestrated Application Recovery, you have the flexibility to customize your DR solution to different service requirements. You can enable CDP to achieve near-zero data loss for your mission-critical applications while still using snapshot-based backup for your ordinary workloads. You can enable hydration to pursue an aggressive goal of minimal downtime, or use traditional export to avoid the extra space that hydration requires.