Introducing Orchestrated Disaster Recovery for VMware Applications and Data

Unplanned downtime and data loss caused by natural disasters and modern cyber threats represent some of the most challenging events facing organizations today. Maintaining capabilities to reduce or eliminate impact in these scenarios is critical to any business continuity plan. Organizational resilience depends upon being able to protect and ensure the recoverability of data and services wherever and whenever disaster strikes.

These challenges can lead teams to purchase disparate solutions from multiple vendors, with no simple path to integrating them. Assembling a complete solution from products of such varied origin can slow rollout and increase costs over time.

Tackling Management Silos

AppFlows eliminates the need for point solutions by extending the capabilities of the leading Rubrik data management platform. The simplicity of this approach can help organizations avoid unnecessary product integration costs and reduce risk to their business. By combining CDM appliances with the SaaS delivery model of the Rubrik Polaris platform, Rubrik unifies backup and recovery, continuous data protection (CDP), replication, and disaster recovery under one solution that is easy to deploy and manage.

What’s Covered?

Rubrik Polaris AppFlows offers disaster recovery orchestration for on-premises VMware environments as well as protection and migration of on-premises VMware environments to VMware Cloud on AWS (VMC). Because ransomware is top of mind for many of our customers, another exciting component we’ll be adding is local, in-place recovery for ransomware attacks.

Rubrik Polaris AppFlows is designed to serve multiple use cases building upon the value of the Rubrik CDM platform:

Back up critical VMs with RPOs as low as 60 seconds using Continuous Data Protection (CDP)
Automate local and remote disaster recovery to slash downtime and minimize data loss
Demonstrate compliance with DR testing “fire drills” that do not interfere with production
Ransomware remediation – integration with Rubrik Polaris Radar to identify impacted applications and rapidly recover them in-place using the Radar-recommended points in time just prior to the data loss event
Protect and/or migrate VMware on-premises to VMware Cloud on AWS (VMC)

How it Works

AppFlows uses the data protection SLAs and two-site replication topology of Rubrik CDM clusters as its data plane. AppFlows runs in Rubrik Polaris as a SaaS application, where users define application Blueprints and execute command and control from the cloud. The Blueprints orchestrate the existing backup data of each VM for disaster recovery, both in-place and remotely to the opposite site. Data is either staged pre-emptively to a production vSphere Datastore or copied on demand, which allows for tiered DR service definitions and defers the need for storage capacity until it is actually required. Both failover and failback are completely automated. For local, in-place recovery, data is rewound to the desired point in time using vSphere Changed Block Tracking (CBT) to keep RTOs low.
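To make the rewind idea concrete, here is a minimal Python sketch. It is an illustration only and not how AppFlows or vSphere implement CBT: the function names, the per-snapshot change sets, and the dictionary-based disk image are all assumptions made for the example. The point it shows is that moving a staged image back to an earlier recovery point only requires copying the blocks that changed after that point.

```python
from typing import Dict, List, Set


def blocks_to_rewind(change_sets: List[Set[int]], target_index: int) -> Set[int]:
    """Return the block IDs that changed after the target snapshot.

    change_sets[i] holds the blocks written between snapshot i-1 and
    snapshot i (hypothetical bookkeeping for this example).
    """
    if not 0 <= target_index < len(change_sets):
        raise IndexError("target snapshot does not exist")
    changed: Set[int] = set()
    for change_set in change_sets[target_index + 1:]:
        changed |= change_set
    return changed


def rewind(staged: Dict[int, bytes], target_image: Dict[int, bytes], blocks: Set[int]) -> int:
    """Overwrite only the given blocks in place; return how many were copied."""
    for block_id in blocks:
        staged[block_id] = target_image[block_id]
    return len(blocks)


# Three snapshots: the first is a full image, later ones touch one block each.
change_sets = [{0, 1, 2, 3}, {0}, {2}]
staged = {0: b"A2", 1: b"B0", 2: b"C2", 3: b"D0"}        # latest recovery point
target_image = {0: b"A0", 1: b"B0", 2: b"C0", 3: b"D0"}  # snapshot 0 on the backup side

copied = rewind(staged, target_image, blocks_to_rewind(change_sets, 0))
```

In this example only two of the four blocks are copied, which is why a rewind can finish much faster than a full export.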

Define a Blueprint

Blueprints are the building blocks that define a group of VMs as part of the same business service or application. Each VM within the Blueprint uses its existing SLA assignment, including replication topology and backup frequency, to enable data movement to the alternate site. Failback is handled by assigning a secondary SLA, with replication in reverse, for the VMs to use when they arrive in the DR location. This secondary SLA assignment does not require a new snapshot chain or a new full backup, because the CDM 6.0 release adds the ability to link the snapshot chains between sites despite the differing VM UUIDs. A resource map that includes target compute, storage, and network is defined for each virtual machine. The resource maps are monitored for alignment with the vCenter infrastructure, and alarms are triggered if they become misaligned and are no longer usable. Boot order groups ensure each VM is started in the order that makes sense for the application it supports. Lastly, custom post-scripts may be defined for each VM and are executed at the privilege level of the Rubrik Backup Service (RBS) agents.

Blueprint Attributes:

Name
Source & Target CDM clusters - declares the existing data replication topology to use for failover/failback
Virtual Machines
Boot Order - each VM is assigned to one of up to 5 boot order groups enabling orderly multi-tier application startup
Resource Mappings - Compute, Storage and Network* mappings for each VM in the DR site
Post-scripts - user-defined commands or scripts to run per VM during the verification phase of the failover/failback event
DR Site SLAs - ensures protection after failover and replication back to the primary site for simple failback

*Includes separate DR failover and test failover network definitions
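As a rough mental model, the attributes above could be captured in a data structure like the following Python sketch. This is not the AppFlows schema or API: every class, field name, and example value here is hypothetical, chosen only to mirror the attribute list. The one rule it enforces, the limit of 5 boot order groups, comes from the list above.

```python
from dataclasses import dataclass, field
from typing import List

MAX_BOOT_GROUPS = 5  # the attribute list allows up to 5 boot order groups


@dataclass
class ResourceMap:
    """Target resources for one VM in the DR site (hypothetical field names)."""
    compute: str
    datastore: str
    failover_network: str
    test_network: str  # separate network used for test failovers


@dataclass
class BlueprintVM:
    name: str
    boot_group: int
    resources: ResourceMap
    post_scripts: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not 1 <= self.boot_group <= MAX_BOOT_GROUPS:
            raise ValueError(f"boot_group must be between 1 and {MAX_BOOT_GROUPS}")


@dataclass
class Blueprint:
    name: str
    source_cluster: str
    target_cluster: str
    dr_site_sla: str
    vms: List[BlueprintVM] = field(default_factory=list)


res = ResourceMap("dr-cluster-01", "dr-datastore-01", "dr-prod-net", "dr-test-net")
bp = Blueprint(
    name="order-processing",
    source_cluster="cdm-primary",
    target_cluster="cdm-dr",
    dr_site_sla="gold-reverse-replication",
    vms=[
        BlueprintVM("db01", 1, res, ["/opt/scripts/check_db.sh"]),
        BlueprintVM("web01", 2, res),
    ],
)
```

Keeping the DR and test networks as separate fields reflects the footnote above: a test failover can land on an isolated network without touching the mapping used for a real event.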

Tighter RPOs with Continuous Data Protection

Rubrik CDM 6.0 will introduce an extended recovery journal of up to 24 hours and will continue to deliver RPOs as low as 60 seconds, as it does today. AppFlows Blueprints can contain VMs protected by CDP and use those recovery points during either local in-place recovery or remote DR. While certain workloads may require CDP to meet data loss objectives, it may not be cost-effective to apply this level of protection to workloads such as web servers or other application services with lower change rates. For this reason, a mix of SLA assignments is permitted within Blueprints to ensure the most resource-efficient protection possible.

Failover / Failback

Both failover and failback are accomplished in the UI by selecting one or more Blueprints upon which to act and clicking the desired action button to launch the recovery page. On the recovery page, each Blueprint’s VMs are listed, including the latest available recovery point and an option for viewing and selecting alternatives. An error handling option must be selected for the job: Pause, Ignore and Continue (real failover only), or Abort and Cleanup. Once the error handling option and all recovery points have been locked in, the recovery is ready to begin and is launched with one additional click. If you are executing a real DR failover or a local in-place recovery, you will be prompted to confirm the action, as it can be destructive if launched unintentionally. If you are launching a failover test, no confirmation prompt is presented and the job launches immediately.

Once a failover is started, the recovery of VMs in each Blueprint will execute the following sequence of events:

The recovery point in time selected for each VM is made ready on the Datastore defined in the resource map. This can be a full copy from the CDM cluster using Export or, if the VM was staged using Incremental Export, the latest recovery point is already in place. If Incremental Export is in use but a recovery point other than the latest is wanted, CBT is used to rewind the image to the desired point in time by copying only the changed blocks from the CDM cluster and overwriting the latest recovery point in-place.
Once the image is in place, the VM is imported into vCenter inventory.
Following the defined boot order priority, each VM is powered on.
Once the VM has booted, AppFlows awaits the start of VMware Tools and begins the network re-configuration.
After the network is configured, the VM is rebooted one last time, and AppFlows awaits the start of the RBS agents and executes the post-scripts.
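The sequence above can be sketched as a simple ordered plan. The Python below is an illustration under stated assumptions, not AppFlows code: the step names and the `failover_plan` function are hypothetical, and the sketch assumes each VM runs through all of its steps in boot group order, whereas the real orchestration may interleave steps across VMs differently.

```python
from typing import Dict, List, Tuple

# Hypothetical step names mirroring the sequence described above.
VM_STEPS = [
    "stage_image",
    "import_to_vcenter",
    "power_on",
    "wait_for_vmware_tools",
    "reconfigure_network",
    "reboot",
    "wait_for_rbs",
    "run_post_scripts",
]


def failover_plan(boot_groups: Dict[str, int]) -> List[Tuple[int, str, str]]:
    """Return (boot_group, vm, step) tuples, lowest boot group first.

    VMs in the same group are ordered by name only to keep the
    output deterministic for this example.
    """
    plan: List[Tuple[int, str, str]] = []
    ordered = sorted(boot_groups.items(), key=lambda item: (item[1], item[0]))
    for vm, group in ordered:
        for step in VM_STEPS:
            plan.append((group, vm, step))
    return plan


plan = failover_plan({"web01": 3, "db01": 1, "app01": 2})
power_on_order = [vm for _, vm, step in plan if step == "power_on"]
```

With these inputs the database tier is powered on first, then the application tier, then the web tier, which is the kind of multi-tier startup the boot order groups are meant to guarantee.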

Should an error occur at any point in this sequence, the error handling option selected when specifying the job’s parameters takes effect. If a real failover job is launched while the primary site is still online, the VMs in the primary site are first powered off, before the steps described above, to avoid any split-brain-like effects from two copies of the same VM running in two different sites.
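The three error handling options can be thought of as policies applied when a step fails. The following Python sketch is hypothetical and only models the behavior described in this post: the `OnError` enum, the `run_job` function, and the log format are assumptions, and what cleanup actually involves is not specified here, so it is reduced to a log entry.

```python
from enum import Enum
from typing import Callable, List, Tuple


class OnError(Enum):
    PAUSE = "pause"
    IGNORE_AND_CONTINUE = "ignore_and_continue"
    ABORT_AND_CLEANUP = "abort_and_cleanup"


def run_job(
    steps: List[Tuple[str, Callable[[], None]]],
    on_error: OnError,
    is_test: bool,
) -> Tuple[str, List[str]]:
    """Run steps in order and apply the chosen policy on the first failure."""
    if is_test and on_error is OnError.IGNORE_AND_CONTINUE:
        # The post states this option is offered only for real failover.
        raise ValueError("Ignore and Continue is only available for real failover")
    log: List[str] = []
    for name, action in steps:
        try:
            action()
            log.append(f"ok:{name}")
        except Exception:
            log.append(f"error:{name}")
            if on_error is OnError.PAUSE:
                return "paused", log
            if on_error is OnError.ABORT_AND_CLEANUP:
                log.append("cleanup")  # placeholder for removing recovered VMs
                return "aborted", log
            # IGNORE_AND_CONTINUE: record the error and move to the next step
    return "completed", log


def ok() -> None:
    pass


def fail() -> None:
    raise RuntimeError("network reconfiguration failed")


steps = [("power_on", ok), ("reconfigure_network", fail), ("run_post_scripts", ok)]
status, log = run_job(steps, OnError.IGNORE_AND_CONTINUE, is_test=False)
```

Here the job still completes under Ignore and Continue, while the same failure under Pause or Abort and Cleanup would stop at the network step.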

After arriving in the DR site, all VMs continue with their backups as defined by the secondary SLA assignment selected at the time of Blueprint creation. Under this SLA, VM backups are captured while in the DR site and are then replicated either back to the primary site or to an alternate site, whichever is defined in the secondary SLA. When failback is launched, the job presents an option to first capture an on-demand snapshot and replicate it to the failback site to ensure a very tight RPO. Alternatively, the user may decide to use the recovery points already available from the assigned SLAs, choosing a faster failback over zero data loss.

Ransomware Remediation for Applications

None of what you have read about AppFlows features to this point is terribly unique compared to some of the incumbent DR solutions in the marketplace today. Understanding this fact, and the fact that modern disasters increasingly take the form of cyber attacks such as ransomware, we decided to up our game. The combination of Rubrik Polaris Radar and AppFlows will be uniquely positioned to increase visibility for data center operators working to remediate their impacted business services after a ransomware attack encrypts their data. Using an AppFlows-enhanced workflow, Radar will now identify impacted VM membership within Blueprints and present an easy means to rapidly recover entire applications or business services to a point in time prior to the encryption event. This means that rather than spending significant time triaging and remediating individual VMs with limited understanding of their relationships, operators can now focus on what is most important: an application-focused recovery enabling rapid restoration of business-critical services.
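The application-level view comes down to mapping a list of impacted VMs onto Blueprint membership. The short Python sketch below illustrates that mapping only; it is not the Radar or AppFlows API, and the function name, input shapes, and VM names are all hypothetical.

```python
from typing import Dict, List, Set


def impacted_blueprints(
    blueprints: Dict[str, Set[str]],
    impacted_vms: Set[str],
) -> Dict[str, List[str]]:
    """Map each Blueprint name to its impacted member VMs, skipping clean ones."""
    result: Dict[str, List[str]] = {}
    for name, members in blueprints.items():
        hits = sorted(members & impacted_vms)
        if hits:
            result[name] = hits
    return result


blueprints = {
    "order-processing": {"db01", "app01", "web01"},
    "hr-portal": {"hr-db", "hr-web"},
    "reporting": {"rpt01"},
}
impacted = impacted_blueprints(blueprints, {"db01", "web01", "hr-web"})
```

An operator looking at this result sees two affected applications rather than three unrelated VMs, and can recover each Blueprint as a unit.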

DR Testing and Compliance Reporting

One lucky thing about disaster recovery plans is how infrequently they need to be executed. That being said, readiness is predicated on an operator’s ability to demonstrate that the plans are functional and will serve the business when the time comes. The best way to do this is to test, and test often. With AppFlows, DR plan testing is easier than ever and compliance reporting is included out of the box. DR Compliance Reports will include details about the execution of each test, including dates, times, and details for each step of the process. Better still, and this one is my personal favorite, cleanup is an automated process. With two clicks you can sweep the board of all of the VMs copied into the DR test environment. No more long hours mopping up vCenter for the operator who draws the short straw at the end of the testing cycle. Given this capability, automated testing that runs on a schedule and produces reports is now a true possibility and can save significant staff hours on what is otherwise a labor-intensive exercise.

Rubrik AppFlows can eliminate the need to manage a separate DR solution for VMware environments. Through orchestration integrated with data protection, it allows organizations to consistently meet their service levels in a more predictable and efficient manner, including critical DR capabilities at a lower total cost of ownership.
