Disasters, whether naturally occurring or maliciously imposed, rarely strike when you expect. In a disaster recovery (DR) situation, you must restore your digital infrastructure quickly and efficiently. Failover and failback are business continuity tools that help sustain normal virtual operations even when your primary production site is disabled.

Think of failover and failback processes as important complementary elements of a robust DR framework. The failover operation switches production from a primary site to a backup (recovery) site. A failback returns production to the original (or new) primary location after a disaster (or a scheduled event) is resolved.

In the event of a catastrophic outage, you can quickly restore any affected system by “failing over” to a replicated copy. In this context, failover is the transfer of business-critical workloads away from a compromised primary production system to a designated recovery site, thereby restoring production system operations. Failover mitigates the effects of a disruption by sustaining operations in the face of a potentially debilitating system failure.

During failover, the remote (off-site) system copy is initialized to replace the original system. Depending on the nature of the failure event, you can fail over to the latest system image or to a specific, selected recovery image. Copying system images frequently means that you retain multiple system versions and keeps the window of potential data loss small. Failover to a maintained copy of your system is a cost-effective way to protect against IT system failures.
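The choice between the latest image and a specific recovery image can be sketched in a few lines of Python. This is a minimal illustration, not any product's API: the `Snapshot` class, the `choose_recovery_image` function, and the image IDs and timestamps are all hypothetical names and sample data introduced here.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical point-in-time system image held at the recovery site."""
    snapshot_id: str
    taken_at: datetime


def choose_recovery_image(
    snapshots: list[Snapshot], not_after: Optional[datetime] = None
) -> Snapshot:
    """Return the newest image, or the newest one taken at or before `not_after`."""
    candidates = [
        s for s in snapshots if not_after is None or s.taken_at <= not_after
    ]
    if not candidates:
        raise ValueError("no snapshot satisfies the requested recovery point")
    return max(candidates, key=lambda s: s.taken_at)


# Sample data (assumed): three images taken six hours apart.
snapshots = [
    Snapshot("img-001", datetime(2024, 1, 1, 0, 0)),
    Snapshot("img-002", datetime(2024, 1, 1, 6, 0)),
    Snapshot("img-003", datetime(2024, 1, 1, 12, 0)),
]

# Fail over to the latest image.
latest = choose_recovery_image(snapshots)

# Fail over to the last image taken before a suspected compromise at 07:00.
before_incident = choose_recovery_image(
    snapshots, not_after=datetime(2024, 1, 1, 7, 0)
)
```

Selecting an earlier image is useful when the most recent copy may already contain the fault, for example after a malicious change. The more often images are taken, the closer the selected image can sit to the moment of failure.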

Once the primary site has recovered, and any associated security risks or other failure-related issues have been resolved, you can return business operations to your production system. Failback allows you to recover the pre-disaster image at the original production system (or other selected production location) and move workloads from the recovery system back to the designated production system. It is likely, however, that incremental changes will have been made in the recovery system since failover. You must therefore synchronize the restored or new production system with the recovery system before failback to avoid losing business-critical information. When executing a failback, only the interim (altered) data retained in the recovery system should be returned to the restored or new production system.

 

Failover and Failback Summary

Here’s what to expect with failover/failback:

  1. As an initial step in any DR plan, the primary production site system is copied to a designated recovery site system. Data on the copied system mirrors the data on the source system at the instant of copying. If a triggering incident occurs, an automated failover to the recovery system begins.
  2. During a failover event, production workloads are transferred to the recovery site, and data continues to change as operations continue there. Any changes made during the failure event are written to virtual storage associated with the recovery system.
  3. After any failure-related disruption and data losses are resolved, and any known threat is mitigated, the primary production site can resume operations. At that point, the failback operation executes: production workloads return from the recovery site, and the interim update data transfers to the primary system. The restored or new production system and the recovery system can then be synchronized.
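
The three steps above can be sketched as a small in-memory model. It is a simplified illustration under stated assumptions, not a description of how any DR product works: `Site`, `DRPair`, and their methods are hypothetical names, each site's data is a plain dictionary, and the primary is assumed to have already been restored to the last replicated image before failback runs.

```python
class Site:
    """Hypothetical site holding its data as a simple key/value store."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class DRPair:
    """Toy model of a primary site paired with a recovery site."""

    def __init__(self, primary, recovery):
        self.primary = primary
        self.recovery = recovery
        self.active = primary   # site currently serving production
        self.interim = {}       # changes written while failed over

    def replicate(self):
        """Step 1: copy the primary's data to the recovery site."""
        self.recovery.data = dict(self.primary.data)

    def write(self, key, value):
        """Write to whichever site is active; track changes made after failover."""
        self.active.data[key] = value
        if self.active is self.recovery:
            self.interim[key] = value

    def failover(self):
        """Step 2: move production to the recovery site."""
        self.active = self.recovery
        self.interim = {}

    def failback(self):
        """Step 3: send only the interim changes back, then resume on the primary."""
        transferred = dict(self.interim)
        self.primary.data.update(transferred)
        self.interim = {}
        self.active = self.primary
        return transferred


pair = DRPair(Site("primary"), Site("recovery"))
pair.write("orders", 100)      # normal production on the primary
pair.replicate()               # recovery site now mirrors the primary
pair.failover()                # triggering incident: switch sites
pair.write("orders", 105)      # changes made while running at the recovery site
pair.write("invoices", 7)
transferred = pair.failback()  # only the two interim changes are sent back
```

After `failback()` the two sites hold the same data, and the primary did not need a full copy of the recovery system, only the entries that changed while it was offline.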

 

Rubrik’s Solution 

Our replication and disaster recovery solution integrates failover and failback operations into a single, comprehensive management framework. Rubrik automates non-disruptive failover testing and application cloud migration to mitigate risk and help meet compliance requirements. Rubrik DR software monitors replication tasks and DR failover status, tracks replication policy compliance, and collects proactive error and warning notifications. We deliver simplified DR orchestration with failover/failback, testing, and cloud migration.

Learn how Rubrik can help protect your data with class-leading replication and disaster recovery solutions.