Understanding RPO and RTO
As enterprises utilize more and more business-critical digital services, information technology infrastructure and applications have become key strategic imperatives. Downtime and data loss translate to a huge business and financial impact that must be minimized with an effective data protection strategy.
When planning for a data protection strategy or a disaster recovery plan (DRP), there are several criteria to consider in order to align with the business impact of various applications and workloads.
A Business Impact Analysis (BIA) can help assess and weigh the financial and non-financial consequences of an interruption in business operations. These findings can help organizations determine their availability Service Level Agreements (SLAs), that is, the level of service the customer expects from the entity that provides the service. Most often, multiple SLAs are defined to match the various levels of criticality that were determined during the BIA.
For example, the following SLAs are commonly utilized:
- 99%, or two 9s, corresponds to 3 days 15 hours and 36 minutes of downtime per year.
- 99.9%, or three 9s, corresponds to 8 hours 45 minutes and 36 seconds of downtime per year.
- 99.99%, or four 9s, corresponds to 52 minutes and 34 seconds of downtime per year.
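The downtime figures above follow directly from the availability percentage. Here is a minimal sketch of the arithmetic, assuming a 365-day year; the function name `allowed_downtime_seconds` is ours, chosen for illustration only.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Return the yearly downtime budget, in seconds, implied by an availability SLA."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return SECONDS_PER_YEAR * (100 - availability_percent) / 100


def format_duration(seconds: float) -> str:
    """Render a number of seconds as days, hours, minutes, and seconds."""
    total = round(seconds)
    days, rest = divmod(total, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{days}d {hours}h {minutes}m {secs}s"


if __name__ == "__main__":
    for sla in (99, 99.9, 99.99):
        print(sla, format_duration(allowed_downtime_seconds(sla)))
```

Each additional 9 divides the downtime budget by ten, which is why the objectives tighten so quickly for critical workloads.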
Availability SLAs are then translated to affordable data loss and unplanned downtime objectives by the IT department or the service provider. Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are two of the most important constructs of an SLA and represent these objectives.
Recovery Point Objective
The RPO represents the maximum amount of data that an organization can afford to lose. To be more accurate, the RPO describes the furthest point back in time to which you can afford to recover. From a data protection standpoint, this usually corresponds to how often data must be backed up or replicated. It can be expressed in days, hours, minutes, or even seconds.
Let’s dive into the details a bit. In the example below, the agreed-upon availability SLA is 99%, which is a little more than 3.5 days of tolerated downtime per year. Because the SLA comprises both the RPO and RTO, the sum of the two must be smaller than 3.5 days. To comply with this SLA, an RTO and an RPO of 24 hours have been defined. It is possible to recover the failed workload from the latest backup, which is less than 24 hours old. Data that was created or modified between the latest backup and the failure event is lost. However, since this amounts to less than 24 hours' worth of data, the Recovery Point Objective is met.
In the example below, the defined RPO is still 24 hours. A failure occurred on Wednesday, but data could not be recovered from the latest backup that was taken on Tuesday evening. The admin has to restore the workload from the previous day’s backup, meaning more than 24 hours' worth of data has been lost. The objective is not met in that case.
To mitigate this risk, backups could be scheduled more often while keeping an RPO of 24 hours, so that falling back to an older recovery point still stays within the objective.
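The two scenarios above can be checked with a few lines of code. This is only a sketch: the timestamps are made up for illustration, and `data_loss_window` and `rpo_met` are hypothetical helper names, not part of any product.

```python
from datetime import datetime, timedelta


def data_loss_window(recovery_points, failure_time):
    """Time between the failure and the most recent usable recovery point before it."""
    usable = [point for point in recovery_points if point <= failure_time]
    if not usable:
        raise ValueError("no recovery point precedes the failure")
    return failure_time - max(usable)


def rpo_met(recovery_points, failure_time, rpo):
    """True when the data lost by restoring the newest usable point stays within the RPO."""
    return data_loss_window(recovery_points, failure_time) <= rpo


# Illustrative timeline: nightly backups at 22:00, failure on Wednesday morning.
monday_backup = datetime(2024, 1, 1, 22, 0)
tuesday_backup = datetime(2024, 1, 2, 22, 0)
failure = datetime(2024, 1, 3, 10, 0)
rpo = timedelta(hours=24)

# Latest backup is usable: 12 hours of data lost, objective met.
print(rpo_met([monday_backup, tuesday_backup], failure, rpo))

# Tuesday's backup cannot be restored: 36 hours of data lost, objective missed.
print(rpo_met([monday_backup], failure, rpo))
```

With backups every 12 hours instead, losing the newest recovery point would still leave one that is less than 24 hours old.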
Most workloads, applications, or data sets with a low to average criticality have an SLA of 99% or less and an RPO of 24 hours or more. It is very common for such workloads to be backed up once or twice a day.
For more critical workloads with an SLA of 99.9% or more, it’s necessary to take the RPO down to a few hours, or even minutes or seconds, to be compliant.
Recovery Time Objective
The RTO corresponds to the maximum time within which a failed workload must be recovered.
However, some applications and services can take some time to start, or workloads can first restart in a degraded mode after being recovered. In such cases, the extra time it takes for recovered workloads to be ready to serve users comes on top of the RTO and is called Work Recovery Time (WRT). The sum of RTO and WRT defines the Maximum Tolerable Downtime (MTD).
In the illustration below, the agreed-upon SLA is still 99%, and the RTO is 24 hours. The failed workload is recovered in less than 24 hours. The Recovery Time Objective is met.
By contrast, in this next figure, the objective is not met because the failed workload took longer to recover than the defined RTO.
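The relationship between RTO, WRT, and MTD, and the check illustrated by the two figures, can be written down as a short sketch. The function names and the sample timestamps are assumptions made for this example.

```python
from datetime import datetime, timedelta


def maximum_tolerable_downtime(rto: timedelta, wrt: timedelta) -> timedelta:
    """MTD is the sum of the Recovery Time Objective and the Work Recovery Time."""
    return rto + wrt


def rto_met(failure_time: datetime, recovered_time: datetime, rto: timedelta) -> bool:
    """True when the workload was recovered within the RTO."""
    if recovered_time < failure_time:
        raise ValueError("recovery cannot precede the failure")
    return recovered_time - failure_time <= rto


rto = timedelta(hours=24)
wrt = timedelta(hours=2)  # illustrative value for application warm-up
failure = datetime(2024, 1, 3, 10, 0)

print(maximum_tolerable_downtime(rto, wrt))                 # 26 hours in total
print(rto_met(failure, datetime(2024, 1, 3, 20, 0), rto))   # recovered after 10 hours
print(rto_met(failure, datetime(2024, 1, 4, 14, 0), rto))   # recovered after 28 hours
```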
There are multiple factors that influence restoration times, including:
- Scope of failure. Failures can occur at different levels of an infrastructure. It’s faster and easier to recover from individual workload failures than from a storage array failure or an entire data center failure, for example. In addition, when a critical piece of hardware such as a storage array goes down, there are multiple ways to overcome the situation. The array itself can be fixed or replaced, or a failover to a disaster recovery infrastructure can be triggered. The latter enables much shorter recovery times than the former.
- Amount and nature of data to recover. In general, the greater the amount of data to restore, the longer the recovery. Likewise, it is usually faster to recover a few files than to recover an entire computer or a large database.
- Backup storage performance. Backup storage, also referred to as secondary storage, is often cheaper than production storage and does not offer the same performance. Recovery times depend on what the backup storage can deliver.
- Network performance. In most cases, data transfers for restore operations go through the network. Organizations that use 10 Gbps (or faster) Ethernet networks for backup and recovery purposes benefit from more bandwidth, which in turn helps reduce recovery times. When data has to be recovered via a network shared by different streams, restore operations may take longer because less bandwidth is available.
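To get a feel for how data volume, backup storage, and network bandwidth interact, here is a rough back-of-the-envelope estimate. It is a deliberately simplified sketch: it assumes the slower of the two links is the only bottleneck and ignores protocol overhead, deduplication, and compression. All parameter names and sample figures are ours.

```python
def estimate_restore_seconds(
    data_bytes: float,
    network_gbps: float,
    storage_mb_per_s: float,
    network_share: float = 1.0,
) -> float:
    """Estimate the transfer time of a restore, limited by the slowest component.

    network_share is the fraction of the link available to the restore
    (1.0 for a dedicated network, less when the link is shared).
    """
    if not 0 < network_share <= 1:
        raise ValueError("network_share must be in (0, 1]")
    network_bytes_per_s = network_gbps * 1e9 / 8 * network_share  # bits to bytes
    storage_bytes_per_s = storage_mb_per_s * 1e6
    bottleneck = min(network_bytes_per_s, storage_bytes_per_s)
    return data_bytes / bottleneck


# 10 TB restored over a dedicated 10 Gbps link from storage that reads at 2,000 MB/s.
dedicated = estimate_restore_seconds(10e12, network_gbps=10, storage_mb_per_s=2000)
# Same restore when only half of the link is available.
shared = estimate_restore_seconds(10e12, network_gbps=10, storage_mb_per_s=2000, network_share=0.5)
print(dedicated / 3600, shared / 3600)  # hours
```

Halving the available bandwidth doubles the estimated transfer time, which is worth comparing against the RTO before an incident rather than during one.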
RPO and RTO, The Rubrik Way
Rubrik has long committed to making data protection simple, secure, reliable, and fast. To enable our customers to improve their RPOs, Rubrik Cloud Data Management (CDM) provides a set of technologies and features that help reduce backup windows and increase the backup frequency:
- Ingestion to optimized storage. Rubrik provides a scalable and distributed file system that spans all disks in the cluster. Combined with erasure coding, its ingestion performance is comparable to flash storage.
- Parallel ingestion. Rubrik clusters comprise a minimum of 3 to 4 nodes, and the solution scales by adding more nodes to the same cluster where everything is distributed. Each node in the cluster can back up multiple workloads concurrently. As a result, the larger the cluster, the more data can be ingested in parallel, and the shorter the backup window.
- Forever incremental. Once the first full backup is done, Rubrik only backs up new and modified data. This also contributes to shorter backup windows and reduces the network bandwidth consumption required.
Rubrik supports advanced integration with Microsoft SQL Server and Oracle. These workloads are backed up at the application level with the help of the Rubrik Backup Service (RBS) deployed on the corresponding servers. This provides a periodic application-consistent backup of individual or all databases, as well as a backup of transaction logs for SQL Server and archive logs for Oracle.
Advanced log processing enables much more frequent point-in-time backups of the most critical databases, where the RPO is much lower, typically down to a few minutes. Such processing can be referred to as near-continuous data protection. The figure below shows an example where application-consistent backups are taken every hour and log backups every 10 minutes.
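The effect of a 10-minute log backup interval on the recovery point can be sketched as follows. This is an illustration of the scheduling arithmetic only, not of how Rubrik implements log processing: it assumes log backups complete instantly at fixed intervals from a hypothetical `schedule_start`.

```python
from datetime import datetime, timedelta


def latest_recovery_point(
    failure_time: datetime, schedule_start: datetime, log_interval: timedelta
) -> datetime:
    """Return the most recent scheduled log backup at or before the failure."""
    if failure_time < schedule_start:
        raise ValueError("failure precedes the first log backup")
    completed_intervals = (failure_time - schedule_start) // log_interval
    return schedule_start + completed_intervals * log_interval


start = datetime(2024, 1, 1, 0, 0)
failure = datetime(2024, 1, 1, 9, 47)
point = latest_recovery_point(failure, start, timedelta(minutes=10))
print(point, failure - point)  # newest recovery point and the resulting data loss
```

Under these assumptions the data loss never exceeds the log backup interval, whereas hourly backups alone could lose up to an hour of transactions.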
When a multi-terabyte VM, virtual disk, or database must be restored, it usually takes a long time before it is available again in the production environment. To address this problem, Rubrik offers these technologies:
- Instant Recovery. Specific to vSphere and Hyper-V VM recovery, this feature can publish and power on a VM image directly from the backup data stored on the Rubrik appliance. This type of restoration is destructive, meaning that the new VM object booted from Rubrik replaces the original one. Instant Recovery eliminates traditional recovery times, bringing near-zero RTO to the most critical VMs.
- Live Mount. Similar to Instant Recovery, Live Mount can publish an entire VM image, an individual VMDK, or an individual SQL Server or Oracle database from the Rubrik appliance back to the production environment. When a Live Mount is performed, the recovered workload is published as a new object and does not overwrite anything.
- Fast storage. Each Rubrik cluster delivers enough storage performance to sustain multiple concurrent Live Mounts or Instant Recoveries. Specifically, each node has a flash drive that is used to cache data published by Instant Recovery and Live Mount.
- Fast networking. Rubrik appliances come with fast and modern network interface cards (10 Gbps and more) in each node. This complements the fast storage and the Live Mount technologies to offer the fastest possible recoveries.
When RPOs and RTOs are known for all workloads and applications, as well as the cost of downtime for a given business, the right decisions can be made to protect data appropriately. The IT department is empowered to select the right technologies and build a suitable strategy around data protection and disaster recovery.
Want to learn more about DR? Check out this blog post on planning a successful disaster recovery strategy.