SLA Domains Explained

In the data protection world, Service Level Agreements (SLAs) define protection levels for workloads, availability targets, and objects that are crucial to a company. Collecting this information, implementing it, and staying compliant with the SLA is usually a tedious and difficult process. Rubrik uses SLA Domains to make those SLAs easier to achieve. 

SLA Domains consist of three components:

  • Snapshot protection and retention 
  • Replication 
  • Archival

When combined, these components protect and help manage the lifecycle of your most critical data. SLA Domains can be set to align with your company’s own SLAs, which makes compliance much simpler. Additionally, Rubrik’s SLA Domain construct is declarative in nature, allowing for a set-it-and-forget-it style of implementation.

While it may seem like a small thing to use declarative policies, there is a major difference between old-style imperative backup jobs and Rubrik’s declarative SLA Domains. To understand the difference, imagine a situation in which you want to give your friend a LEGO set for them to build. The imperative approach would be to sit down with your friend and tell them, step-by-step, exactly how to construct the model. It would be far simpler, and faster, to take a declarative approach by defining what the end state of the model should be and then let them decide how to create it.

Our customers have also saved a ton of time—in many cases, as much as 90%—by utilizing our SLA Domains. One of the biggest benefits of policy-based management is that you can easily align SLAs with business needs and no longer have to undergo the manual process of scheduling jobs. 

SLA Domain Protection

You can add any workload type Rubrik supports to an SLA Domain. Rubrik is continually expanding its support and increasing SLA Domain capabilities. These capabilities are designed to make data protection even simpler and more automated across your environment. One such capability is auto protection, where SLA Domains auto-assign protection in a hierarchical manner to child objects when a parent object is defined as protected. When you assign a top-level item, such as a database server or VMware’s vCenter Server, to an SLA Domain, any virtual machine or database you later add under it inherits that protection.

You can also view SLA Domains on other Rubrik clusters being used for replication. This is done using the Remote Domains page. The only setting you are allowed to modify on the remote SLA Domain is the archival policy. This prevents possible confusion resulting from multiple locations making changes to the SLA Domain snapshot frequencies.

Crafting SLA Domains

SLA Domains have a number of different configuration settings to adapt to different needs. Configuration settings can then vary based on other decisions. Here is a quick primer on SLA Domain configuration options:

Basic SLA Domains
SLA Domains are created differently than legacy backup jobs. Instead of creating the job and then adding all the details, such as what time to start, verification needs, and so on, you tell Rubrik four things:

  • How often you want snapshots taken 
  • How long to keep them 
  • If and when you want snapshots archived 
  • If and when you want snapshots replicated to another cluster

This is how that looks in the UI:

Each main field correlates to a time period. Define a specific window to take the snapshots by using the Snapshot Window field. The Take first full field lets you take the first full snapshot immediately, schedule it, or wait until the first frequency or snapshot window.

Once the snapshot protection is filled out, the Remote Settings link at the bottom allows you to configure archival and replication to another Rubrik cluster. Archival can be sent to one of many different locations.

At a minimum, you need to fill out one of the frequencies, a retention period, and a name for the SLA Domain. Requiring so little information makes SLA Domains flexible enough for creative use cases such as maintenance snapshots: on-demand snapshots you create before performing scheduled maintenance on an object. A maintenance SLA Domain is created with just a single hourly or daily frequency and a retention period, and is used only to manage retention. These snapshots are more useful than other types of snapshots, as you don’t need to worry about delta disks or managing the snapshot later. We will go deeper into this concept in an upcoming post on on-demand snapshots.
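
To make the "what, not how" idea concrete, here is a minimal Python sketch of an SLA Domain as a declarative record, with a check for the minimum described above (a name plus at least one frequency with a retention period). The class names, field names, and units are hypothetical and chosen for illustration only; this is not Rubrik's schema or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Frequency:
    """How often to snapshot and how long to keep each snapshot.

    Units are implied by the field the value is assigned to
    (an assumption made for this sketch).
    """
    every: int
    retain: int


@dataclass(frozen=True)
class SlaDomain:
    """Hypothetical, simplified model of an SLA Domain."""
    name: str
    hourly: Optional[Frequency] = None
    daily: Optional[Frequency] = None
    monthly: Optional[Frequency] = None
    yearly: Optional[Frequency] = None
    archive_after_days: Optional[int] = None  # None means no archival
    replicate_to: Optional[str] = None        # None means no replication

    def validate(self) -> None:
        """Raise ValueError unless the minimum information is present."""
        if not self.name.strip():
            raise ValueError("an SLA Domain needs a name")
        frequencies = [
            f for f in (self.hourly, self.daily, self.monthly, self.yearly)
            if f is not None
        ]
        if not frequencies:
            raise ValueError("at least one frequency with retention is required")
        for f in frequencies:
            if f.every < 1 or f.retain < 1:
                raise ValueError("frequency and retention must be positive")


# A maintenance-style domain: one daily frequency, kept for 7 days,
# with no archival or replication configured.
maintenance = SlaDomain(name="Maintenance 7d", daily=Frequency(every=1, retain=7))
maintenance.validate()
```

Nothing in the record says when a job starts or in what order objects are processed; it only states the desired outcome, which is the declarative property the post describes.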

Advanced SLA Domains
In version 5.0 and later of Rubrik’s Cloud Data Management software (CDM), we added an advanced option that turns on several new fields on the SLA Domain page. Rubrik added these due to requests by customers who needed to better align with business SLA constructs, such as weeks and quarters. These are accessed by toggling the advanced switch on the SLA Domain page. 

These changes let customers be more granular about their backups and when they run. When I was a sysadmin, I ran weeklies on Fridays in order to capture all the week’s work. With the advanced option, any day of the week can be chosen for your weekly frequency, so you can be precise about when snapshots are taken and better match your business SLA needs.

SLA Domain Design Considerations

Some SLA Domains are created to fulfill specific requests or compliance rules. For all other SLA Domains, there are a number of approaches gathered from customers and personal experience that I find useful. Here are three use cases in which SLA domains have proven to make a difference:

Adaptive Backup  
A customer is experiencing high resource utilization in their VMware environment and is currently unable to buy additional hardware due to budget cuts. They need to protect their data, but they also need to make sure that backups introduce no additional load. Backups are difficult to schedule, as their developers launch CPU-intensive tasks at random hours due to the nature of their work schedule. This would make setting up a conventional backup job on other systems difficult. Rubrik software has a feature called Adaptive Backup that addresses this problem.

Adaptive Backup leverages user-configured criteria so that backups occur based on system resource availability. This is off by default and configured under Settings -> System Configuration -> Adaptive Backup. Configurable settings include:

  • Maximum VM I/O Latency (ms)
  • Maximum Datastore I/O Latency (ms)
  • Maximum VM CPU Utilization (%)

If the values exceed the configured limit, Rubrik reschedules the snapshot. Approximately every 15 minutes, Rubrik re-checks the values until they are below the set threshold. If by the third check the required resources are not below the configured thresholds, Rubrik will schedule the backup for the next window. Adaptive Backup can be configured via the UI for VMware, and you can use the API to configure settings for:

  • VMware vSphere
  • Microsoft Hyper-V
  • Microsoft SQL Server
  • Nutanix AHV
  • Filesets
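
The threshold check and retry behavior described above can be sketched as a small decision function. This is only an illustration of the logic as the post describes it, not Rubrik's implementation: the type and function names are hypothetical, and two details are assumptions, namely that a reading equal to the limit counts as acceptable and that the initial check counts as the first of the three.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Iterable


@dataclass(frozen=True)
class Thresholds:
    max_vm_io_latency_ms: float
    max_datastore_io_latency_ms: float
    max_vm_cpu_percent: float


@dataclass(frozen=True)
class Metrics:
    vm_io_latency_ms: float
    datastore_io_latency_ms: float
    vm_cpu_percent: float


def within_limits(sample: Metrics, limits: Thresholds) -> bool:
    """True when no metric exceeds its configured maximum."""
    return (
        sample.vm_io_latency_ms <= limits.max_vm_io_latency_ms
        and sample.datastore_io_latency_ms <= limits.max_datastore_io_latency_ms
        and sample.vm_cpu_percent <= limits.max_vm_cpu_percent
    )


def decide(samples: Iterable[Metrics], limits: Thresholds, max_checks: int = 3) -> str:
    """Return "run" if any of the first `max_checks` samples is within limits.

    Each sample stands for one check, taken roughly 15 minutes apart.
    If none qualifies, the snapshot is deferred to the next window.
    """
    for sample in islice(samples, max_checks):
        if within_limits(sample, limits):
            return "run"
    return "defer"


limits = Thresholds(max_vm_io_latency_ms=20, max_datastore_io_latency_ms=30,
                    max_vm_cpu_percent=90)
busy = Metrics(vm_io_latency_ms=5, datastore_io_latency_ms=8, vm_cpu_percent=97)
idle = Metrics(vm_io_latency_ms=5, datastore_io_latency_ms=8, vm_cpu_percent=40)

print(decide([busy, idle], limits))        # second check passes
print(decide([busy, busy, busy], limits))  # three failed checks
```

In this sketch the first call yields "run" and the second yields "defer", matching the reschedule-then-next-window behavior outlined in the text.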

Generally, Rubrik is well-tuned to minimize impact in the majority of environments. Adaptive Backup is beneficial in cases where you have high resource usage for short bursts of time, which prevents backups or causes slowdowns. If used, start conservatively with your settings. For example, set Maximum VM CPU Utilization to 90%. Configuring these settings conservatively at first will give you a better idea of what threshold is needed to prevent slowdown to production workloads.

Inherited Protection 
Another customer has a fast-growing environment. Because of that growth, they constantly need to add machines to backup jobs. They occasionally forget to do this, which has caused problems on multiple occasions. They need to ensure that all machines are automatically protected. Additionally, some of the VMs are infrastructure-related and need a different level of protection.

One method some of our customers use is to configure a generic SLA Domain that applies to all workloads and then apply tailored SLAs to objects that need a different level of protection. 

Pre-Configured SLA Domains are available for customers who want to save time with their initial configuration. You can also configure your own generic SLA Domain.  

To satisfy the above customer’s needs, they created a generic SLA Domain and applied it to their VMware vCenter Server. This automatically protected all the objects underneath it. For their infrastructure machines, they directly assigned a different SLA Domain that took snapshots more often and then archived that to their cloud provider. A directly assigned SLA Domain overrides the inherited SLA Domain, allowing them to satisfy all their requirements. 

Occasionally there are reasons to block inheritance. Some examples might be test/dev servers or VM desktops. For those cases, Rubrik allows you to block inheritance using the same hierarchy mechanism. You can block a single object or a higher-level object, which then blocks its child objects as well. 
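
The inheritance rules in this use case (inherit from the parent, let a direct assignment override, and allow inheritance to be blocked at any level) can be sketched as a walk up the object hierarchy. The `Node` type, the `BLOCKED` sentinel, and the object names are hypothetical and exist only for this illustration; they are not Rubrik identifiers.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical marker meaning "inheritance is blocked here and below".
BLOCKED = "__blocked__"


@dataclass
class Node:
    name: str
    parent: Optional["Node"] = None
    # An SLA Domain name, BLOCKED, or None (None means "inherit from parent").
    assigned: Optional[str] = None


def effective_sla(node: Node) -> Optional[str]:
    """Return the SLA Domain protecting `node`, or None if unprotected.

    The nearest explicit setting wins, so a direct assignment
    overrides anything inherited from higher up.
    """
    current: Optional[Node] = node
    while current is not None:
        if current.assigned is not None:
            return None if current.assigned == BLOCKED else current.assigned
        current = current.parent
    return None


vcenter = Node("vcenter", assigned="Generic")
cluster = Node("cluster-01", parent=vcenter)
app_vm = Node("app-01", parent=cluster)                   # inherits "Generic"
infra_vm = Node("dns-01", parent=cluster, assigned="Infrastructure")
test_folder = Node("test-dev", parent=vcenter, assigned=BLOCKED)
test_vm = Node("scratch-01", parent=test_folder)          # blocked via its parent

for vm in (app_vm, infra_vm, test_vm):
    print(vm.name, "->", effective_sla(vm))
```

A VM added later under `cluster` needs no extra step in this model: with no assignment of its own, the walk reaches the vCenter-level setting.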

Multi-use SLA Domain 
This customer manages a complex environment that spans multiple types of workloads, including NAS filesets, infrastructure-related machines, and database servers. Although the workloads vary, the customer doesn’t want to spend a lot of time creating and tracking numerous different backup jobs. To protect their data, they create one SLA Domain for their environment. Because SLA Domains come with built-in tuning and versatility, they can use one SLA Domain across all of these workloads.

SLA Domains can easily be applied to multiple object types. Any specific needs (e.g., logs for Microsoft SQL Servers) are handled at the object level, and main protection is handled by the SLA Domain.

Another way this customer might set up their SLA Domains is to create one SLA Domain for each workload type, and then configure frequencies, retention, and archival as needed. This creates a few more SLA Domains but provides the flexibility and level of protection that the business may require.

Finally, I would offer one additional consideration that we implemented in our own lab: name SLA Domains as descriptively and meaningfully as possible. For example, one of our SLA Domain names is CloudOn AWS DND. This particular SLA Domain archives out to AWS and then utilizes CloudOn to convert to a native AMI instance. The “DND” means do not delete. Create a name that describes what the SLA Domain does.

This is just a glimpse into the benefits and use cases for Rubrik SLA Domains. Stay tuned for a deeper dive into the advantages and impact of on-demand snapshots. Want an in-depth look at Rubrik under the hood? Read our white paper.