Managing and Monitoring SLA Domains at Global Scale
In my previous post, I covered the complexities involved in building Service Level Agreements (SLAs) between consumers and providers of an IT service. Much of that friction can be eased by decoupling the intent of the agreed-upon policy from the actual execution of backup jobs. Doing so lets administrators abstract away much of the low-level fuss required to build and maintain data protection and focus instead on adding value at a more strategic level across the organization. Let’s now move the story forward and discuss how consumers can easily determine whether their SLAs are being honored.
At a high level, SLA Domains are constructed using Recovery Point Objective (RPO) and retention values. The RPO essentially asks how much data loss the consumer is willing to tolerate, while the retention input determines how long the provider keeps that data and where it is stored (on-premises or elsewhere). To understand SLA compliance, it’s important to look at the entire set of backup jobs to ensure all facets of the RPO are being met for an application. This goes beyond counting the total backups held by the system, because an RPO is often expressed as a quantity of hourly, daily, weekly, monthly, and yearly backups. A consumer could ask for many recovery point options as part of their SLA, such as a backup every 4 hours along with daily, monthly, and yearly backups. Distilling all of these data points into a simple compliance report is a challenge, especially if the data concerning the backup jobs is held in disparate legacy systems such as tape libraries or catalog servers.
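To make the idea concrete, here is a minimal Python sketch of a multi-granularity compliance check. It is an illustration only and not Rubrik's implementation: the `Rule` type, the function names, and the simplification that each rule expects one recovery point per interval are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Rule:
    """Hypothetical RPO rule: one recovery point expected per interval."""
    name: str
    interval: timedelta  # length of each window, e.g. 4 hours or 1 day
    count: int           # how many of the most recent windows to check


def missing_windows(snapshots, rule, now):
    """Return the start time of every checked window that holds no snapshot."""
    missing = []
    for i in range(rule.count):
        end = now - i * rule.interval
        start = end - rule.interval
        # A window (start, end] is satisfied by any snapshot taken inside it.
        if not any(start < taken <= end for taken in snapshots):
            missing.append(start)
    return missing


def compliance_report(snapshots, rules, now):
    """Map each rule name to the list of windows it failed to cover."""
    return {rule.name: missing_windows(snapshots, rule, now) for rule in rules}


def is_compliant(snapshots, rules, now):
    """An application is compliant only if every rule has no missing window."""
    report = compliance_report(snapshots, rules, now)
    return all(not gaps for gaps in report.values())
```

Note that the raw number of backups never appears in the check. An application with plenty of snapshots can still be out of compliance if a single 4-hour window is empty, which is exactly the distinction the paragraph above draws.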
Rubrik understands SLA compliance. Every application within an SLA Domain is held against the defined policy. As backups age, data is pruned to adhere to the RPO values required by the SLA. If a backup task fails due to issues with the application or hypervisor, the application that is drifting out of its SLA is flagged so that an administrator can remediate it. This approach provides a robust way to truly understand compliance against an SLA, rather than relying on the number of backups or working with backup tools that depend on an additional archival system to prune aging backup points that are no longer necessary. It also gives administrators a fine-grained troubleshooting workflow, since only the specific workload that is experiencing issues needs attention, rather than all workloads adhering to an SLA Domain.
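The pruning step can be sketched with a generic bucket-based retention scheme. To be clear, this is a common textbook approach and not a description of how Rubrik prunes data. The `BUCKETS` table, the `keep` mapping, and both function names are hypothetical.

```python
from datetime import datetime

# Hypothetical bucket keys: snapshots that share a key compete for one slot.
BUCKETS = {
    "hourly": lambda t: (t.year, t.month, t.day, t.hour),
    "daily": lambda t: (t.year, t.month, t.day),
    "monthly": lambda t: (t.year, t.month),
    "yearly": lambda t: (t.year,),
}


def select_retained(snapshots, keep):
    """Pick the newest snapshot in each of the most recent buckets.

    `keep` maps a granularity name to how many recent buckets to retain,
    for example {"daily": 7, "monthly": 12}.
    """
    retained = set()
    newest_first = sorted(snapshots, reverse=True)
    for granularity, count in keep.items():
        bucket_key = BUCKETS[granularity]
        seen = set()
        for taken in newest_first:
            key = bucket_key(taken)
            if key in seen:
                continue  # a newer snapshot already represents this bucket
            if len(seen) >= count:
                break     # enough buckets retained at this granularity
            seen.add(key)
            retained.add(taken)
    return retained


def prune(snapshots, keep):
    """Split snapshots into (kept, expired), each sorted oldest first."""
    retained = select_retained(snapshots, keep)
    expired = set(snapshots) - retained
    return sorted(retained), sorted(expired)
```

A snapshot survives if any granularity still needs it, so one recovery point can serve as both the daily and the monthly copy. Everything else is safe to expire without consulting a separate archival catalog.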
With compliance out of the way, it’s important for administrators to have a strong grip on their utilization and capacity metrics on a per-SLA basis. Often, data protection tools focus on specific backup jobs as their granular unit of reporting or require the use of user-defined tags. And while tags can be handy for manually building metadata across workloads for some use cases, it can be tough for an administrator to get a clear picture of the data being protected without putting in a fair bit of scripting or reporting sweat equity. We’re of the opinion that this time can be better spent elsewhere offering real value to the business, rather than playing a game of hide-and-tag from a hypervisor management interface.
SLA Domains should be first-class citizens. Rubrik exposes them through both its user interface and its API endpoints, with a heavy focus on a friendly user experience. Every SLA Domain provides a rich set of overview data to make at-a-glance visibility a snap. This includes per-SLA storage utilization, policy details, quantity of applications being protected, total data points, and the ability to dig into every data point with real-time search. The goal here is to provide relevant metrics for easy consumption and capacity planning without turning the UI into a confusingly gizmo-laden fighter jet cockpit.
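As a rough illustration of what treating the SLA Domain as the unit of reporting looks like, the following Python sketch rolls per-workload records up into a per-SLA overview. The `Workload` record, its field names, and the example domain names are assumptions for this example and do not reflect Rubrik's data model or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical per-workload record used only for this example."""
    name: str
    sla_domain: str
    stored_bytes: int
    snapshot_count: int


def summarize_by_sla(workloads):
    """Aggregate workload count, storage, and snapshots for each SLA Domain."""
    summary = defaultdict(
        lambda: {"workloads": 0, "stored_bytes": 0, "snapshots": 0}
    )
    for workload in workloads:
        row = summary[workload.sla_domain]
        row["workloads"] += 1
        row["stored_bytes"] += workload.stored_bytes
        row["snapshots"] += workload.snapshot_count
    return dict(summary)
```

Because the SLA Domain is stored on the workload itself, the rollup is a single pass with no tag lookups. Feeding it two workloads in a hypothetical "Gold" domain and one in "Bronze" yields one summary row per domain.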
If an administrator needs deeper details on a virtual machine, Rubrik provides her with a full data protection lifecycle review. She can then tune specific settings such as the SLA Domain assignment, inspect any live snapshots used for Instant Recovery, see how much data resides on-premises versus in the cloud, and review a detailed history of all backup jobs.
Wrapping up the second post in this series, we’ve discussed protecting workloads with simple Service Level Agreements (SLAs) across a global fabric while also maintaining tight control and real-time metrics on SLAs and individual workloads. By abstracting away the complexity of daily operations to an intelligent data protection platform, administrators can finally stop managing backups and start managing backup policy. For more on this topic, catch the upcoming VMworld session STO6287, titled Instant Application Recovery and DevOps Infrastructure for VMware Environments – A Technical Deep Dive, featuring Rubrik’s CTO, Arvind Nithrakashyap, and me at VMworld US and EMEA. Additionally, I plan to finish the series with a third post focused on removing the pain surrounding SLA compliance reporting. Stay tuned!