Successful enterprise architects are able to pull functional design elements from key stakeholders to abstract requirements, constraints, and risks. Much of this work involves translating business needs into technology decisions and then selecting the right vendor solutions to deliver on the design. In this blog post series, I'm going to focus on addressing Service Level Agreements (SLAs) to ensure that the business is equipped with the runway it needs to tackle operational challenges and protect applications. Many organizations that I've consulted with were forced to take a good, hard look at their SLAs (or lack thereof) in order to craft a strategic plan for the future.
At the heart of any quality SLA is fairness. Both parties – the consumer and the provider of a service – must agree on a mutually beneficial statement for long-term success. The end goal is to abstract the minutiae of a technical design away from the consumer. Take this WordPress platform, for example: I really don't concern myself with the back-end infrastructure, I just want to consume the service and know that it's being protected. An SLA is a method for me to define guard rails around data loss and availability while still allowing the service provider to determine the specific method of implementation.
SLAs are a challenge to establish. In many cases, applications are introduced to the data center on a per-project basis without establishing an overarching strategy across the enterprise application portfolio. Because each application is chained to a snowflake SLA with unique Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO), IT is left holding the bag to build complex protection tasks across physical and virtual workloads. As the business grows, so does the effort needed to manage and maintain the SLA complexity. In other shops, SLAs are expressed as loose intentions without the needed documentation and strategic buy-in to give them teeth. It's common to hear "everything needs zero downtime" or other unrealistic requirements thrown around when consuming private data center space.
Experience favors building a small number of SLAs and offering them to the business as part of a service-oriented architecture. It's much like a dinner menu for a party of guests: offer up a few different options for the guests to select from, and then move forward with serving the best food possible. Rubrik's converged data management platform offers three default SLA Domains based on metal tiers – gold, silver, and bronze – with corresponding RPO values to define hourly, daily, monthly, and yearly data protection. If there are no SLAs defined today, these make great starting points for a conversation with your stakeholders. Otherwise, IT can edit the RPO values of the default SLAs or even add new ones.
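To make the tiered-menu idea concrete, here's a minimal sketch of what a small catalog of SLA tiers might look like as data. The tier names echo the gold/silver/bronze defaults mentioned above, but the specific RPO and retention numbers are illustrative assumptions, not Rubrik's actual shipping values:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SlaTier:
    """A hypothetical SLA tier: how often to protect, and how long to retain."""
    name: str
    rpo_hours: int            # worst-case data loss window
    keep_daily_days: int      # retain daily snapshots for this many days
    keep_monthly_months: int  # retain monthly snapshots for this many months
    keep_yearly_years: int    # retain yearly snapshots for this many years

# Illustrative defaults only -- real values would come from the stakeholder conversation.
TIERS = {
    "gold":   SlaTier("gold",   rpo_hours=1,  keep_daily_days=30, keep_monthly_months=12, keep_yearly_years=3),
    "silver": SlaTier("silver", rpo_hours=4,  keep_daily_days=14, keep_monthly_months=6,  keep_yearly_years=1),
    "bronze": SlaTier("bronze", rpo_hours=24, keep_daily_days=7,  keep_monthly_months=3,  keep_yearly_years=1),
}
```

The point of keeping the catalog this small is the dinner-menu effect: three well-understood choices are far easier to negotiate, document, and measure than dozens of snowflake agreements.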
Let’s assume that at this point both the application owner and IT have agreed upon a standard set of SLAs. Their next step is to translate these SLAs into backup jobs that protect the applications within the data center. It’s an arduous task because legacy data protection platforms couple execution with policy. Each backup job must be explicitly told when to run, how often to run, which applications need to be protected, data retention values, and so on. It would be like having to tell your car’s engine, transmission, brakes, dashboard, and electronic systems what to do in order to drive down the highway. We all just want to put our foot on the gas and go, and let the car’s mechanical and electrical systems figure out the rest. Data protection shouldn’t be any different.
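The coupling of execution and policy described above shows up in everything a legacy job definition forces the administrator to spell out. The sketch below is hypothetical and doesn't reflect any particular vendor's format; it simply contrasts an imperative job with a declarative policy:

```python
# A hypothetical legacy backup job: schedule, targets, and retention are all
# welded into one imperative definition that must be maintained per job.
legacy_job = {
    "name": "nightly-erp-backup",
    "schedule": "0 2 * * *",                  # when to run (cron syntax)
    "targets": ["erp-db-01", "erp-app-01"],   # which applications to protect
    "retention_days": 30,                     # how long to keep the data
    "full_every_n_runs": 7,                   # full vs. incremental cadence
}

# A declarative policy states the objective and leaves the mechanics to the system.
policy = {"name": "gold", "rpo_hours": 1, "retention_days": 30}
```

Multiply the first structure by hundreds of applications and the maintenance burden is obvious; the second scales because one policy can cover any number of workloads.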
Rubrik is the gas pedal for data protection. An administrator feeds the protection policy specifications into the system as an SLA Domain, which tells the system about the RPO and retention values needed to satisfy an SLA. The SLA Domain is then overlaid upon the applications within the data center that need protection.
For example, a Gold SLA might be applied to critical production workloads in which the business owners have asked for an RPO of 1 hour. Rubrik’s software inventories the workloads, examines the Gold SLA policy, and then takes care of scheduling jobs across all of the Briks within the Rubrik fabric.
You've stopped managing backup jobs (complex) and started managing backup policy (simple). A similar paradigm exists in the server world with configuration management platforms (Ansible, Puppet, Chef, etc.), so why not with data protection, too?
The sticky part of an SLA is measurement. Without key metrics being monitored and measured against an SLA, there is no way to fairly assess the financial and technical efficiency of the services being provided. In my next post, I’ll dive deeper into how we focus on holistic metrics for all of the applications across the data center and SLA Domains.