At Rubrik, every product release aims to add capabilities that address our customers’ current pain points with enterprise data management. We started out by protecting VMware vSphere environments, and now support more than 20 platforms. While we are rapidly expanding to more platforms, we are equally focused on refining our current solutions based on real-world customer stories. One such innovation is adaptive data consistency for virtual machines, a capability that drastically improves the lives of enterprise backup admins.
Before diving into adaptive data consistency, let’s go through the basics. Data consistency in recovery points is broadly classified into three categories: inconsistent, crash-consistent, and app-consistent.
An inconsistent recovery point is taken with zero pre-work. It is not suitable for data sets that contain complex and interdependent relationships, as only data captured on disk is backed up. The in-memory changes are not captured, so it won’t be a true representation of the data in the system for that point in time.
A crash-consistent recovery point, when restored, gives you the state of the data from the time of the backup. All the data is captured at the same time, but I/O operations and transactions in process may not be captured. For most modern applications, crash-consistent recovery points may be sufficient.
An app-consistent recovery point captures all of the data simultaneously, just like a crash-consistent recovery point. But it also waits for the applications to flush I/O operations and transactions in process. For apps running in Microsoft Windows-based operating systems, the Volume Shadow Copy Service (VSS) helps with creating app-consistent snapshots. For Linux-based systems, the applications may have native tools and/or file system sync tools to help with creating app-consistent recovery points.
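To make the distinction concrete, here is a minimal Python sketch of the three categories described above. The names `Traits`, `ConsistencyLevel`, and `meets` are hypothetical and exist only for illustration; they are not part of any Rubrik API.

```python
from enum import Enum
from typing import NamedTuple


class Traits(NamedTuple):
    # All data is captured at the same instant.
    single_point_in_time: bool
    # Applications flushed in-process I/O and transactions first.
    in_flight_work_flushed: bool


class ConsistencyLevel(Enum):
    # Hypothetical model of the three categories in the text.
    INCONSISTENT = Traits(False, False)
    CRASH_CONSISTENT = Traits(True, False)
    APP_CONSISTENT = Traits(True, True)


def meets(level: ConsistencyLevel, *, need_point_in_time: bool, need_flush: bool) -> bool:
    """Return True if the level satisfies the stated recovery requirements."""
    traits = level.value
    if need_point_in_time and not traits.single_point_in_time:
        return False
    if need_flush and not traits.in_flight_work_flushed:
        return False
    return True
```

For example, an application that only needs a point-in-time image is satisfied by `ConsistencyLevel.CRASH_CONSISTENT`, while one that needs in-process transactions flushed requires `ConsistencyLevel.APP_CONSISTENT`.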
When talking to our customers, we discovered that backup administrators had been dealing with two key pain points in this area.
Pain Point #1: Too Many Knobs to Turn and Tweak
Traditional backup solutions require administrators to make decisions on consistency of recovery points in advance. Several ‘knobs’ are provided in backup job definitions, application processing settings, guest operating system settings, and so on, which allow the administrator to make decisions on a per-job or per-VM basis. While this method is flexible, it can get overwhelming, especially when you need to protect hundreds of VMs. Furthermore, human errors in setting up a backup job or agent attribute can be costly: because the user is expected to make each choice deliberately, these solutions provide no reporting on data consistency.
Pain Point #2: Secure or Easy to Use?
Traditional backups also force administrators to decide the security profile of the guest operating systems in advance. Most of the VM backup solutions in the market today provide agentless operation if User Account Control (UAC) is disabled in the guest OS. If UAC is enabled, the backups fail or the backup solution blindly performs a crash-consistent backup of the VM. In order to perform an application-consistent backup of UAC-enabled VMs, you must use separate backup jobs with agent-centric workflows. Or you have to expose elevated credentials of the VMs to the backup system, which defeats the very purpose of securing the VM using UAC! Now, imagine making this decision for hundreds of VMs.
To address these two problems, Rubrik designed adaptive data consistency.
As you may already know, Rubrik features a custom VSS provider for Windows VMs running in VMware vSphere. There are several enhancements in this custom VSS provider designed to leverage Rubrik’s scale-out architecture. Furthermore, this custom agent handles SQL Server and Exchange differently; it does not break application-level backups of SQL Server and performs application-consistent log truncation for Exchange.
With our Rubrik Alta 4.1 release (also made available in Alta 4.0.4), our solution eliminates the two pain points above. And the best part is you don’t need to do anything to make use of Rubrik’s adaptive data consistency methodology! The system adaptively determines the best possible data consistency path for the VM. If the chosen path is not the best, you are notified so that you can perform corrective actions.
How does it all work under the hood? Rubrik attempts to contact the VM over the network to see if Rubrik Backup Service (RBS) is listening on the VM. If RBS is listening, our platform orchestrates an app-consistent backup by making use of a custom VSS agent that is already installed as part of RBS. This agent-assisted method has the following benefits:
- Unlike traditional backup solutions, the administrator does not have to make complex decisions on data consistency, security (UAC), and job setup.
- The method works in both UAC enabled and disabled environments.
- The method avoids the delays generally seen when vSphere VIX APIs are used solely for the purpose of creating app-consistent snapshots.
- The actual data movement occurs via NBDSSL transport through the ESXi server, so ingest is still agentless.
Additionally, if RBS is not listening, Rubrik has the ability to auto-install RBS on the VM. This capability is turned off by default, as it may be considered intrusive to install persistent binaries on a production system. If needed, this capability may be turned on temporarily to streamline the installation process. As this process requires UAC to be temporarily disabled, it may not be viable in some environments. Our recommendation is to use automation tools like Group Policies, Chef, Puppet or something similar to automate RBS installation.
If RBS is not active or cannot be installed, Rubrik pushes an ephemeral agent into the guest OS using vSphere VIX APIs. This ephemeral agent includes the custom VSS provider, which Rubrik uses to attempt an app-consistent snapshot of the VM. If this relatively slow VIX-based interaction makes it impossible to finish the VSS snapshot within the (Microsoft-mandated) 10-second VSS window, Rubrik notifies the administrator and proceeds with a crash-consistent backup of the VM. The administrator can then take corrective action based on the notification received.
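The decision flow described above can be sketched roughly as follows. This is an illustrative Python model, not Rubrik's implementation: `VmState`, `Decision`, `choose_path`, and the path labels are hypothetical names, and the sketch assumes that a notification is sent only when the backup falls back to crash consistency.

```python
from dataclasses import dataclass
from enum import Enum


class Consistency(Enum):
    APP_CONSISTENT = "app-consistent"
    CRASH_CONSISTENT = "crash-consistent"


# VSS window mentioned in the text, in seconds.
VSS_WINDOW_SECONDS = 10.0


@dataclass
class VmState:
    # Hypothetical inputs; a real system would probe these at backup time.
    rbs_listening: bool          # RBS answers when contacted over the network
    auto_install_enabled: bool   # off by default, per the text
    auto_install_succeeds: bool  # e.g. False when UAC cannot be disabled
    vix_vss_seconds: float       # time the VIX-driven VSS snapshot would take


@dataclass
class Decision:
    path: str
    consistency: Consistency
    notify_admin: bool


def choose_path(vm: VmState) -> Decision:
    """Pick the best available consistency path for one VM."""
    # 1. Preferred: RBS is already listening on the VM.
    if vm.rbs_listening:
        return Decision("rbs-agent", Consistency.APP_CONSISTENT, False)
    # 2. Optional: auto-install RBS if the administrator turned that on.
    if vm.auto_install_enabled and vm.auto_install_succeeds:
        return Decision("rbs-agent", Consistency.APP_CONSISTENT, False)
    # 3. Fallback: ephemeral agent pushed through VIX.
    if vm.vix_vss_seconds <= VSS_WINDOW_SECONDS:
        return Decision("ephemeral-vix-agent", Consistency.APP_CONSISTENT, False)
    # 4. The VSS window was missed: crash-consistent backup plus a notification.
    return Decision("ephemeral-vix-agent", Consistency.CRASH_CONSISTENT, True)
```

For instance, `choose_path(VmState(False, False, False, 14.0))` models a VM without RBS whose VIX-driven snapshot overruns the window, so the result is a crash-consistent backup with `notify_admin` set.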
Why is this adaptive capability to provide best effort data consistency a big deal? Let’s consider the two main backup admin pain points. Manually tweaking job settings, guest and application processing settings, client attributes, consistency choices, and more is doable with a handful of VMs. But in most cases, you also need to work with application admins/owners of the VMs to choose an approach that meets their recovery requirements. As your environment scales to hundreds of VMs, this is not only hard to do, but the cost of making a mistake can be very high.
Thanks to Rubrik’s adaptive approach, the time consuming but extremely critical decisions on VM and app data consistency are now in self-learning mode. Combined with auto-discovery and the adaptive throttling already baked into the product, Rubrik’s self-driving capabilities got even better!
Want to learn more? Read why we built our own custom VSS provider.