The automation of repetitive tasks to ensure consistency and reliability is generally a great focus for technical professionals. This frees up time to tackle other initiatives that help differentiate your business while also limiting the amount of context switching required to get through various projects. The net result is an increase in speed and a more satisfying work experience.

Rubrik’s SLA Domains provide a great deal of automation by scheduling all of the policy-driven snapshots required to meet your defined recovery point objectives (RPOs) for various workloads. However, there are times when ingesting data using an on-demand snapshot makes more sense than relying on the intelligent scheduler baked into Rubrik’s software stack. I’ve spoken with many customers who are actively leveraging on-demand snapshots to tackle some interesting workflows, both manual and automated.

In this post, I will dive deep into the architecture, design considerations, and use cases for on-demand snapshots as part of a larger series on SLA Domains. If you’re new to this technology, check out Mike Wilson’s post that introduces the concept of SLA Domains.

The Advantages of On-Demand Snapshots

Almost all systems have a mechanism in place to store state. This ranges from hypervisor snapshots, such as VMware vSphere snapshots on virtual machines, to simple copy operations, such as Microsoft SQL Copy Only backups. These are helpful tools for ensuring that state can be extracted and ultimately restored at a future date. Rubrik’s on-demand snapshots offer a powerful alternative to these methods.

Here are some of the advantages of using Rubrik snapshots:

  1. There is no redirection of write I/O and no creation of a delta disk. Production workloads remain preserved as-is without operational performance penalties.
  2. Snapshots can be used for other operations, such as creating Live Mounts (zero-space clones) that other team members can iterate upon or use to generate new workloads. Frank Denneman has written some interesting thoughts on this as well.
  3. On-demand snapshots can use an SLA Domain or opt for “Forever” retention. Data captured as “Forever” has no defined expiration date, is managed in the Snapshot Retention page, and can be purged manually at will.
  4. For Microsoft SQL, snapshots can be created at an individual or group level. Group snapshots are also available for multiple databases from different SQL Server hosts or instances. On-demand snapshots for grouped SQL Server databases reduce the overhead of individual database snapshot creation.
  5. For Oracle Database, both on-demand database backups and on-demand log backups are captured seamlessly. See the Oracle Databases section of the Rubrik CDM Version 5.0 User Guide for more details.
  6. For VMware vCloud Director (vCD), on-demand snapshots can be taken of the entire vApp or of individual VMs within the vApp. This can even be triggered via our vCD integration, as described by Rebecca Fitzhugh.

These are powerful reasons to consider on-demand snapshots from Rubrik. It’s also important to consider compliance.

SLA Domain Compliance

When an SLA Domain is assigned to a workload, the workload receives an SLA Compliance rating. Missing a policy-based snapshot, which is scheduled by the SLA Domain itself, may cause the workload to go out of compliance if all attempts fail. This is expected behavior: it highlights trouble spots to you and your team, along with details of the activities performed, so that you can remediate them.

Since on-demand snapshots are not triggered by the SLA Domain managing the workload, they are not considered for the purposes of SLA Compliance. If you want on-demand snapshots to count, Rubrik Support can make a small configuration change to include them in the SLA Compliance of protected objects.

Adaptive Backup Settings

Another key design consideration is the use of Adaptive Backup. When enabled, this feature works hand-in-hand with the SLA Domain policy to ensure that snapshots are only taken when the target workload is within tolerance of defined resource limits. It affects the start time of both policy-based snapshots and on-demand snapshots.

As an architect, you have a series of configuration values to consider:

  • VM IO Latency (milliseconds): Latency measures the time taken to process a SCSI command issued by the guest OS to the VM.
  • Datastore IO Latency (milliseconds): Average amount of time taken during the collection interval to process a SCSI command issued by the guest OS to the datastore(s). If virtual disks are excluded from protection, those datastores are excluded from the calculation assuming no other protected virtual disks are sharing the datastore.
  • VM CPU Utilization (percentage): Amount of actively used virtual CPU, as a percentage of total available CPU. This is the host’s view of the CPU usage, not the guest OS view.

Settings for these will vary depending on the use case, and it is advised to enable the feature only when it is needed. As a general rule, latency for storage-related activities should be below 10ms (high-water mark) and often below 5ms (typical). CPU utilization will largely depend on your applications, VM configuration, and physical resource utilization, with most workloads being below 95% (high-water mark) and often below 85% (typical).

Note: Datastore IO latency limits require level 3 metrics from the associated vCenter server. To use datastore IO latency limits, enable level 3 or higher metrics on your vCenter servers.

When the Adaptive Backup settings are enabled, the Rubrik cluster performs an Adaptive Backup settings check before starting an on-demand snapshot. When a value exceeds the configured limit, the Rubrik cluster reschedules the on-demand snapshot. After approximately 15 minutes, the Rubrik cluster checks the values again. When the values are below the limits, the Rubrik cluster initiates the on-demand snapshot.

img

The Rubrik cluster continues to reschedule the on-demand snapshot until the values for the VM are below the configured limits. When the values are below the limits, the Rubrik cluster completes the on-demand snapshot.
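
The check-and-reschedule loop described above can be sketched in a few lines of Python. This is only an illustration of the logic as the text describes it, not Rubrik’s implementation: the `Limits` and `Metrics` classes, their field names, and the `minutes_until_start` helper are all hypothetical, and the 15-minute interval is the approximate value quoted above.

```python
from dataclasses import dataclass
from typing import List, Optional

RECHECK_MINUTES = 15  # "approximately 15 minutes" per the text; assumed exact here


@dataclass
class Limits:
    """Hypothetical Adaptive Backup thresholds; None means not configured."""
    vm_io_latency_ms: Optional[float] = None
    datastore_io_latency_ms: Optional[float] = None
    vm_cpu_percent: Optional[float] = None


@dataclass
class Metrics:
    """One sample of the values compared against Limits at each check."""
    vm_io_latency_ms: float
    datastore_io_latency_ms: float
    vm_cpu_percent: float


def exceeded(metrics: Metrics, limits: Limits) -> List[str]:
    """Return the name of every configured limit that the sample exceeds."""
    names = ("vm_io_latency_ms", "datastore_io_latency_ms", "vm_cpu_percent")
    return [
        name for name in names
        if getattr(limits, name) is not None
        and getattr(metrics, name) > getattr(limits, name)
    ]


def minutes_until_start(samples: List[Metrics], limits: Limits) -> Optional[int]:
    """Simulate one check per sample, spaced RECHECK_MINUTES apart.

    Returns the delay before the snapshot starts, or None if every sample
    provided exceeded a limit (the snapshot is still being rescheduled).
    """
    for attempt, sample in enumerate(samples):
        if not exceeded(sample, limits):
            return attempt * RECHECK_MINUTES
    return None


# Made-up sample values: the first two checks exceed a limit, the third passes.
limits = Limits(vm_io_latency_ms=10, vm_cpu_percent=95)
samples = [
    Metrics(14.0, 3.0, 60.0),  # VM IO latency too high
    Metrics(6.0, 3.0, 97.0),   # CPU utilization too high
    Metrics(4.0, 3.0, 70.0),   # within limits, snapshot starts
]
print(minutes_until_start(samples, limits))  # 30
```

Leaving a limit as `None` mirrors the advice to enable only the thresholds you actually need: an unconfigured limit never delays the snapshot.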

The Impact of On-Demand Snapshots

Most workloads will have an SLA Domain assigned. This assignment ensures that the internal logic layers of the Rubrik CDM software schedule the appropriate number of snapshots at the correct times so that data is recoverable. This equates to a chain of fingerprints that detail what data to retain for the workload across local, remote, and archive locations.

On-demand snapshots have their own chain of fingerprints to ensure that there is no impact on the recoverable chain(s) generated by the SLA Domain. This introduces a design consideration when wanting to archive data captured by an on-demand snapshot. The first snapshot in the on-demand snapshot chain will need to be deduplicated and then uploaded in full to the archive location.

img

If you are building long chains of snapshots, the impact is minimal. However, for ephemeral maintenance snapshots, it’s often best to use an SLA Domain limited to short-lived local storage and replication. See the “Maintenance Protection” section below for more details.

The Snapshot Retention section houses powerful options for on-demand snapshots, such as changing the Retention SLA Domain and purging unwanted data.

Snapshot Retention

The Snapshot Retention page contains information on all snapshots, both policy-based and on-demand, for all workloads known to the Rubrik cluster. For example, my CWAHL-SQL workload is running on a VMware vSphere VM using Windows Server 2016 with an instance of SQL Server installed. I have taken several on-demand snapshots using a short-lived SLA Domain designed for maintenance work as shown below:

img

Later, I discovered that one of the snapshots was important because it fixed a critical issue related to my server. I can revisit the Snapshot Retention section and assign a long-lived SLA Domain to that last snapshot from 3:05 PM to ensure it is retained and archived for a long period of time.

img

Additionally, the tooltip within the snapshot window of the workload reflects the updated SLA Domain associated with the on-demand snapshot. The UI team strives to make it easy to get metadata on different operations regardless of the view.

img

The remaining maintenance snapshots will automatically purge themselves after 7 days since that is the retention value specified by that particular SLA Domain.

Putting On-Demand Snapshots to Work

Let’s focus on use cases for on-demand snapshots that highlight how powerful you and your team become with this feature.

Ad-Hoc Data Protection

Introduce a new workload into the environment and take a snapshot without assigning an SLA Domain. This requires one step:

  • Search for your workload (server, database, file shares, and so forth) and use the Take On Demand Snapshot button.
  • Alternatively, use one of our SDKs to request an on-demand snapshot for your workload.
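
For the scripted route, here is a minimal sketch that builds, but does not send, an HTTP request for an on-demand snapshot of a VMware VM using only the Python standard library. The endpoint path, the `slaId` body field, and the bearer-token header are assumptions about the Rubrik REST API that you should verify against the API documentation for your CDM version; the cluster address, IDs, and token are placeholders.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional


def build_snapshot_request(
    cluster: str, vm_id: str, token: str, sla_id: Optional[str] = None
) -> urllib.request.Request:
    """Build (but do not send) a POST asking for an on-demand VM snapshot.

    Assumed, not confirmed by this post: the /api/v1/vmware/vm/{id}/snapshot
    path and the optional "slaId" field in the JSON body.
    """
    body = {"slaId": sla_id} if sla_id else {}
    encoded_id = urllib.parse.quote(vm_id, safe="")  # IDs may contain ':'
    return urllib.request.Request(
        url=f"https://{cluster}/api/v1/vmware/vm/{encoded_id}/snapshot",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_snapshot_request(
    cluster="rubrik.example.com",         # placeholder cluster address
    vm_id="VirtualMachine:::example-id",  # placeholder object ID
    token="example-api-token",            # placeholder API token
    sla_id="example-sla-id",              # placeholder SLA Domain ID
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Keeping the request construction separate from the network call makes the function easy to unit test and easy to drop into an existing automation pipeline.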

This provides additional layers of recovery as you set up the workload, configure the services, and perform tasks necessary to get started. This is especially helpful in pre-production environments where you and your team may be less familiar with the workload’s requirements and dependencies.

Data will be retained according to the policy components within the SLA Domain associated with the on-demand snapshot(s), giving you easy opportunities to restore to earlier versions or create a Live Mount so other team members can collaborate.

Maintenance Protection

Like many customers, I prefer to take a fresh on-demand snapshot just prior to making changes to a workload. This gives you an easy restore point to a known-good state. Additionally, you gain much more value with a backup than you do with a hypervisor snapshot. When taking a backup, there is no need to redirect write operations to a secondary disk image, nor to consolidate those writes back to the main disk image when removing the snapshot. Plus, you won’t have to hunt down rogue snapshots later.

For this, I suggest creating a custom SLA Domain for on-demand maintenance work. With on-demand snapshots, only the maximum retention and remote configuration settings of the associated SLA Domain are applicable. This is because the scheduler portion is not applicable when creating a snapshot on-demand – you are the scheduler!

In the lab, I have an SLA Domain named Maintenance DND (Do Not Delete). This string is appended to SLA Domains in our internal shared environment as a simple flag for long-lived objects that should not be purged. The SLA Domain is configured with 7 days of retention along with replication to a Rubrik Cloud Edition running in Azure US West 2 (Washington). The replication setting provides an additional layer of protection should something happen to my local Rubrik cluster in California.

img

When it’s time to make a change to a workload, trigger an on-demand snapshot using the Maintenance SLA Domain to ensure that there is a strong recovery point ready. The snapshot data will automatically be removed after the retention period expires (in this case, 7 days), negating the need to go back and clean up later.
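
To make the “you are the scheduler” point concrete, the sketch below models which parts of an SLA Domain carry over to an on-demand snapshot. The `SlaDomain` class, its field names, the `on_demand_plan` helper, and the example values (including the date) are hypothetical simplifications for illustration, not Rubrik’s data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class SlaDomain:
    """Hypothetical, simplified model of an SLA Domain."""
    name: str
    snapshot_frequency_hours: int      # scheduler setting
    max_retention_days: int            # retention setting
    replication_target: Optional[str]  # remote setting


def on_demand_plan(sla: SlaDomain, taken_at: datetime) -> dict:
    """Return the settings that apply to an on-demand snapshot.

    The frequency field is deliberately ignored: whoever requests the
    snapshot decides when it happens.
    """
    return {
        "sla": sla.name,
        "expires_at": taken_at + timedelta(days=sla.max_retention_days),
        "replicate_to": sla.replication_target,
    }


# Example values only: a 7-day maintenance policy with one replication target.
maintenance = SlaDomain("Maintenance", 24, 7, "cloud-cluster")
plan = on_demand_plan(maintenance, datetime(2019, 1, 1, 15, 5))
print(plan["expires_at"])  # 2019-01-08 15:05:00
```

A model like this is handy in change-automation scripts that need to report when a maintenance restore point will expire.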

For some customers, on-demand snapshots are part of their change process. In one example, any Jira request to deploy a new service or code onto a server will automatically require an on-demand snapshot before the change is approved. This step is automated and reported in the request, removing the chance of human error and the effort required to take an on-demand snapshot.

Deployment of the Rubrik Backup Service (RBS) Connector

Environments that want to leverage the Rubrik Backup Service (RBS) can use a manual or automated deployment option. Visit Application Configuration > Guest OS Settings > Connector Settings to change the option.

  • Manual: Deployment is done by installing the service by hand or by using a package deployment service, such as Microsoft System Center or Red Hat Satellite. 
  • Automatic: Deployment is performed by the Rubrik cluster when taking a policy-based snapshot or on-demand snapshot.

When the automatic option is enabled, on-demand snapshots can tackle the deployment of the connector on your behalf. View the Activity Details of the workload to see information on the process.

img

With the connector deployed, the workload now has a library of different quiesce methods for different applications. This works even if the workload has no SLA Domain assigned to it.

Conclusion

While the ability to take a snapshot on-demand may seem like a minor feature, the architecture of Rubrik’s platform makes it fairly versatile in the hands of a technical professional who understands the power of data. Most of the customers that I spoke with find the maintenance and DevOps-centric use cases to be the most popular in their organizations because they remove several layers of risk when introducing change, such as patches or upgrades, into their environment. As more elements of data are captured, indexed, and made immutable for automated workflows, the importance of having a tight grasp on your data life cycle will increase. Please do give some of the design elements in this post a try and let me know how it works out for you.

Do you have a great idea that wasn’t mentioned here? Let me know about it so I can feature your architecture in the future!