Since day one, Rubrik has been built with the user in mind, delivering ultimate simplicity and comprehensive data protection at enterprise scale. That’s why we use our own platform to manage and protect our mission-critical data, which lets us experience firsthand how our technology accelerates product development and automation.
However, just like any other organization, we at Rubrik are still susceptible to unpredictable hiccups and disasters. We recently encountered one of these issues during a product release cycle. With Rubrik, we were able to quickly meet our time-to-market goals and minimize business disruption and engineering productivity loss. Here’s our story of how Rubrik came to the rescue:
At Rubrik, one of our core values is velocity, and that applies to our engineering teams as well. Our engineering teams use our home-grown orchestration layer to allocate precious infrastructure resources for functional, integration, and regression tests, which enables us to deliver a quality product at lightning speed. This orchestration layer and its services run on top of an enterprise SAN (a hybrid flash-based array) that stores the test data and builds. Recently, however, this storage layer behaved erratically, significantly slowing down our build times and release pipelines. Our IT infrastructure teams quickly detected and triaged the issue, then restored the services after evaluating several recovery options.
The Rubrik Live Mount capability delivers instant clones for easy disaster recovery without impacting production environments. As part of the triage, we evaluated these three options:
- VM Live Mount: Perform a Live Mount of critical VMs and run them directly from our backup appliance. This is the simplest method to quickly restore a service or server.
- Database Live Mount: Perform a database application Live Mount by mounting the database directly to another database server running on a high-performance SAN. This option applies only to database services and helps us recover and restore them.
- Cloud Live Mount: Fail over critical VMs and services to the public cloud. This option is best suited to a site-level failure or disaster in which multiple servers or services are affected, and is ideally carried out as part of a business continuity plan (BCP).
After weighing these options, we quickly zeroed in on VM Live Mount to achieve our mean-time-to-recover (MTTR) goals.
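The trade-offs among the three options can be summarized as a simple decision rule. The sketch below is only an illustration of that reasoning, not Rubrik's API or our actual runbook: the `Incident` fields, the `RecoveryOption` enum, and the `choose_recovery_option` function are all hypothetical names introduced here, and the rule assumes an incident can be described by just two flags.

```python
from dataclasses import dataclass
from enum import Enum


class RecoveryOption(Enum):
    """The three Live Mount options described above (hypothetical enum)."""
    VM_LIVE_MOUNT = "VM Live Mount"
    DATABASE_LIVE_MOUNT = "Database Live Mount"
    CLOUD_LIVE_MOUNT = "Cloud Live Mount"


@dataclass
class Incident:
    """Hypothetical, simplified description of an incident's scope."""
    site_level_failure: bool       # multiple servers/services at a site are down
    database_services_only: bool   # only database services are affected


def choose_recovery_option(incident: Incident) -> RecoveryOption:
    """Pick a recovery option using the scope rules from the list above."""
    # A site-level failure calls for failing over to the public cloud.
    if incident.site_level_failure:
        return RecoveryOption.CLOUD_LIVE_MOUNT
    # Database Live Mount applies only when database services are the problem.
    if incident.database_services_only:
        return RecoveryOption.DATABASE_LIVE_MOUNT
    # Otherwise, running VMs from the backup appliance is the simplest path.
    return RecoveryOption.VM_LIVE_MOUNT


# Assumed modeling of an incident like ours: not site-wide, not database-only.
storage_incident = Incident(site_level_failure=False, database_services_only=False)
print(choose_recovery_option(storage_incident).value)
```

Under these assumptions, an incident that is neither site-wide nor limited to databases falls through to VM Live Mount, which matches the choice we made.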
For our infrastructure operations teams, the best practice is to perform a thorough post mortem after recovering from an incident. The post mortem traced the impact to degraded SAN performance: the workloads running on the array were seeing increased latency, which brought our response times and pipelines to a screeching halt. Upon further research, we found that the storage array had a bug in which garbage collection wasn’t kicking in as expected. In addition, the garbage collection schedule was competing with our workloads for the array’s resources, impeding overall performance.
Life happens. Disasters strike. But with Rubrik, say ‘bye’ to all-nighters and ‘hello’ to weekends. Stay tuned for more stories on how we’re using our own technology to unlock new value from our data. And check out this blog post to learn more about Rubrik on Rubrik.