Architecture

Erasure Coding or: How Rubrik Doubled the Capacity of Your Cluster

At Rubrik, we’re big believers in data protection. But until we’re able to take consistent snapshots of our brain state and upload them to the promised hierarchical neural interconnect, we’re going to focus on backing up the more traditional machines, the ones whose smooth functioning will enable this cause. Any complete backup solution needs a distributed, scalable, fault-tolerant file system. Rubrik’s is Atlas, which made the switch from triple-mirrored encoding to a Reed-Solomon encoding scheme during our Firefly release. To help you understand the motivation behind this change, this post introduces erasure coding and compares the two methods.

What is Erasure Coding?

Suppose we want to store a piece of data on a fault-tolerant, distributed file system. In this case, the loss of any single drive should not result in data loss. The only way to achieve fault tolerance is through redundancy: storing extra information about the data across different drives so that it can be completely recovered after a failure. The more redundancy we add, the greater the fault tolerance. However, the cost of redundancy is increased storage overhead. Every file system needs to make this tradeoff between availability and overhead. At Rubrik, the…
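
The tradeoff between fault tolerance and storage overhead described in this excerpt can be made concrete with a little arithmetic. The sketch below compares n-way mirroring with a Reed-Solomon layout of k data blocks plus m parity blocks. The specific parameters (k = 4, m = 2) and the 120 TB raw capacity are illustrative assumptions, not values stated in the excerpt; they are chosen only because they would be consistent with the "doubled capacity" claim in the title. The function names are hypothetical.

```python
from fractions import Fraction


def mirroring_overhead(copies: int) -> Fraction:
    """Raw bytes stored per byte of user data when every block is kept `copies` times."""
    return Fraction(copies, 1)


def reed_solomon_overhead(data_blocks: int, parity_blocks: int) -> Fraction:
    """Raw bytes stored per byte of user data for k data blocks plus m parity blocks."""
    return Fraction(data_blocks + parity_blocks, data_blocks)


def usable_capacity(raw_tb: int, overhead: Fraction) -> Fraction:
    """User data (in TB) that fits in `raw_tb` of raw disk at the given overhead."""
    return Fraction(raw_tb) / overhead


# Assumed, illustrative parameters (not taken from the excerpt).
RAW_TB = 120
mirror = mirroring_overhead(3)        # triple mirroring: survives 2 lost copies
rs = reed_solomon_overhead(4, 2)      # hypothetical RS(4, 2): survives 2 lost blocks

print("triple mirroring overhead:", mirror, "usable TB:", usable_capacity(RAW_TB, mirror))
print("RS(4, 2) overhead:", rs, "usable TB:", usable_capacity(RAW_TB, rs))
```

Under these assumed parameters both schemes tolerate the loss of any two blocks, but the erasure-coded layout stores 1.5 bytes per user byte instead of 3, so the same raw disks hold twice as much user data.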

Proactive, Real-time Monitoring and Alerting for Customer Engagement

One of our main values is to deliver world-class customer experience. This entails actively engaging with customers to better understand their needs and to build products with the user in mind. An important tool for this type of engagement is a system reporting framework that allows our software to actively report data, so that we can take the best possible course of action to continuously improve customer experience and the quality of our product.

The Importance of Proactive Customer Engagement

While proactive monitoring is not a new concept, many companies still rely on reactive support. Reactive customer engagement is expensive and time-consuming, while proactive engagement focuses on non-intrusive, metric-based optimizations to the performance of a product. At its core, it emphasizes an ongoing, elevated customer experience. Additionally, a proactive strategy requires cross-functional collaboration to drive success. At Rubrik, we monitor data in real time and leverage analytics to: predict customer needs before they arise; optimize and analyze the performance of our software; and alert on abnormal behavior to improve product quality. In summary, it allows us to understand how our services are performing, identify potential areas of risk, and obtain a holistic view of the overall health of our systems…

Do No Harm: How We Schedule Backups Without Impacting Production

One of our principal tenets is to “do no harm” to production systems. The act of taking a backup requires several resources from the primary environment, including CPU, I/O, and network bandwidth. Rubrik’s objective is to minimize the impact on production, including VM stun, as much as possible. We achieve this for both virtual and physical workloads by using the following three methods:

Flash-Speed Data Ingest: One way we adhere to our “do no harm” principle is through a high-speed data ingestion engine that can easily handle large volumes of data. Data enters Rubrik through the flash tier and lands on spinning disks, minimizing the time that we spend communicating with production infrastructure.

Policy-Driven Automation: The user specifies when and how often Rubrik takes backups by setting up SLA policies to signal the times when production systems have the least load. This SLA-based backup scheduling makes creating a backup faster, taking advantage of idle cycles on a primary system.

Intelligently Distributed Workflow Management: Lastly, Rubrik only takes backups when it detects that the production system is not loaded. Users can set thresholds for certain metrics, such as CPU utilization and I/O latency, globally or on a per-object basis. Rubrik monitors…
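
The third method, thresholds set globally or per object, can be illustrated with a small sketch. The excerpt is cut off before it explains how the monitoring actually works, so everything below is a hypothetical illustration of the general idea rather than Rubrik's implementation: the `Thresholds` type, the function names, the object IDs, and the numeric limits are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical limits above which a production system counts as loaded."""
    max_cpu_percent: float
    max_io_latency_ms: float


def effective_thresholds(global_limits: Thresholds,
                         per_object: Dict[str, Thresholds],
                         object_id: str) -> Thresholds:
    """A per-object override wins; otherwise fall back to the global limits."""
    return per_object.get(object_id, global_limits)


def may_start_backup(cpu_percent: float, io_latency_ms: float,
                     limits: Thresholds) -> bool:
    """Allow a backup only while every monitored metric is within its limit."""
    return (cpu_percent <= limits.max_cpu_percent
            and io_latency_ms <= limits.max_io_latency_ms)


# Assumed example values: one global policy and one stricter per-object override.
GLOBAL_LIMITS = Thresholds(max_cpu_percent=80.0, max_io_latency_ms=20.0)
OVERRIDES = {"vm-database-01": Thresholds(max_cpu_percent=50.0, max_io_latency_ms=10.0)}

for object_id in ("vm-web-01", "vm-database-01"):
    limits = effective_thresholds(GLOBAL_LIMITS, OVERRIDES, object_id)
    print(object_id, may_start_backup(cpu_percent=60.0, io_latency_ms=8.0, limits=limits))
```

With the same observed metrics, the object that uses the global limits is allowed to back up while the one with the stricter override is deferred, which is the behavior a per-object threshold is meant to provide.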

How We Built More Efficient Data Archival with Cloud

The move to cloud is no longer a question of if but rather when. However, enterprises are still unsure how to adopt a cloud strategy within their own environments. As our CEO Bipul Sinha stated at the Looking AHEAD Tech Summit, in order to increase cloud adoption, “companies need to create killer applications to leverage the cloud.” At Rubrik, we create applications that help enterprises transition to cloud seamlessly. The first step in the path is to archive the backups. The challenge of archiving to public cloud is ensuring that data can be pulled down into an on-premises location without breaking the bank or your recovery time objectives. This is where Rubrik works its magic. When Rubrik manages your data, it keeps a record of the metadata that is quickly accessible without data rehydration. You can locate VMs and files instantly with Google-like search. Just type a few letters into Rubrik’s predictive search engine, and you’ll get served results instantly. In this post, I will describe how Rubrik archives data and makes data rehydration fast and efficient.

Snapshot Upload

We have jobs running per VM that archive snapshots depending on the configured SLA policy for that VM. When an…

If I Could Build Anything (Datacenter Edition)

As a field engineer at Rubrik and a former customer, I have seen data centers (DCs) containing all types of vendors and products. However, I have never been able to build an environment from a blank slate, or what’s known today as “greenfield.” In this post and subsequent webinar, I’ll explore the characteristics of a greenfield data center using two of my favorite companies and their products.

Objectives: As guiding principles, I’m aiming to do three things in each area of the DC:

Simplify: Deliver the most bang with the least overhead
Soar: Accelerate business-critical unencumbered
Shrink: Represent the smallest physical and environmental footprint

Below, I walk you through all aspects of the DC, incorporating each of these principles to create simple, modern, and scale-out architecture.

Applications and Compute

Technology exists to enable business needs. Without the business, the bits, bytes and architectures are fun, but lack a unifying reason to live. Thus, for our purposes here, I’ll assume we’re cutting and pasting a modern, virtualized environment into this ideal data center. It runs services like Oracle, SQL, and Exchange, and holds terabytes of files and application data that are core to the enterprise. Following Moore’s Law, the compute layer has…

A How-To Guide on Rubrik’s vRealize Automation Integration

As the modern data center becomes increasingly software-defined, it is critical to select technology platforms that will support this new architecture even if you aren’t providing IT-as-a-Service today. The best way to prepare for the future is to implement modern, API-driven solutions in your environment that automate consumer-oriented services.

Rubrik’s API-First Architecture Lets You “Backup-As-A-Service”

The Rubrik platform was purpose-built to address this need. Rubrik is an API-driven solution. In fact, the Rubrik UI is simply a consumer of the underlying REST APIs exposed by every Rubrik system. What does this mean? It means that everything in Rubrik is easily automated by REST APIs. Since just about every automation platform supports RESTful APIs, you are free to use whatever software you like or even write your own homegrown solution for Rubrik. Now you can add Backup-as-a-Service into your ITaaS portfolio!

An Example

Let’s say you have an Infrastructure-as-a-Service (IaaS) solution in place (vRealize Automation, or vRA, for example) where users can self-provision systems. You’re still likely using a manual or semi-manual process by which people request data protection for those self-provisioned systems. In a typical workflow, this would be a form the user would submit, and that form…

How We Built a Suite of Automated End-to-End Tests

Last week, I covered the importance of quality and why we employed automated end-to-end testing. In this post, I explain how we implement this approach. We do so through a release pipeline orchestrated by Jenkins to efficiently run a large suite of end-to-end tests. These tests leverage our custom testing framework, which integrates with support tooling. As we receive customer feedback, we continuously update the framework and test cases to keep up with the latest requirements. Below, I describe in more detail our release pipeline, testing framework, and product support functions that ensure our testing is faster, more efficient, and always high quality.

Jenkins Continuous Integration

Like many engineering organizations, we use Jenkins as our continuous integration tool. As engineers check in new code, Jenkins continuously runs the suite of tests we built, including both unit tests and end-to-end tests. This allows us to quickly detect and correct issues.

Release Pipeline

If we ran every full test for every code check-in, we would quickly exhaust all our test resources and file duplicate bugs. Although we can easily add more test resources, duplicate bugs waste engineering time by requiring extra triage and diagnosis work. Instead, we define a release…

Automation Rules the Kingdom: Why Quality is Important For You

Automated End-to-End Testing: Ensuring Quality

We take quality seriously. For both our customers’ and developers’ satisfaction, it is essential to provide consistent product performance and speed of development with confidence that existing use cases are not broken. To ensure agile development, here’s why quality is essential to your organization and how our strategy makes automated end-to-end testing fast, reliable, and responsive.

Importance of Quality

For Customers: Every company claims to deliver high quality to their customers, but this is especially critical for Rubrik. Our product is responsible for managing highly valuable data that powers our customers’ businesses. In the backup and recovery industry, our solution needs to be on active duty at the exact moment our customers experience trouble within systems protected by us. Given that these problems are complex, providing an extremely simple user experience alleviates troubleshooting. Of course, this user experience can only remain simple as long as all the underlying pieces are performing reliably.

For Engineers: Engineers want to innovate without breaking existing functionality that customers depend on. If the fundamentals fail, customers cannot upgrade without losing data. It’s often difficult to innovate without affecting the interoperating pieces. In Rubrik’s case, we integrate at all levels…

Here’s Why You Should Shelve Backup Jobs for Declarative Policies

Changing out legacy, imperative data center models for the more fluid declarative models really gets me excited, and I’ve written about the two ideas in an earlier post. While the concept isn’t exactly new for enterprise IT (many folks enjoy using declarative solutions from configuration management tools such as Puppet), the scope of deployment has largely been limited to compute models for running workloads. The data protection space has largely been left fallow, awaiting some serious innovation. In fact, this is something I hear from Rubrik’s channel partners and customers quite frequently because their backup and recovery world has forever been changed by the simplicity and power of converged data management. To quote Justin Warren in our Eigencast recording, backup should be boring and predictable rather than exciting and adventurous because the restoration process failed, and you’re now responsible for missing data. That’s never fun.

Thinking deeper on this idea brings me to one of the more radical changes that a new platform brings: no longer needing to schedule backup jobs. Creating jobs and telling them when exactly to run, including dependency chains, is the cornerstone of all legacy backup solutions. As part of their…