Architecture

If I Could Build Anything (Datacenter Edition)

As a field engineer at Rubrik and a former customer, I have seen data centers (DCs) containing all types of vendors and products. However, I have never been able to build an environment from a blank slate, or what’s known today as “greenfield.” In this post and subsequent webinar, I’ll explore the characteristics of a greenfield data center using two of my favorite companies and their products.

Objectives

As guiding principles, I’m aiming to do three things in each area of the DC:

- Simplify: Deliver the most bang with the least overhead
- Soar: Accelerate business-critical workloads, unencumbered
- Shrink: Represent the smallest physical and environmental footprint

Below, I walk you through all aspects of the DC, incorporating each of these principles to create a simple, modern, scale-out architecture.

Applications and Compute

Technology exists to enable business needs. Without the business, the bits, bytes, and architectures are fun, but they lack a unifying reason to live. Thus, for our purposes here, I’ll assume we’re cutting and pasting a modern, virtualized environment into this ideal data center. It runs services like Oracle, SQL, and Exchange, and holds terabytes of files and application data that are core to the enterprise. Following Moore’s Law, the compute layer has…
A How-To Guide on Rubrik’s vRealize Automation Integration

As the modern data center becomes increasingly software-defined, it is critical to select technology platforms that will support this new architecture, even if you aren’t providing IT-as-a-Service today. The best way to prepare for the future is to implement modern, API-driven solutions in your environment that automate consumer-oriented services.

Rubrik’s API-First Architecture Lets You “Backup-as-a-Service”

The Rubrik platform was purpose-built to address this need. Rubrik is an API-driven solution; in fact, the Rubrik UI is simply a consumer of the underlying REST APIs exposed by every Rubrik system. What does this mean? It means that everything in Rubrik is easily automated through REST APIs. Since just about every automation platform supports RESTful APIs, you are free to use whatever software you like, or even write your own homegrown solution for Rubrik. Now you can add Backup-as-a-Service to your ITaaS portfolio!

An Example

Let’s say you have an Infrastructure-as-a-Service (IaaS) solution in place (vRealize Automation, or vRA, for example) where users can self-provision systems. You’re still likely using a manual or semi-manual process by which people request data protection for those self-provisioned systems. In a typical workflow, this would be a form the user would submit, and that form…
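To make “everything is a REST call” concrete, here is a minimal sketch of assigning protection from an automation platform. The cluster address, endpoint paths, and field names are illustrative assumptions rather than confirmed API details; a real integration would follow the API documentation exposed by the cluster itself.

```python
# A minimal sketch of driving backup assignment via REST from an automation
# platform. Endpoint paths and field names are illustrative assumptions,
# not confirmed API details; consult your cluster's API documentation.
import requests

CLUSTER = "https://rubrik.example.com"   # hypothetical cluster address
TOKEN = "my-api-token"                   # an API token obtained out of band

def assign_sla(vm_name: str, sla_id: str) -> None:
    """Find a VM by name and declare its SLA Domain (hypothetical endpoints)."""
    headers = {"Authorization": f"Bearer {TOKEN}"}
    # 1. Look the VM up the same way the UI would.
    vms = requests.get(f"{CLUSTER}/api/v1/vmware/vm",
                       params={"name": vm_name},
                       headers=headers).json()["data"]
    vm_id = vms[0]["id"]
    # 2. Declare the desired protection level; the cluster does the rest.
    requests.patch(f"{CLUSTER}/api/v1/vmware/vm/{vm_id}",
                   json={"configuredSlaDomainId": sla_id},
                   headers=headers)

# Called from, e.g., a vRA workflow step after a machine is provisioned:
# assign_sla("web-prod-042", sla_id="gold-sla-domain-id")
```

From vRA’s point of view, this is just another outbound REST call in a provisioning workflow, which is exactly why no special-purpose plumbing is needed.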
How We Built a Suite of Automated End-to-End Tests

Last week, I covered the importance of quality and why we employ automated end-to-end testing. In this post, I explain how we implement this approach. We do so through a release pipeline orchestrated by Jenkins to efficiently run a large suite of end-to-end tests. These tests leverage our custom testing framework, which integrates with support tooling. As we receive customer feedback, we continuously update the framework and test cases to keep up with the latest requirements. Below, I describe in more detail our release pipeline, testing framework, and product support functions that ensure our testing is faster, more efficient, and always high quality.

Jenkins Continuous Integration

Like many engineering organizations, we use Jenkins as our continuous integration tool. As engineers check in new code, Jenkins continuously runs the suite of tests we built, including both unit tests and end-to-end tests. This allows us to quickly detect and correct issues.

Release Pipeline

If we ran the full test suite for every code check-in, we would quickly exhaust all our test resources and file duplicate bugs. Although we can easily add more test resources, duplicate bugs waste engineering time by requiring extra triage and diagnosis work. Instead, we define a release…
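The gating idea behind such a pipeline is easy to sketch. Below is a heavily simplified illustration (the suite names, commands, and stage layout are my own assumptions, not Rubrik’s actual pipeline): cheap suites run on every check-in, and the expensive end-to-end suites run only on builds that already passed, so a single regression surfaces once instead of fanning out into duplicate bugs.

```python
# A conceptual sketch of staged test gating, not Rubrik's actual pipeline.
# Suite names, commands, and stage layout are illustrative assumptions.
import subprocess

# Cheap suites gate every check-in; expensive end-to-end suites run only
# on builds that survived the earlier stages.
PIPELINE = [
    ("unit",      ["pytest", "tests/unit"]),
    ("smoke-e2e", ["pytest", "tests/e2e", "-m", "smoke"]),
    ("full-e2e",  ["pytest", "tests/e2e"]),
]

def run_pipeline() -> bool:
    for stage, command in PIPELINE:
        print(f"--- stage: {stage} ---")
        if subprocess.run(command).returncode != 0:
            # Halt at the first failure: later, costlier suites never run
            # against a build already known to be broken, so one regression
            # produces one failure to triage rather than a pile of duplicates.
            print(f"stage {stage} failed; halting promotion")
            return False
    return True

if __name__ == "__main__":
    run_pipeline()
```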
Automation Rules the Kingdom: Why Quality is Important For You

Automated End-to-End Testing: Ensuring Quality

We take quality seriously. For the satisfaction of both our customers and our developers, it is essential to provide consistent product performance and speed of development, with confidence that existing use cases are not broken. To ensure agile development, here’s why quality is essential to your organization and how our strategy makes automated end-to-end testing fast, reliable, and responsive.

Importance of Quality

For Customers: Every company claims to deliver high quality to its customers, but this is especially critical for Rubrik. Our product is responsible for managing highly valuable data that powers our customers’ businesses. In the backup and recovery industry, our solution needs to be on active duty at the exact moment our customers experience trouble within the systems we protect. Given that these problems are complex, providing an extremely simple user experience eases troubleshooting. Of course, the user experience can only stay simple as long as all the underlying pieces perform reliably.

For Engineers: Engineers want to innovate without breaking existing functionality that customers depend on. If the fundamentals fail, customers cannot upgrade without losing data. It’s often difficult to innovate without affecting the interoperating pieces. In Rubrik’s case, we integrate at all levels…
Here’s Why You Should Shelve Backup Jobs for Declarative Policies

Changing out legacy, imperative data center models for the more fluid declarative models really gets me excited, and I’ve written about the two ideas in an earlier post. While the concept isn’t exactly new for enterprise IT (many folks enjoy using declarative solutions from configuration management tools such as Puppet), the scope of deployment has largely been limited to compute models for running workloads. The data protection space has largely been left fallow, awaiting some serious innovation. In fact, this is something I hear from Rubrik’s channel partners and customers quite frequently, because their backup and recovery world has been changed forever by the simplicity and power of converged data management.

To quote Justin Warren in our Eigencast recording, backup should be boring and predictable, rather than exciting and adventurous because the restoration process failed and you’re now responsible for missing data. That’s never fun.

Thinking deeper on this idea brings me to one of the more radical ideas that a new platform offers: no longer needing to schedule backup jobs at all. Creating jobs and telling them exactly when to run, including dependency chains, is the cornerstone of all legacy backup solutions. As part of their…
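To make the imperative-versus-declarative contrast concrete, here is a minimal sketch of the declarative shape (the names and policy fields are illustrative, not Rubrik’s model): you state the protection level you want, and a reconciliation loop decides when work must happen, instead of you scheduling jobs and dependency chains.

```python
# A conceptual sketch of declarative protection with illustrative names;
# it shows the shape of the idea, not Rubrik's implementation.
import time
from dataclasses import dataclass

@dataclass
class SlaPolicy:
    name: str
    snapshot_every_hours: int    # desired state: a snapshot at least this often

@dataclass
class Workload:
    name: str
    policy: SlaPolicy
    last_snapshot: float = 0.0   # epoch seconds of the most recent snapshot

def reconcile(workloads: list) -> None:
    """Compare desired state to actual state; act only where they differ."""
    now = time.time()
    for wl in workloads:
        due = wl.last_snapshot + wl.policy.snapshot_every_hours * 3600
        if now >= due:
            # The system, not an operator-built schedule, decides that work
            # is needed right now to keep the workload within policy.
            print(f"snapshotting {wl.name} to satisfy policy {wl.policy.name}")
            wl.last_snapshot = now

gold = SlaPolicy(name="gold", snapshot_every_hours=4)
fleet = [Workload("sql-prod", gold), Workload("exchange-01", gold)]
reconcile(fleet)   # run periodically; no job definitions or dependency chains
```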
How Cloud Native Archive Destroys Legacy Cost Models

A while back, I was reading about the woes of one Marko Karppinen as he described the incredible ease of getting data into a public cloud, and the equal and opposite horror of getting that data back out. His post, which can be found here, outlines his crafty plan to store around 60 GB of audio data in an archive for later retrieval and potential encoding. The challenge, then, is ensuring that the data can later be pulled down to an on-premises location without breaking the bank or the implied SLAs (Service Level Agreements).

And this, folks, is the rub when using legacy architecture that bolts on public cloud storage (essentially object storage) without fleshing out all of the financial and technological challenges. I’ve teased apart this idea when describing the Cloud Native property of Converged Data Management in an earlier post.

“Getting data into the cloud is for amateurs. Getting data back out is for experts.”

If using a public cloud for storage becomes hundreds of times more expensive than intended, while also requiring a significant time investment from the technical teams involved, then it’s not a solution for data protection. While it’s true that the blog post I’m referencing is a single…
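A rough back-of-the-envelope model shows how lopsided the economics can get. The numbers below are illustrative assumptions only; the peak-hourly-rate formula is the legacy cold-archive billing model that famously caught people out, and current pricing differs by provider and tier.

```python
# Back-of-the-envelope: why fast retrieval from a cold archive can cost far
# more than storing the data. The peak-hourly-rate formula below is the
# legacy cold-tier model; all rates are illustrative, not current pricing.
ARCHIVE_GB = 60
HOURS_TO_RETRIEVE = 4              # pulling everything down in one evening

storage_per_gb_month = 0.01        # cold storage is famously cheap going in
egress_per_gb = 0.09               # assumed internet egress rate

monthly_storage = ARCHIVE_GB * storage_per_gb_month

# Legacy model: billing keyed to your *peak hourly* retrieval rate, applied
# across every hour of the month (~720 hours).
peak_gb_per_hour = ARCHIVE_GB / HOURS_TO_RETRIEVE
retrieval_fee = peak_gb_per_hour * 0.01 * 720
egress_fee = ARCHIVE_GB * egress_per_gb

print(f"storing 60 GB:    ${monthly_storage:.2f}/month")        # $0.60
print(f"retrieving in 4h: ${retrieval_fee + egress_fee:.2f}")   # ~$113
# Retrieval here costs ~190x one month of storage: cheap in, expensive out.
```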
Why We Built Our Own VSS Provider

In last week’s post, Kenny explained how we designed Rubrik to eliminate the effects of VMware application stun. We couple flash with a distributed architecture to deliver faster ingest that scales linearly with cluster growth. We reduce the number of data hops by collapsing discrete backup hardware and software into a single software fabric. We tightly manage the number of operations hitting the ESXi hosts to speed up consolidation. Our own VSS provider also contributes to this effort. In this week’s post, Part 2 of our App Consistency series, I’ll explain why we built our own VSS agent and how we take app-consistent snapshots.

Maintaining application and data consistency is standard industry practice for any backup solution worth its salt. To back up transactional applications installed on a Windows server (SQL, Exchange, Oracle), we utilize Microsoft’s native Volume Shadow Copy Service (VSS). Taking an application-consistent snapshot not only captures all of the VM’s data at the same point in time, but also waits for the VM to flush in-flight I/O operations and transactions.

We Hate Bad Days and Sleepless Nights

Failed backup jobs are a leading cause of bad days and sleepless nights, which is why we took extra care to mitigate risk factors when protecting…
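For readers new to VSS, the sequence a backup requestor drives is worth seeing in miniature. Real VSS is a Windows COM interface (IVssBackupComponents and friends), so the Python below is only a conceptual model of the ordering, with hypothetical names:

```python
# Conceptual model of the VSS snapshot sequence only. Real VSS is a Windows
# COM API (IVssBackupComponents and friends); these names are hypothetical.
from contextlib import contextmanager

class Writer:
    """Stands in for a VSS writer, e.g. the SQL Server or Exchange writer."""
    def __init__(self, name: str):
        self.name = name

    def freeze(self):
        # The writer flushes in-flight transactions and holds new writes so
        # on-disk state is application-consistent, not merely crash-consistent.
        print(f"{self.name}: flushed and frozen")

    def thaw(self):
        print(f"{self.name}: thawed, writes resume")

@contextmanager
def quiesced(writers):
    """Freeze all writers, yield a window for the snapshot, always thaw."""
    for w in writers:
        w.freeze()
    try:
        yield    # the provider must create the shadow copy inside this window,
    finally:     # which VSS bounds tightly (on the order of seconds)
        for w in writers:
            w.thaw()

with quiesced([Writer("SQL Server"), Writer("Exchange")]):
    print("provider: shadow copy created")   # the provider's moment to act
```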
Reducing the Impact of Application Stun

Application stunning during the snapshot process is a topic that often bubbles up in customer conversations about data protection for VMware environments. To level set: application stun goes hand-in-hand with any snapshot operation. VMware stuns (quiesces) the virtual machine (VM) when the snapshot is created and again when it is deleted. Cormac Hogan has a great post on this here.

Producing a backup from a VM disk file requires the VM to be stunned, a snapshot of the VM disk file to be ingested, and the deltas to be consolidated into the base disk. If you’re snapping a highly transactional application, like a database, nasty side effects appear in the form of lengthy backup windows and application time-outs when the “stun-ingest-consolidate” workflow is not efficiently managed.

When a snapshot of the base VMDK is created, VMware creates a delta VMDK. Write operations are redirected to the delta VMDK, which expands over time for an active VM. Once the backup completes, the delta VMDK needs to be consolidated with the base VMDK. Longer backup windows lead to bigger delta files, resulting in a longer consolidation process. If the rate of I/O operations exceeds the rate of consolidation, you’ll end up with application time-outs. Rubrik was designed…
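The arithmetic behind that warning is worth making explicit. In the rough model below (every rate is an illustrative assumption), the delta grows for the whole backup window, and consolidation only gains ground at the difference between the consolidation rate and the incoming write rate:

```python
# Rough model of delta-VMDK growth vs. consolidation. Every rate below is an
# illustrative assumption chosen to show the failure mode, not measured data.
def consolidation_minutes(write_mb_s: float, consolidate_mb_s: float,
                          backup_window_s: float) -> float:
    """Minutes to merge the delta back into the base disk, or infinity if
    writes arrive faster than consolidation can absorb them."""
    delta_mb = write_mb_s * backup_window_s    # delta built up during backup
    gain_mb_s = consolidate_mb_s - write_mb_s  # net progress per second
    if gain_mb_s <= 0:
        return float("inf")                    # the merge never catches up
    return delta_mb / gain_mb_s / 60

# A busy database writing 40 MB/s through a 2-hour backup window:
for rate in (200, 80, 45):
    mins = consolidation_minutes(40, rate, 2 * 3600)
    print(f"consolidate @ {rate} MB/s -> {mins:.0f} min of consolidation")
# 200 MB/s -> 30 min; 80 MB/s -> 120 min; 45 MB/s -> 960 min. As the
# consolidation rate approaches the write rate, the window explodes, and
# that is when application time-outs appear.
```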
Meet Cerebro, the Brains Behind Rubrik’s Time Machine

Fabiano Botelho, father of two and star soccer player, explains how Cerebro was designed. Previously, Fabiano was the tech lead of Data Domain’s Garbage Collection team.

Rubrik is a scale-out data management platform that enables users to protect their primary infrastructure. Cerebro is the “brains” of the system, coordinating the movement of customer data from initial ingest and propagating that data to other locations, such as cloud storage and remote clusters (for replication). It is also where the data compaction engine (deduplication, compression) sits. In this post, we’ll discuss how Cerebro efficiently stores data with global deduplication and compression while making Instant Recovery & Mount possible.

Cerebro ties our API integration layer, which has adapters to extract data from various data sources (e.g., VMware, Microsoft, Oracle), to our different storage layers (Atlas and cloud providers like Amazon and Google). It achieves this by leveraging a distributed task framework and a distributed metadata system. See AJ’s post on the key components of our system.

Cerebro solves many challenges while managing the data lifecycle, such as efficiently ingesting data at the cluster level, storing data compactly while keeping it readily accessible for instant recovery, and ensuring data integrity at all times. This is what…
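To give a feel for what a data compaction engine does, here is a toy sketch of deduplication plus compression (fixed-size chunks and an in-memory dict; this shows the general technique, not Cerebro’s actual design):

```python
# A toy illustration of deduplication + compression, not Cerebro's design:
# real engines use variable-size chunking, distributed indexes, and more.
import hashlib
import zlib

CHUNK = 64 * 1024              # fixed 64 KiB chunks for simplicity
store = {}                     # fingerprint -> compressed chunk bytes

def ingest(data: bytes) -> list:
    """Split data into chunks, keep one compressed copy per unique chunk,
    and return the list of fingerprints that reconstructs the stream."""
    recipe = []
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in store:                    # global dedup: store once,
            store[fp] = zlib.compress(chunk)   # and compressed at that
        recipe.append(fp)
    return recipe

def restore(recipe: list) -> bytes:
    """Recipes make any point-in-time copy addressable without rehydration."""
    return b"".join(zlib.decompress(store[fp]) for fp in recipe)

snap1 = ingest(b"A" * CHUNK * 3)   # three identical chunks -> stored once
snap2 = ingest(b"A" * CHUNK * 3)   # a second "backup" adds no new chunks
assert restore(snap1) == b"A" * CHUNK * 3
print(f"unique chunks stored: {len(store)}")   # 1
```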