Architecture

How Cloud Native Archive Destroys Legacy Cost Models

A while back, I was reading about the woes of one Marko Karppinen as he described the incredible ease of getting data into a public cloud, and the equal and opposite horror of getting that data back out. His post, which can be found here, outlines his crafty plan to store around 60 GB of audio data in an archive for later retrieval and potential encoding. The challenge, then, is ensuring that the data can later be pulled down to an on-premises location without breaking the bank or implied SLAs (Service Level Agreements). And this, folks, is the rub when using legacy architecture that bolts on public cloud storage (essentially object storage) without fleshing out all of the financial and technological challenges. I teased apart this idea when describing the Cloud Native property of Converged Data Management in an earlier post. “Getting data into the cloud is for amateurs. Getting data back out is for experts.” If using a public cloud for storage becomes hundreds of times more expensive than intended, while also requiring a significant time investment from the technical teams involved, then it’s not a solution for data protection. While it’s true that the blog post I’m referencing is a single…
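
To make the asymmetry concrete, here is a back-of-the-envelope sketch in Python. The prices are illustrative assumptions, not any provider’s actual rates, but they show how a single full restore can dwarf the monthly cost of simply keeping the archive.

```python
# Back-of-the-envelope comparison of storing vs. retrieving an archive.
# All prices are illustrative assumptions, not actual provider rates.

ARCHIVE_GB = 60                  # the ~60 GB audio archive from the example
STORAGE_PER_GB_MONTH = 0.004     # assumed cold-storage price, $/GB-month
EGRESS_PER_GB = 0.09             # assumed internet egress price, $/GB
RETRIEVAL_PER_GB = 0.01          # assumed cold-tier retrieval fee, $/GB

monthly_storage = ARCHIVE_GB * STORAGE_PER_GB_MONTH
one_full_restore = ARCHIVE_GB * (EGRESS_PER_GB + RETRIEVAL_PER_GB)

print(f"Keeping {ARCHIVE_GB} GB archived: ${monthly_storage:.2f}/month")
print(f"One full restore:          ${one_full_restore:.2f}")
print(f"A single restore costs ~{one_full_restore / monthly_storage:.0f}x "
      f"a month of storage")
```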

Why We Built Our Own VSS Provider

In last week’s post, Kenny explained how we designed Rubrik to eliminate the effects of VMware application stun. We couple flash with a distributed architecture to deliver faster ingest that scales linearly with cluster growth. We reduce the number of data hops by collapsing discrete backup hardware/software into a single software fabric. We tightly manage the number of operations hitting the ESXi hosts to speed up consolidation. Our own VSS Provider also contributes to this effort. In this week’s post, Part 2 of our App Consistency series, I’ll explain why we built our own VSS agent and how we take app-consistent snapshots. Maintaining application and data consistency is industry-standard practice for any backup solution worth its salt. To back up transactional applications installed on a Windows server (SQL, Exchange, Oracle), we utilize Microsoft’s native Volume Shadow Copy Service (VSS). Taking an application-consistent snapshot not only captures all of the VM’s data at the same time, but also waits for the VM to flush in-flight I/O operations and transactions.

We Hate Bad Days and Sleepless Nights

Failed backup jobs are a leading cause of bad days and sleepless nights, which is why we took extra care to mitigate risk factors when protecting…
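
For readers who like to see the moving parts, here is a conceptual sketch of the quiesce-snapshot-thaw sequence that VSS coordinates. The classes and method names are hypothetical stand-ins, not Microsoft’s VSS API or our provider’s internals; the point is that the application is held consistent only for the instant the snapshot is cut.

```python
# Conceptual sketch of the quiesce-snapshot-thaw sequence that VSS
# coordinates. Class and method names are hypothetical stand-ins, not
# Microsoft's VSS API or Rubrik's provider internals.

import contextlib

class App:
    """Stand-in for a transactional application (e.g., a database)."""
    def flush_transactions(self): print("flush in-flight transactions")
    def freeze_io(self):          print("freeze new writes")
    def thaw_io(self):            print("thaw: writes resume")

class Volume:
    """Stand-in for the volume holding the app's data."""
    def create_snapshot(self):
        print("point-in-time snapshot taken")
        return "snap-001"

@contextlib.contextmanager
def quiesced(app):
    app.flush_transactions()  # commit pending writes so disk state is consistent
    app.freeze_io()           # hold new I/O only while the snapshot is cut
    try:
        yield
    finally:
        app.thaw_io()         # resume I/O immediately; keep the stun short

def app_consistent_snapshot(app, volume):
    with quiesced(app):
        return volume.create_snapshot()

print(app_consistent_snapshot(App(), Volume()))
```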

Reducing the Impact of Application Stun

Application stunning during the snapshot process is a topic that often bubbles up in customer conversations on data protection for VMware environments. To level set: application stun goes hand in hand with any snapshot operation. VMware stuns (quiesces) the virtual machine (VM) when the snapshot is created and deleted. Cormac Hogan has a great post on this here. Producing a snapshot of a VM disk file requires the VM to be stunned, a snapshot of the VM disk file to be ingested, and deltas to be consolidated into the base disk. If you’re snapping a highly transactional application, like a database, nasty side effects appear in the form of lengthy backup windows and application time-outs when the “stun-ingest-consolidate” workflow is not efficiently managed. When a snapshot of the base VMDK is created, VMware creates a delta VMDK. Write operations are redirected to the delta VMDK, which grows over time on an active VM. Once the backup completes, the delta VMDK must be consolidated with the base VMDK. Longer backup windows lead to bigger delta files, resulting in a longer consolidation process. If the rate of I/O operations exceeds the rate of consolidation, you’ll end up with application time-outs. Rubrik was designed…
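
A toy model makes the failure mode easy to see. The rates below are illustrative assumptions; the takeaway is that once the write rate exceeds the consolidation rate, the delta never drains.

```python
# Toy model of the "stun-ingest-consolidate" trade-off. The rates are
# illustrative assumptions; the point is that once the write rate exceeds
# the consolidation rate, the delta VMDK never drains.

def consolidation_seconds(delta_gb, write_gb_s, consolidate_gb_s,
                          give_up_after=3600):
    """Seconds to drain the delta, or None if it grows without bound."""
    t = 0
    while delta_gb > 0:
        delta_gb += write_gb_s - consolidate_gb_s  # net change per second
        t += 1
        if t > give_up_after:
            return None  # writes outpace consolidation: time-outs ahead
    return t

print(consolidation_seconds(2.0, 0.05, 0.20))  # drains in ~14 seconds
print(consolidation_seconds(2.0, 0.25, 0.20))  # None: delta keeps growing
```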

Meet Cerebro, the Brains Behind Rubrik’s Time Machine

Fabiano Botelho, father of two and star soccer player, explains how Cerebro was designed. Previously, Fabiano was the tech lead of Data Domain’s Garbage Collection team. Rubrik is a scale-out data management platform that enables users to protect their primary infrastructure. Cerebro is the “brains” of the system, coordinating the movement of customer data from initial ingest and propagating that data to other locations, such as cloud storage and remote clusters (for replication). It is also where the data compaction engine (deduplication, compression) sits. In this post, we’ll discuss how Cerebro efficiently stores data with global deduplication and compression while making Instant Recovery & Mount possible. Cerebro ties our API integration layer, which has adapters to extract data from various data sources (e.g., VMware, Microsoft, Oracle), to our different storage layers (Atlas and cloud providers like Amazon and Google). It achieves this by leveraging a distributed task framework and a distributed metadata system. See AJ’s post on the key components of our system. Cerebro solves many challenges while managing the data lifecycle, such as efficiently ingesting data at the cluster level, storing data compactly while keeping it readily accessible for instant recovery, and ensuring data integrity at all times. This is what…
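
As a minimal illustration of the general technique (not Cerebro’s actual data path), here is content-addressed deduplication with compression in a few lines of Python: identical chunks are stored once and referenced by their hash.

```python
# Minimal illustration of content-addressed deduplication: identical chunks
# are stored once, compressed, and referenced by hash. This sketches the
# general technique, not Cerebro's actual data path.

import hashlib
import zlib

CHUNK = 64 * 1024          # fixed-size chunks, for simplicity
store = {}                 # hash -> compressed chunk ("global" dedup pool)

def ingest(data: bytes) -> list[str]:
    """Split data into chunks; store each unique chunk compressed."""
    refs = []
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:               # new content: compress + keep
            store[digest] = zlib.compress(chunk)
        refs.append(digest)                   # duplicate: just reference it
    return refs

snapshot_a = ingest(b"A" * CHUNK + b"B" * CHUNK)
snapshot_b = ingest(b"A" * CHUNK + b"C" * CHUNK)   # shares the "A" chunk
print(len(store))   # 3 unique chunks stored for 4 chunk references
```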

Contrasting a Declarative Policy Engine to Imperative Job Scheduling

One of the topics du jour for next-generation architecture is abstraction. Or, more specifically, the use of policies to allow technical professionals to manage ever-growing sets of infrastructure using a vastly simpler model. While it’s true I’ve talked about using policy in the past (read my first and second posts of the SLA Domain series), I wanted to go a bit deeper into how a declarative policy engine is vastly different from an imperative job scheduler, and why this matters for the technical community at large. This post is fundamentally about declarative versus imperative operations. In other words:

Declarative – Describing the desired end state for some object
Imperative – Describing every step needed to achieve the desired end state for some object

Traditional architecture has long been ruled by the imperative operational model. We take some piece of infrastructure and tell that same piece of infrastructure exactly what it must do to meet our desired end state. With data protection, this has resulted in backup tasks/jobs. Each job requires a non-trivial amount of hand-holding to function. This includes configuration items such as:

Which specific workloads / virtual machines must be protected
Where to send data and how to store that…
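
To make the contrast concrete, here is a small sketch of the declarative model, using a hypothetical policy schema: the operator states the desired end state, and an engine derives the imperative steps on its own.

```python
# Sketch of the declarative model: the operator states a desired end state
# (an SLA-style policy) and an engine derives the jobs. The schema and the
# engine here are hypothetical, for contrast with hand-built job schedules.

import datetime

policy = {                       # declarative: state *what* you want
    "name": "Gold",
    "snapshot_every_hours": 4,
    "retain_days": 30,
}

def next_actions(vm, last_snapshot, now, policy):
    """Imperative steps derived by the engine, not typed by an operator."""
    actions = []
    interval = datetime.timedelta(hours=policy["snapshot_every_hours"])
    if now - last_snapshot >= interval:
        actions.append(f"snapshot {vm}")
    actions.append(f"expire snapshots of {vm} older than "
                   f"{policy['retain_days']} days")
    return actions

now = datetime.datetime(2015, 11, 1, 12, 0)
print(next_actions("db-01", now - datetime.timedelta(hours=5), now, policy))
```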

Converged Data Management Unwrapped – Cloud Native

Intelligently placing data into a variety of formats and across geographic locations is non-trivial. With data protection, however, this isn’t just a nice-to-have; it’s often a functional design requirement. Doing so provides layers of safeguards against data-specific failures as well as local or regional catastrophes. In this fifth and final post of the deep dive series, I’m going to pick apart how Converged Data Management offers a truly Cloud Native experience for data protection workflows, and how that differs from traditional approaches. There’s a fundamental difference between adapting a platform to take advantage of cloud data services, such as public object storage with Amazon S3, and natively making them part of the platform. This difference can be distilled into one metric – simplicity. The complexity that surrounds a platform drives greater inefficiency and increases the chance of error. Additionally, the foundation of a platform ultimately dictates what features and properties are available for a long-term strategy. Without re-writing the platform from scratch – something largely avoided in the enterprise market – the choices are limited. After all, the spin-out and spin-in model used by large corporations to innovate doesn’t exist without reason. In the…

Converged Data Management Unwrapped – Instant Data Access

Now that the Rubrik team has returned from an energizing trip to VMworld Barcelona and TechTarget’s Backup 2.0 Road Tour, it’s time to peel the onion a bit further on Converged Data Management in the fourth installment of this deep dive series. In this post, I’ll put the magnifying glass up to Instant Data Access – the ability to see and interact with data in a global and real-time manner – to better understand why it’s a critical property of the architecture. First, let’s take a step back. There’s a lot of hemming and hawing on the Internet about data lakes, big data, the Internet of Things (IoT), and so forth. The rub is that data is growing (exponentially), is largely unstructured, and there’s no way to get our arms around it in any meaningful way without machine learning and a plethora of automation. ImageNet is a great real-world, practical example of this. This has led to some really nifty architectures crafted to deal with large data sets, and a rise in the popularity of storage systems that are eventually consistent, such as object storage clusters, rather than strongly consistent as typically…

Converged Data Management Unwrapped – API-Driven Architecture

Welcome to the third post in the Converged Data Management series: API-Driven Architecture. To best understand how this property impacts the modern data center, it’s important to take a step back and view how today’s services – meaning the various servers that make up an application offered to the business or its clients – are created, consumed, and retired. This is often referred to as lifecycle management. As a service is instantiated, there are a number of lifecycle milestones to reach. These include the details needed to request a service from a portal or catalog, the provisioning tasks to build the service, various care-and-feeding events while the service runs, and ultimately the retirement and archival of the service. More often than not, these milestones are wrapped into various orchestration engines and front-ended by a Cloud Management Portal (CMP) for administrative and tenant-based consumption. In other scenarios, automation workflows are still driven by point-and-click scripts and customization tasks. An API-Driven Architecture comes into play as IT professionals work to progress a service through the various lifecycle milestones alongside other third-party tools and infrastructure stacks. As technical teams attempt to piece together…
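
As an example of what this looks like in practice, here is a sketch of an orchestration step that assigns a protection policy to a freshly provisioned VM over HTTP. The endpoint and payload are hypothetical; the point is that the milestone is reachable programmatically rather than through a portal click.

```python
# Sketch of driving a lifecycle milestone through an HTTP API from an
# orchestration workflow. The endpoint and payload are hypothetical; the
# point is that every milestone is reachable programmatically.

import json
import urllib.request

def protect_service(api_base: str, token: str, vm_id: str, sla: str):
    """Assign a protection policy to a VM as one step in provisioning."""
    req = urllib.request.Request(
        f"{api_base}/v1/vm/{vm_id}",            # hypothetical endpoint
        data=json.dumps({"slaDomain": sla}).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="PATCH",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Called by a CMP/orchestrator right after the VM is provisioned, e.g.:
# protect_service("https://cluster.example.com/api", token, "vm-42", "Gold")
```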

Converged Data Management Unwrapped – Infinite Scalability

In the second part of my series on Converged Data Management, I’m putting another property under the microscope – Infinite Scalability. The underlying premise is that the fabric providing data management can be deployed in a shared-nothing manner with a limitless architecture focused on linear growth. Whoa, what does all of that mean? Let’s pick these ideas apart, one by one. A shared-nothing system is built from a series of nodes that have no dependency upon one another. If a node fails, or parts of a node fail, the fabric remains healthy and operational without penalty. Ideally, this architecture extends beyond the node itself, out to the enclosure, rack, or even entire data centers. Contrast this with systems that rely upon dependencies and use alternative tricks to hide or protect them – load balancers, failover clustering, and so forth. If a failure occurs, performance suffers due to the need to funnel data through a central choke point – such as a master server, a fixed set of proxy nodes, or a database instance. Availability is also put at risk, especially considering that most components in a dependency chain have only a single failover counterpart because of…
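
One common way to build a shared-nothing fabric is consistent hashing, sketched below: every object maps to a node by hash, so there is no master to become a choke point, losing a node only remaps that node’s share, and adding a node adds capacity linearly. This illustrates the general pattern, not our specific implementation.

```python
# Minimal sketch of shared-nothing work placement via consistent hashing.
# No master node: any node can compute where an object lives. Illustrative
# of the general pattern only, not a specific product's implementation.

import bisect
import hashlib

class HashRing:
    """Consistent-hash ring; losing a node only remaps that node's share."""
    def __init__(self, nodes, vnodes=64):
        self.ring = sorted(
            (self._h(f"{n}:{i}"), n) for n in nodes for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # First vnode clockwise from the key's hash owns the object.
        i = bisect.bisect(self.keys, self._h(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node1", "node2", "node3", "node4"])
print(ring.node_for("vm-42/snapshot-2015-11-01"))  # any node can answer this
```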