Building an Infinitely Scalable Time Machine
A lot has changed in how we manage our personal data (e.g. photo albums) over the last decade. Storing and protecting our albums used to take a lot of effort: you had to transfer photos from a camera to a computer, burn them to CDs or copy them to external hard disks, and buy more storage as space ran out. Accessing albums was just as cumbersome: you had to find the right disc or drive and then browse through the albums by hand to locate the photo you wanted.
Today, we deal with none of the complexity from years past. We take pictures with our phone, and our photos are automatically backed up and managed in the cloud, which never runs out of space. Any photo can be accessed instantly from any device, anywhere in the world.
Unfortunately, businesses still deal with all the complexity of managing their data that we, as consumers, faced in the past. In fact, business data is far more complicated than photo albums. Business databases change constantly, and businesses need access to both current and past versions of their data. Furthermore, businesses have stringent requirements around the availability and security of their data.
Storing data is hard as the amount of data keeps growing and storage systems run out of space, forcing the business to buy new storage systems and manually distribute their data across their existing and new systems.
Protecting data is complex as it involves deploying a suite of software and hardware (e.g. backup software, backup agents, backup media, database servers to store backup catalogs) while making sure all of these products work well with each other. These solutions are difficult to use, and lack the simplicity of consumer products that we use daily.
Accessing data is complicated. Because current storage systems cannot scale to meet the needs of different business units, each unit has to run its own storage systems, periodically copy production data into them, and maintain that infrastructure.
At Rubrik, our mission is to make it really easy for businesses to store, protect, and access their data. We believe businesses should have a simple and effective solution to manage their data just as consumers do.
It’s a bold and ambitious mission. Building a product to manage all of the world’s business data is no easy task, but with a world-class engineering team, we feel confident we can do it.
Today we are launching our engineering blog, where we’ll talk about how we are solving some of the challenges we face and the cool technologies that we are developing along the way. In this first post, I’d like to share some of the key components of our system. In future posts, we’ll cover each component in more detail.
- Distributed File System: There are two primary attributes that make a storage system easy to use: infinite storage and scalable I/O performance. Infinite storage removes the need to manage multiple storage systems and distribute data among them. In addition, you don’t need to forecast capacity or perform forklift upgrades. Scalable I/O performance allows you to run diverse application workloads on a single system and serve the needs of every business unit. A single storage system also makes it easy to track all access to enterprise data and ensure data privacy. Our distributed file system is designed to be infinitely scalable and uses a hybrid flash/hard disk architecture to achieve scalable performance. Increasing storage capacity and I/O performance is as simple as adding nodes to the system.
- Enterprise Time Machine: Rubrik builds a time machine of all the enterprise’s data. Versioned data management is a core feature of our file system. The time machine is a very powerful abstraction. It allows Rubrik to serve as a backup system for the enterprise; moreover, since all versions of the data are instantly available for reads and writes, it allows all enterprise applications to run directly on our system. There is no need to provision additional storage systems for applications that need to access this data.
- Distributed Metadata System: Rubrik’s file system and the time machine both depend on a scalable key-value store. The key-value store needs to be fast, scalable and highly available. By using RAM to cache recently accessed data and using flash as the persistent storage tier for all the data, our distributed metadata system achieves the required performance and serves as the fundamental building block for the rest of our system.
- Cluster Management System: This component manages the Rubrik system setup and ongoing health. We use a zero-configuration multicast DNS protocol to automate appliance discovery. Creating or growing the cluster requires minimal manual intervention, with new nodes auto-discovering each other. Furthermore, the system self-heals to ensure that it runs smoothly in the presence of failures.
- Global Search: Rubrik makes all enterprise data instantly searchable. The search works across historical and current versions of data, making it easy to instantly discover and recover any version of the data from the system.
- Workflow Framework: This is the engine that assigns and executes tasks across the cluster in a fault-tolerant and efficient manner. The framework ensures that tasks are load-balanced across the entire cluster and distributed to the nodes that contain the relevant data.
- Hybrid Cloud: The cloud is great for variable workloads, but enterprises have found it hard to adopt because of difficulties in seamlessly moving data offsite. Rubrik provides a single system that can securely send data to the public cloud, in addition to helping enterprises manage data whether it’s in the public cloud or on-premises.
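To make the time machine idea a bit more concrete, here is a minimal Python sketch of versioned data on top of a key-value store. It is only an illustration and not Rubrik's implementation: the `VersionedStore` class and its `put` and `get` methods are hypothetical names, and the store is a single-process, in-memory toy with none of the distribution, caching, or persistence described above. It covers only reads of past versions, not writes against them.

```python
import bisect
from collections import defaultdict


class VersionedStore:
    """Toy in-memory key-value store that keeps every version of every key.

    Hypothetical illustration only; not Rubrik's actual design.
    """

    def __init__(self):
        self._version = 0  # monotonically increasing global version counter
        # For each key: parallel lists of write versions and written values.
        self._versions = defaultdict(list)
        self._values = defaultdict(list)

    def put(self, key, value):
        """Record a new version of `key` and return its version number."""
        self._version += 1
        self._versions[key].append(self._version)
        self._values[key].append(value)
        return self._version

    def get(self, key, version=None):
        """Return the value of `key` as of `version` (latest if None)."""
        if version is None:
            version = self._version
        versions = self._versions.get(key, [])
        # Position just past the last write made at or before `version`.
        i = bisect.bisect_right(versions, version)
        if i == 0:
            raise KeyError(f"{key!r} did not exist at version {version}")
        return self._values[key][i - 1]


store = VersionedStore()
v1 = store.put("report.txt", "draft")
v2 = store.put("report.txt", "final")
latest = store.get("report.txt")               # current version
earlier = store.get("report.txt", version=v1)  # the file as it was at v1
```

Because old versions are never overwritten, reading the past is the same operation as reading the present, just with an explicit version number. That is the property that lets one store act as both a backup and a live data source in this sketch.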
In forthcoming posts, we’ll explore each one of these components in detail. Stay tuned!