Rubrik can gracefully back up hosts with petabyte-scale filesystems containing around a billion files. Getting there took focused effort and innovation, both in scaling existing systems and in verifying that the changes work before shipping them to our customers. One of the systems that helped us achieve this scalability is the FileSystem Simulator.


The figure shows Rubrik Cloud Data Management (CDM) ingesting a filesystem either over the network (NAS via NFS/SMB) or through an agent running on a host. NAS filesystems can range from a few hundred terabytes to more than a petabyte.

Iterate quickly without spending money on resources at scale

Filesystems come in all sorts of weird and complex topologies. A filesystem can have millions of nodes with the majority of them under a single directory, sometimes 1,000,000 files in one directory. While we have built the system to handle this complexity, we don't want to buy large licensed file storage systems from vendors and fill them with preset data for verification because:

  1. It costs more to license the filesystems

  2. It takes time to populate data and manage the resources to simply verify a new configuration

  3. Verifying the integrity of data backed up by Rubrik involves restoring the backup to a different part of the filesystem and then verifying the hash value of each file. This requires 2x the space of the original backup for a given snapshot

We have spent a lot of time on all three of the above. We wanted a solution that covers them without developers having to set up and manage those resources. So we started with the idea of simulating a given filesystem in memory, possibly taking a simple YAML file as configuration, that could look something like this:

    ### File - configuration.yaml
    parameters:
      size: 10737418240  # 10GB
      change_rate: 0.05
      growth_rate: 0.01
      files: 10000

The above configuration should give us a filesystem with 10,000 files totaling 10GB, and it should come up in less than a second. The change rate and growth rate enable changes in the filesystem between successive snapshots taken by a Rubrik cluster. The changes should affect both the files and the filesystem tree.
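To make the arithmetic behind these parameters concrete, here is a minimal Python sketch. It is not VFSim's implementation: the `SimConfig` and `plan_snapshot` names are hypothetical, YAML parsing is left out, and the interpretation of the rates (growth compounds per snapshot, the change rate applies to the files that existed in the previous snapshot) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimConfig:
    """Hypothetical in-memory mirror of the YAML parameters above."""
    size: int            # total bytes across all files
    change_rate: float   # assumed: fraction of existing files modified per snapshot
    growth_rate: float   # assumed: fractional growth in files/bytes per snapshot
    files: int           # number of files in the initial filesystem


def plan_snapshot(cfg: SimConfig, snapshot: int) -> dict:
    """Estimate the filesystem shape at a snapshot index (0 = initial state)."""
    # Assumption: growth compounds once per snapshot.
    growth = (1 + cfg.growth_rate) ** snapshot
    files = round(cfg.files * growth)
    size = round(cfg.size * growth)
    if snapshot == 0:
        changed = 0  # nothing to diff against yet
    else:
        files_before = round(cfg.files * (1 + cfg.growth_rate) ** (snapshot - 1))
        changed = round(files_before * cfg.change_rate)
    return {
        "files": files,
        "size": size,
        "mean_file_size": size // files,
        "changed_files": changed,
    }
```

With the values from the configuration above, snapshot 0 has 10,000 files averaging roughly 1MB each, and under these assumptions snapshot 1 would modify about 500 files and grow to about 10,100.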

VFSim - A Verifiable Filesystem Simulator

Starting with the simple configuration above, over time we ended up building a comprehensive simulator with many capabilities:

  • Ability to simulate filesystems based on a seed

    • This ensures that we always generate the same filesystem and file contents with the given seed and parameters

    • This helps us reproduce a bug seen while using the simulator, simply by reusing the same seed

  • Ability to simulate snapshots for the filesystem

    • Rubrik backs up filesystems periodically based on an SLA. So the simulator has the concept of snapshots, each containing incremental changes, while still following the basic rule of producing exactly the same data for a given snapshot and seed

  • Ability to verify snapshots previously generated

    • Let’s say Rubrik has taken 10 snapshots of the filesystem and now wants to restore the backup of the second snapshot. The simulator allows the Rubrik cluster to restore any previously backed-up snapshot to the simulator. The simulator then verifies, byte for byte, that the data written is exactly the same as the data it produced for that snapshot.

  • Minimal setup required

    • It takes less than a minute to initialize a PB-scale filesystem and consume it. The Rubrik cluster cannot tell whether it is backing up from a simulator or from a real host

  • In-memory data generation

    • It does not need any persistent storage for data generation and verification. This is important because bypassing slow disk read/write operations lets us reach high throughput

  • Complex data generation

    • It generates complex modifications across entire datasets (dedupe, hardlinks) between generations

    • For example, it can quickly give a filesystem with:

      • 10 million files with 100TB total size

      • Exponentially distributed file sizes with a mean of around 100MB

      • 1000 files on average per directory

      • Add 5% additional files/bytes between snapshots. This includes both changes to existing files and new files added to the filesystem tree.

      • Remove 3% existing files/bytes between snapshots
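The seed-based generation described in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how VFSim is built: the function names, the path layout, and the fixed 1000 files per directory are hypothetical, and the sketch only adds and removes whole files (it does not model in-place changes, dedupe, or hardlinks).

```python
import random


def initial_files(seed: int, count: int, mean_size: float,
                  files_per_dir: int = 1000) -> dict[str, int]:
    """Build a path -> size map whose contents depend only on the inputs."""
    rng = random.Random(seed)
    files = {}
    for i in range(count):
        # Exponentially distributed size; clamp to at least one byte.
        size = max(1, int(rng.expovariate(1.0 / mean_size)))
        files[f"/dir{i // files_per_dir}/file{i}"] = size
    return files


def next_snapshot(files: dict[str, int], seed: int, snapshot: int,
                  mean_size: float, add_rate: float = 0.05,
                  remove_rate: float = 0.03) -> dict[str, int]:
    """Derive the next snapshot by removing and adding files deterministically."""
    # A per-snapshot generator keeps each snapshot reproducible on its own.
    rng = random.Random(f"{seed}:{snapshot}")
    paths = sorted(files)  # sorted so sampling does not depend on dict order
    removed = set(rng.sample(paths, round(len(paths) * remove_rate)))
    result = {p: s for p, s in files.items() if p not in removed}
    for i in range(round(len(paths) * add_rate)):
        size = max(1, int(rng.expovariate(1.0 / mean_size)))
        result[f"/snap{snapshot}/new{i}"] = size
    return result
```

Because every random choice flows from the seed and the snapshot index, rerunning the same calls reproduces the same tree, which is the property that makes a bug seen under one seed repeatable.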


The diagram above shows how VFSim is used to verify the data that was generated for snapshot 0, even after the cluster has ingested two more snapshots.
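One way to verify restored data without storing anything is to make each file's bytes a pure function of its identity, so the expected bytes can be regenerated at any later time. The Python sketch below illustrates that idea only; the key layout, the `version` field (taken here to be the snapshot that last wrote the file), the block size, and the use of SHAKE-128 as the byte generator are all assumptions, not VFSim's actual design.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical generation granularity


def expected_bytes(seed: int, path: str, version: int,
                   offset: int, length: int) -> bytes:
    """Regenerate bytes [offset, offset + length) of a simulated file."""
    out = bytearray()
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for idx in range(first, last + 1):
        # Each block depends only on (seed, path, version, block index).
        key = f"{seed}:{path}:{version}:{idx}".encode()
        out += hashlib.shake_128(key).digest(BLOCK_SIZE)
    start = offset - first * BLOCK_SIZE
    return bytes(out[start:start + length])


def verify_write(seed: int, path: str, version: int,
                 offset: int, data: bytes) -> bool:
    """Check a restored write byte for byte against regenerated data."""
    return data == expected_bytes(seed, path, version, offset, len(data))
```

In this sketch, a restore of snapshot 0 would call `verify_write` with the version recorded for snapshot 0 on every incoming write. Later snapshots do not interfere, because nothing about snapshot 0's bytes depends on stored state.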

In the upcoming parts of this series, we'll cover in depth how the different modules of VFSim are implemented and the technical challenges we encountered when creating such a huge filesystem entirely in memory.