NoSQL databases are becoming mainstream. As enterprises deal with the changing characteristics of data and applications, many are leveraging the inherent speed and redundancy of NoSQL databases. But the downside of this native redundancy is that NoSQL databases typically store three or more copies of each piece of data, which drives up backup and storage costs.

Rubrik Datos IO (RDIO) provides powerful space efficiency capabilities in a modern data management product purpose-built for distributed architectures such as NoSQL databases. Our approach helps customers protect their NoSQL database deployments with an easy-to-use data protection solution that enables them to achieve up to 96.4% in backup storage savings. This blog post explores two major features that make this possible: semantic deduplication and incremental forever.

Rubrik Datos IO’s industry-first semantic deduplication is the answer to the fundamental shortcomings that traditional block-level deduplication encounters in the world of modern distributed NoSQL databases (MongoDB, Apache Cassandra, DataStax). I’ll dive into how these limitations show up in two distributions of Cassandra, DataStax Enterprise and Apache Cassandra, and how RDIO addresses them. At a high level, these deduplication shortcomings fall into two categories: compression and housekeeping.

One of the main reasons NoSQL deduplication is challenging is that the majority of organizations run compressed tables, which makes block-based deduplication ineffective. As we have all experienced, compression and deduplication mix like oil and water. Cassandra has the additional challenge of performing ongoing housekeeping tasks such as compaction. As the Apache Cassandra documentation states, “The concept of compaction is used for different kinds of operations in Cassandra, [but] the common thing about these operations is that it takes one or more sstables and output new sstables.” The rewritten sstables look like new data at the block level even though the records they hold are logically unchanged. Semantic deduplication operates at the logical level, which lets it overcome both of these challenges and eliminate the redundant replicas that are inherent to NoSQL databases.
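To make the difference concrete, here is a minimal Python sketch of the idea. It is not RDIO's implementation: the JSON-lines "sstable" format, the tiny block size, and every function name are assumptions made up for illustration. Two replicas hold the same rows, written in a different order (as compaction might do) and then compressed. Fingerprinting fixed-size blocks of the files compares compressed bytes, while fingerprinting the decoded records compares the rows themselves.

```python
import hashlib
import json
import zlib

BLOCK_SIZE = 64  # hypothetical; kept tiny so this small example spans several blocks


def serialize(records):
    """Write records as compressed JSON lines (a stand-in for a compressed sstable)."""
    raw = "\n".join(json.dumps(r, sort_keys=True) for r in records).encode()
    return zlib.compress(raw)


def block_fingerprints(blob):
    """Block-level view: hash fixed-size chunks of the stored bytes."""
    return {
        hashlib.sha256(blob[i:i + BLOCK_SIZE]).hexdigest()
        for i in range(0, len(blob), BLOCK_SIZE)
    }


def record_fingerprints(blob):
    """Logical view: decompress, then hash each record on its own."""
    raw = zlib.decompress(blob)
    return {hashlib.sha256(line).hexdigest() for line in raw.splitlines()}


records = [{"id": i, "name": f"user-{i}"} for i in range(50)]

replica_a = serialize(records)
# Same logical rows, but rewritten in a different order before compression.
replica_b = serialize(list(reversed(records)))

shared_blocks = block_fingerprints(replica_a) & block_fingerprints(replica_b)
shared_records = record_fingerprints(replica_a) & record_fingerprints(replica_b)

print("blocks in replica A:", len(block_fingerprints(replica_a)))
print("blocks shared with replica B:", len(shared_blocks))
print("records shared with replica B:", len(shared_records))
```

The two files differ byte for byte, so a block-level comparison finds little or nothing to share, yet every record fingerprint matches. A deduplicator working at the record level would need to keep only one copy.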

In addition to semantic deduplication, RDIO provides operational and space efficiencies by creating incremental-forever backups via a simple-to-use GUI, CLI, and API. Operational efficiency comes from avoiding the rat’s nest of scripts, snapshots, hard-links, and job coordination that admins have become familiar with when attempting to manage Cassandra rings that depend on ‘nodetool snapshot.’ To realize incremental savings with the native tools, admins have to manually manage the relationship between the base snapshot, incremental hard-links, schemas, and commit logs in a coordinated fashion across numerous nodes, not to mention the post-snapshot task of moving the interrelated objects off each node to provide an acceptable level of protection. They then have to set up jobs to purge all of these objects at the desired retention intervals. Even when all of these steps are taken to set up a viable snapshot-based approach, the native solution provides no form of deduplication.
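As a rough illustration of what "incremental forever" means, the Python sketch below takes one full backup and then, on every later run, ships only the records the backup store has not seen before. It is a toy model under stated assumptions, not RDIO's design: the `BackupStore` class, its methods, and the sample data are all hypothetical.

```python
import hashlib
import json


class BackupStore:
    """Hypothetical backup target that stores each distinct record once."""

    def __init__(self):
        self.chunks = {}    # fingerprint -> serialized record
        self.versions = []  # one manifest (ordered fingerprints) per backup run

    def backup(self, records):
        """Record a new version; return how many records had to be shipped."""
        manifest = []
        shipped = 0
        for record in records:
            payload = json.dumps(record, sort_keys=True).encode()
            fingerprint = hashlib.sha256(payload).hexdigest()
            if fingerprint not in self.chunks:
                self.chunks[fingerprint] = payload  # new or changed record
                shipped += 1
            manifest.append(fingerprint)
        self.versions.append(manifest)
        return shipped

    def restore(self, version):
        """Rebuild any version from its manifest, with no chain of fulls to replay."""
        return [json.loads(self.chunks[fp]) for fp in self.versions[version]]


store = BackupStore()

day1 = [{"id": i, "value": "v1"} for i in range(100)]
# On day 2, five rows change and one row is added.
day2 = (
    day1[:95]
    + [{"id": i, "value": "v2"} for i in range(95, 100)]
    + [{"id": 100, "value": "v1"}]
)

first = store.backup(day1)   # the initial full backup ships every record
second = store.backup(day2)  # later runs ship only what is new or changed

print("shipped on day 1:", first)
print("shipped on day 2:", second)
```

Each version is just a manifest of fingerprints, so any point in time can be restored directly, and unchanged records are never copied again.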

RDIO takes a radically different approach with an orchestrated end-to-end solution that delivers semantic deduplication and incremental-forever backups, all copied to backup storage (object, cloud native, NFS, etc.) with just a few clicks.

Check out the full report for a deeper dive into our deduplication findings. Want to learn more? Contact us, and we’ll be thrilled to show you how easy RDIO makes this, plus a lot more (think automating test/dev refresh)!