How to Back Up Cassandra Databases

Today’s data is increasingly vast and unstructured, making it hard to store in traditional relational databases with a fixed table structure. As enterprises migrate more applications and data to the cloud, they’re also investing in developing modern, cloud-native applications that must be deployed quickly. These modern applications need a data store that’s agile, flexible, and high-performing. This is where non-relational NoSQL (“Not Only SQL”) databases like Apache Cassandra step in.

A Cassandra database is designed for high-volume data sets and is especially suited to enterprise use cases such as Internet of Things (IoT) and personalization. Cassandra stores data in column families, which are similar to tables in relational databases. Each column family contains columns and rows, and each row has a unique key. Columns can be added on the fly, and data is accessed using the Cassandra Query Language (CQL), which is similar to SQL in syntax. Because Cassandra is non-relational, it stores and retrieves data differently than relational databases do.

Likewise, NoSQL databases require a different data protection and backup solution. Traditional backup methods address the requirements of structured applications on relational databases that use shared storage. They fall short of addressing the point-in-time backup requirements of modern applications based on NoSQL distributed databases like Cassandra.

Ideally, organizations deploying NoSQL-based applications need both backup and replication to successfully protect their data.

Cassandra Database Backup Requirements

While native backup tools exist, they don’t fully address Cassandra’s architectural nuances. Following are some of Cassandra’s key backup and recovery requirements.

Requirement #1–Point-in-time backup. Native database backup tools take node-by-node snapshots, which are essentially local snapshot backups. But this method is error-prone and not scalable. Moreover, it doesn’t provide the point-in-time backup you need to recover from a data loss disaster.
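As a rough illustration of why node-by-node snapshots are hard to coordinate, the sketch below builds one nodetool snapshot command per node. The host names, keyspace name, tag format, and helper function are hypothetical, and it assumes nodetool can reach each node remotely. The commands are only constructed, not executed; each one would run independently against its own node, and nothing in them guarantees a single cluster-wide point in time.

```python
from datetime import datetime, timezone


def build_snapshot_commands(hosts, keyspace, when=None):
    """Return one nodetool snapshot command (as an argv list) per node.

    hosts and keyspace are illustrative placeholders, not values from the article.
    """
    when = when or datetime.now(timezone.utc)
    tag = when.strftime("backup-%Y%m%dT%H%M%SZ")
    # Each command targets a single node; nothing coordinates them cluster-wide.
    return [
        ["nodetool", "-h", host, "snapshot", "-t", tag, "--", keyspace]
        for host in hosts
    ]


if __name__ == "__main__":
    hosts = ["cass-node-1", "cass-node-2", "cass-node-3"]  # hypothetical hosts
    for argv in build_snapshot_commands(hosts, "sensor_data"):
        print(" ".join(argv))
```

Sharing one tag across nodes makes the per-node snapshots easier to find later, but it does not make them consistent with one another.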

Requirement #2–Any point-in-time recovery. Enterprises frequently need to refresh their test/dev clusters with the latest production data to enable continuous integration (CI) and continuous delivery (CD). Test/dev clusters often have different topologies, with different numbers of nodes, than production database clusters. It takes hours or days to refresh each cluster using native solutions, impacting developer productivity.

Requirement #3–Data masking. Native tools don’t have the option to mask out particular columns while recovering confidential data. This shortfall can have negative implications for enterprises that handle sensitive data such as personally identifiable information.
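A minimal sketch of what column masking during recovery could look like, assuming restored rows are available as Python dictionaries. The function name, the column names, and the choice of a truncated SHA-256 digest are illustrative assumptions, not a description of any particular product. An unsalted hash is shown only for simplicity and is not strong anonymization on its own.

```python
import hashlib


def mask_row(row, masked_columns):
    """Return a copy of row with the named columns replaced by a short digest."""
    masked = dict(row)  # copy so the source row is left untouched
    for column in masked_columns:
        if column in masked and masked[column] is not None:
            digest = hashlib.sha256(str(masked[column]).encode("utf-8")).hexdigest()
            masked[column] = "masked:" + digest[:12]
    return masked


if __name__ == "__main__":
    restored = {"user_id": 42, "email": "jane@example.com", "plan": "pro"}
    print(mask_row(restored, ["email"]))
```

Hashing rather than blanking a value keeps equal inputs equal after masking, which can matter when test data still needs to join on the masked column.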

Requirement #4–Node failure handling during backup and recovery. With a native solution, if a source node fails during backup, the backups for that node stop, which can cause data loss or inconsistency in a backed-up data set. You need your database backup solution to perform optimally when node failures occur.

Requirement #5–Time-to-live (TTL) support. There’s no way to adjust TTL during restores using a native backup solution. If a record’s TTL has already expired by the time of recovery, Cassandra automatically expires the restored data.
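The TTL problem comes down to simple arithmetic, sketched below under the assumption that the expiration time is fixed when the data is written and travels with the backup. The function name and the sample timestamps are hypothetical.

```python
def remaining_ttl(original_ttl_s, written_at_s, restored_at_s):
    """Seconds of life left at restore time; 0 means the data is already expired."""
    elapsed = restored_at_s - written_at_s
    return max(0, original_ttl_s - elapsed)


if __name__ == "__main__":
    one_day = 24 * 60 * 60
    # Written with a one-day TTL, restored one hour later: still live.
    print(remaining_ttl(one_day, written_at_s=0, restored_at_s=3600))
    # Same data restored two days later: nothing left to read.
    print(remaining_ttl(one_day, written_at_s=0, restored_at_s=2 * one_day))
```

A backup tool that can reset or extend TTL at restore time is effectively substituting a new value for the original TTL before the data is loaded back.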

Requirement #6–Granular-level data protection. Native solutions mostly perform only keyspace-level backup with no flexibility to do column family-level backup. So, all column families in a keyspace will be backed up using the same backup frequency and retention policy. Even column families that aren’t needed but are in the same keyspace will be backed up.

Requirement #7–Efficient database backup storage. Native backup solutions can increase your storage costs. For example, all replica copies are kept in backup storage; these backups are stored on Cassandra nodes as well as in optional secondary storage. Also, newly generated compacted SSTables (immutable data files that Cassandra uses for persisting data on disk) are backed up in addition to the SSTables from which they originated. All of this activity results in increased database backup storage costs.
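To put rough numbers on the replica overhead, the sketch below estimates the size of a backup that keeps every replica copy. The function names and the sample figures (500 GB of unique data, replication factor 3) are assumptions for illustration only, and the estimate ignores compression and the extra compacted SSTables mentioned above.

```python
def raw_backup_size_gb(unique_data_gb, replication_factor):
    """Size of a backup that keeps every replica copy."""
    return unique_data_gb * replication_factor


def redundant_fraction(replication_factor):
    """Fraction of a full-replica backup that is duplicate data."""
    return (replication_factor - 1) / replication_factor


if __name__ == "__main__":
    unique_gb, rf = 500, 3  # hypothetical cluster
    print(raw_backup_size_gb(unique_gb, rf))  # 1500
    print(redundant_fraction(rf))
```

With a replication factor of 3, two of every three backed-up bytes are copies of data already in the backup.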

Rubrik Mosaic for NoSQL Database Backup

Rubrik Mosaic is purpose-built software designed specifically to solve the challenges of backup and recovery for modern NoSQL databases and big data file systems. Rubrik Mosaic addresses the Apache Cassandra backup requirements noted here along with other key requirements.

Rubrik Mosaic:

Delivers point-in-time backups at user-specified intervals
Streams data in parallel to secondary backup storage. Backups are cluster-consistent across any Cassandra cluster size.
Uses patented semantic deduplication technology to cut on-premises and cloud storage costs up to 70%
Automates test/dev refresh with column family-level granularity and cross-cloud mobility. Any-to-any topology restore delivers data from and to unlike clusters.
Provides advanced support for TTL

Learn more about how Rubrik can protect your Cassandra data. And check out our white paper, the Definitive Guide to Backup and Recovery for Cassandra.

Frequently asked questions

How do I back up a Cassandra database?

Cassandra databases can be backed up to secure cloud storage using Rubrik. Rubrik’s software provides point-in-time backups and is well suited to high-volume data sets.

What is Cassandra backup?

A Cassandra backup is a point-in-time snapshot of a Cassandra database that is saved in a secure location and verified through regular restore testing.

Where is data stored in Cassandra?

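Cassandra first records every write in a commit log on disk and in an in-memory structure called a memtable. When a memtable fills up, Cassandra flushes it to disk as an immutable SSTable, which is where data is stored long term. The commit log provides a durable record of writes that have not yet been flushed to SSTables.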
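In Cassandra’s write path, each write is appended to a commit log and applied to an in-memory memtable, and memtables are later flushed to immutable SSTables on disk. The toy model below sketches that flow. The class, its flush threshold, and the read logic are simplified assumptions for illustration; real Cassandra reconciles values by timestamp and manages commit log segments very differently.

```python
class ToyNode:
    """Simplified model of a single node's write path (not real Cassandra code)."""

    def __init__(self, flush_threshold=2):
        self.commit_log = []   # append-only record of unflushed writes
        self.memtable = {}     # in-memory view of recent writes
        self.sstables = []     # immutable, sorted "files", oldest first
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.memtable:
            # A tuple stands in for an immutable on-disk SSTable.
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}
            self.commit_log.clear()  # flushed writes no longer need replay

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):  # newest SSTable first
            for k, v in sstable:
                if k == key:
                    return v
        return None


if __name__ == "__main__":
    node = ToyNode()
    node.write("sensor-1", 20.5)
    node.write("sensor-2", 19.0)  # reaches the threshold and triggers a flush
    print(node.sstables)
```

A snapshot-based backup captures the SSTables that exist at that moment, so writes still sitting only in the memtable and commit log are not included unless they are flushed first.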