More and more enterprises are adopting modern NoSQL databases like MongoDB and Apache Cassandra (DataStax) to enable rapid development of next-generation applications (AI/ML, IoT, eCommerce, customer experience). However, while these databases can help speed up application development, they lack enterprise-class recovery solutions, putting organizations at risk of data loss. While NoSQL databases offer capabilities such as cross data-center database replication, they do not provide point-in-time backup and recovery. If data errors are introduced or data is accidentally deleted, the databases’ redundant-node replication can lead to almost immediate corruption of critical data across all nodes.
In this post, I’ll dive into NoSQL data protection requirements and the technical challenges that enterprise-grade data management solutions must address.
Comprehensive data management is a must for running mission-critical applications in enterprise environments (private cloud, hybrid clouds, or public clouds). As the shift to these next-generation applications and NoSQL databases increases, we see new data management requirements emerging:
- Eventually-consistent databases require novel point-in-time techniques for consistent state across a database cluster.
- The elastic nature of next-generation databases requires backup and recovery to be highly available, scalable, and resilient to failures.
- Backups need to be maintained in native (source) formats for advanced data management services such as search and analytics.
- Recovery architecture needs to meet a spectrum of RTO requirements from the granularity of seconds to years.
- A software-only deployment model is required for on-premises or cloud-native deployments, allowing customers to choose their storage.
- A multi-database platform needs to be supported to achieve economies of scale within enterprise environments.
Technical Challenges
At Rubrik, we have built Datos IO to meet the requirements listed above for non-relational NoSQL databases. We call our “secret sauce” Consistent Orchestrated Distributed Recovery (CODR). The CODR architecture is based on two core principles: 1) handle the volume and velocity of NoSQL databases, and 2) provide deep operational automation for users of the applications and NoSQL databases without compromising data integrity.
Here are the biggest NoSQL technical challenges enterprises face:
Application-consistent backups: A key problem that is common in quorum-based replication schemes is determining the order of updates across replicas, which is essential in deciding which values should comprise a snapshot. For example, if two write requests to the same database object arrived at two different nodes at the same time, it is difficult to determine a strict ordering between the two write requests. The lack of ordering thus makes it challenging to determine the latest value of the database object at any given point in time.
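To make the ordering problem concrete, here is a minimal sketch. It assumes each write carries a writer-supplied timestamp and that conflicts are resolved last-write-wins. The `Write` type and both helper functions are hypothetical names for illustration, not any database's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    key: str
    value: str
    timestamp_us: int  # writer-supplied timestamp in microseconds (assumption)
    node: str          # replica that first accepted the write


def latest_value(writes, key, as_of_us):
    """Pick the value of `key` visible at `as_of_us` using last-write-wins.

    Ties on timestamp are broken by comparing values lexically. That is
    deterministic but arbitrary: it says nothing about which write
    really happened last.
    """
    candidates = [w for w in writes if w.key == key and w.timestamp_us <= as_of_us]
    if not candidates:
        return None
    return max(candidates, key=lambda w: (w.timestamp_us, w.value)).value


def has_ambiguous_order(writes, key):
    """Return True if two different values for `key` share one timestamp."""
    seen = {}
    for w in writes:
        if w.key != key:
            continue
        if w.timestamp_us in seen and seen[w.timestamp_us] != w.value:
            return True
        seen.setdefault(w.timestamp_us, w.value)
    return False


writes = [
    Write("cart:42", "1 item", 1_000, "node-a"),
    Write("cart:42", "2 items", 2_000, "node-a"),
    Write("cart:42", "3 items", 2_000, "node-b"),  # same instant, different node
]
```

A snapshot taken at timestamp 2,000 has to choose between the two concurrent writes. The tie-break makes the choice repeatable, but the snapshot cannot claim that the chosen value was truly the latest.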
Recovery after reconfiguration: Even if you capture a consistent point-in-time backup of a clustered NoSQL database system, recovery is challenging due to the changing nature of clustered systems topology. For instance, database topology can change (e.g. going from 3-node to 6-node or from a replica set to a sharded MongoDB cluster) between the time of backup and the time of restore. The data protection platform needs to reconcile these differences without incurring the expense of database repairs and lengthy downtime.
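One way to picture the reconciliation is to restore rows by their owner in the target topology rather than copying each node's files back one-to-one. The sketch below assumes a simple hash-modulo placement, which is a stand-in for the token ranges or shard keys a real database uses. `owner`, `backup`, and `restore` are hypothetical helpers.

```python
import hashlib


def owner(key, node_count):
    """Map a row key to a node index with a stable hash (hypothetical placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def backup(rows, node_count):
    """Group rows by the node that owned them at backup time."""
    per_node = {n: {} for n in range(node_count)}
    for key, value in rows.items():
        per_node[owner(key, node_count)][key] = value
    return per_node


def restore(per_node_backup, target_node_count):
    """Re-route every backed-up row to its owner in the target topology."""
    target = {n: {} for n in range(target_node_count)}
    for node_rows in per_node_backup.values():
        for key, value in node_rows.items():
            target[owner(key, target_node_count)][key] = value
    return target


rows = {f"user:{i}": f"profile-{i}" for i in range(1000)}
snapshot = backup(rows, node_count=3)                # taken on a 3-node cluster
restored = restore(snapshot, target_node_count=6)    # restored to a 6-node cluster
```

Because each row lands directly on the node that owns it after the change, the restored cluster should not need a separate repair pass to move data into place, at least under this simplified placement model.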
Database and node failures during backups and restore: Since NoSQL databases are built to scale to hundreds of nodes, node failures are the norm. Any backup strategy must therefore account for failures in data capture from down nodes and their impact on quorum consistency. On the reverse path, restores must also account for failed nodes in the cluster and adjust the repopulation of data accordingly.
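As a rough illustration of the quorum concern during capture, the sketch below checks which data ranges were captured from fewer than a majority of their replicas. It assumes a majority quorum and a known mapping from each range to its replica nodes. `replica_map` and the function names are hypothetical.

```python
def quorum(replication_factor):
    """Majority quorum for a given replication factor."""
    return replication_factor // 2 + 1


def ranges_missing_quorum(replica_map, captured_nodes):
    """Return the ranges whose captured replicas fall short of a quorum."""
    short = []
    for range_id, replicas in replica_map.items():
        captured = [n for n in replicas if n in captured_nodes]
        if len(captured) < quorum(len(replicas)):
            short.append(range_id)
    return sorted(short)


# Hypothetical 4-node cluster with replication factor 3.
replica_map = {
    "range-1": ["a", "b", "c"],
    "range-2": ["b", "c", "d"],
    "range-3": ["c", "d", "a"],
}

# Nodes "c" and "d" were down while the backup ran.
at_risk = ranges_missing_quorum(replica_map, captured_nodes={"a", "b"})
```

Ranges reported here would need to be retried or flagged, since a backup built from a single replica may miss writes that only the other replicas acknowledged.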
De-duplication for NoSQL databases: In traditional database systems, de-duplication is simpler because duplicate blocks are bitwise identical. In modern NoSQL database storage systems, however, data copies are not always exactly identical, because the ordering and flushing of updates differ across nodes. This creates a new challenge for de-duplication.
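The difference can be shown with a small sketch. It assumes a replica file can be read as a list of `(key, value)` rows, and it uses row-level hashing as one possible approach, not necessarily the one any product uses. All function names are hypothetical.

```python
import hashlib


def _row_hash(key, value):
    """Hash one row on its own; repr() keeps key and value unambiguous."""
    return hashlib.sha256(repr((key, value)).encode("utf-8")).hexdigest()


def block_fingerprint(rows):
    """Hash rows in their on-disk order, like a block-level checksum."""
    h = hashlib.sha256()
    for key, value in rows:
        h.update(repr((key, value)).encode("utf-8"))
    return h.hexdigest()


def row_fingerprints(rows):
    """Hash each row independently so order within a file no longer matters."""
    return {_row_hash(key, value) for key, value in rows}


def dedup(replica_files):
    """Keep one copy of each distinct row across all replica files."""
    unique = {}
    for rows in replica_files:
        for key, value in rows:
            unique.setdefault(_row_hash(key, value), (key, value))
    return sorted(unique.values())


# Two replicas hold the same rows, flushed in a different order.
replica_a = [("k1", "v1"), ("k2", "v2")]
replica_b = [("k2", "v2"), ("k1", "v1")]
unique_rows = dedup([replica_a, replica_b])
```

Block-level fingerprints of the two replicas differ even though the logical content is the same, so block-based de-duplication would store both copies. Row-level fingerprints match, and `dedup` keeps a single copy of each row.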
Scale (data size and number of nodes): Consider a large 500-node system with 100 TB of physical data capacity and a 1 KB record size. If database compaction is used, we have to process 100 billion rows (100 TB divided by 1 KB) across 10 million files. The scale challenge is that all backup and restore operations on a cluster of this size must still meet customer SLAs for RTO and RPO.
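The arithmetic behind these figures can be checked in a few lines. The sketch assumes decimal units (1 TB = 10^12 bytes, 1 KB = 10^3 bytes) and, for the per-node numbers, an even distribution of data across nodes, which the example does not state.

```python
TB = 10**12  # decimal terabyte (assumption)
KB = 10**3   # decimal kilobyte (assumption)

total_bytes = 100 * TB
record_bytes = 1 * KB
nodes = 500
files = 10_000_000

rows = total_bytes // record_bytes        # total rows in the cluster
rows_per_node = rows // nodes             # assumes even distribution
files_per_node = files // nodes           # assumes even distribution
avg_file_bytes = total_bytes // files     # average file size
rows_per_file = rows // files             # average rows per file

print(f"rows: {rows:,}")
print(f"rows per node: {rows_per_node:,}")
print(f"files per node: {files_per_node:,}")
print(f"average file size: {avg_file_bytes / 10**6:.0f} MB")
print(f"rows per file: {rows_per_file:,}")
```

Under these assumptions each node holds about 200 million rows in 20,000 files averaging 10 MB, so both per-row and per-file overheads are multiplied many times over.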
Performance (RPOs and RTOs): Consider the same system as above. If per-row processing is slow, it will not be possible to meet customer SLAs for RTO, because database file processing cannot be parallelized beyond a certain point. Recovery operations are affected by the same parallelization limits. Backup and recovery processing for 100 billion rows across 10 million files must therefore be fast, and a solution must drastically reduce the per-row and per-file processing of database files.
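To see why per-row cost matters, here is a back-of-the-envelope sketch. The 4-hour recovery window and the worker counts are assumptions chosen purely for illustration. Only the 100 billion row count comes from the example above.

```python
def per_row_budget_us(total_rows, workers, window_seconds):
    """Microseconds each worker may spend per row to finish inside the window."""
    rows_per_worker = total_rows / workers
    return window_seconds / rows_per_worker * 1_000_000


TOTAL_ROWS = 100_000_000_000   # from the 100 TB / 1 KB example
WINDOW_SECONDS = 4 * 3600      # hypothetical 4-hour RTO

budget_500 = per_row_budget_us(TOTAL_ROWS, workers=500, window_seconds=WINDOW_SECONDS)
budget_1000 = per_row_budget_us(TOTAL_ROWS, workers=1000, window_seconds=WINDOW_SECONDS)

print(f"500 workers:  {budget_500:.0f} microseconds per row")
print(f"1000 workers: {budget_1000:.0f} microseconds per row")
```

With one worker per node, the budget is on the order of tens of microseconds per row. Doubling the workers only doubles the budget, and only if the work can actually be parallelized that far.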
Data Integrity: Block-level data integrity verification is another challenge with distributed NoSQL databases. Checksums work for scale-up databases because the restored data is physically identical to the backup data. However, in scale-out databases, the restored data is semantically identical to the backup data but is not physically identical. In such a situation, we need to develop a novel mechanism to detect semantic equivalence between the restored data and the backup data that will allow us to isolate data corruption issues that might arise during the backup and restore processes.
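One simple way to test semantic rather than physical equivalence is an order-independent digest over rows. The sketch below assumes both data sets can be read as `(key, value)` rows. It sums per-row SHA-256 hashes modulo 2^256 and is a generic technique, not a description of how any particular product verifies restores. `semantic_digest` is a hypothetical name.

```python
import hashlib

MODULUS = 1 << 256


def semantic_digest(rows):
    """Return (row_count, digest) that does not depend on row order.

    Each row is hashed on its own and the hashes are summed modulo 2**256,
    so two data sets with the same rows in a different physical order
    produce the same result.
    """
    total = 0
    count = 0
    for key, value in rows:
        row_hash = hashlib.sha256(repr((key, value)).encode("utf-8")).digest()
        total = (total + int.from_bytes(row_hash, "big")) % MODULUS
        count += 1
    return count, total


backup_rows = [("a", "1"), ("b", "2"), ("c", "3")]
restored_rows = [("c", "3"), ("a", "1"), ("b", "2")]    # same rows, new layout
corrupted_rows = [("a", "1"), ("b", "2"), ("c", "X")]   # one value changed

restore_ok = semantic_digest(backup_rows) == semantic_digest(restored_rows)
corruption_detected = semantic_digest(backup_rows) != semantic_digest(corrupted_rows)
```

The row count is kept alongside the sum so that a missing or extra row shows up directly, in addition to changing the digest.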
Data Management Services: Data and the intelligence derived from it are what distinguish a successful organization from its peers. It is no secret that while a large corpus of data resides in secondary storage versions, the amount of business intelligence derived from that data is slim to none. As a result, organizations need a new approach that allows business intelligence tools to derive value from a data repository.
Enterprises across industries are leveraging next-generation NoSQL databases as they accelerate their digital transformation strategy. However, with this shift comes the need for new products and solutions that are purpose-built to address the technical challenges of the distributed application architecture. Rubrik Datos IO was built to tackle the biggest NoSQL pain points with simplicity so that enterprises can ensure protection and unlock new value from their mission-critical data.
Stay tuned for our next blog posts on how our CODR technology, a key building block of Rubrik Datos IO, tackles these NoSQL challenges and simplifies data protection for enterprises.
Interested in learning more about Rubrik Datos IO? Register for a hands-on evaluation.