Secure Your RAG Pipeline from Sensitive Data Leaks with Rubrik Annapurna

As enterprises accelerate their adoption of Retrieval-Augmented Generation (RAG) to leverage internal data with GenAI, the focus shifts to building these systems securely. Protecting sensitive information is non-negotiable.

The reality of enterprise data is that it's messy and often contains sensitive information scattered across countless files and systems. We're talking about things like customer PII, financial records, protected health information, or confidential business details. The challenging part? Enterprises are frequently unaware of every location where this sensitive data resides within their vast data landscape.

Rubrik Annapurna takes the complexity out of this critical challenge by enabling secure, sensitive data-aware RAG built on the data you already trust us to protect. Rubrik Annapurna allows enterprises to easily build RAG systems directly on top of files backed up within Rubrik Security Cloud. Instead of building complex, separate data pipelines to feed information to your RAG applications or trusting your data with another product, Annapurna leverages the data already within your Rubrik environment.

RAG and solving the sensitive data problem

Building a RAG system on top of your vast stores of enterprise data without careful consideration is like opening a firehose of potentially sensitive information. A user asking an innocent question could inadvertently retrieve data they should never see, or trigger an LLM response that includes it. Further, sending sensitive enterprise data to an external GenAI embedding model or LLM means routing that data to a party outside the enterprise.

For many organizations, these risks are non-starters for adopting RAG at scale due to compliance mandates and the critical need to prevent data exposure. It is absolutely vital that a RAG system built on enterprise data includes robust mechanisms to identify and filter sensitive information before it reaches end users or other organizations.

Securing data is in Rubrik's DNA. Since 2018, Rubrik has been at the forefront of data security, helping organizations protect their most critical asset against ransomware and other threats. This isn't a new area for us; systems like Radar and Rubrik Data Security Posture Management (DSPM) have positioned Rubrik with sophisticated, battle-tested technology for discovering and understanding sensitive data across diverse environments.

This deep expertise gives Rubrik a unique advantage when it comes to RAG. We're not starting from scratch with sensitive data detection for AI. We're leveraging existing, mature frameworks already integrated within Rubrik Security Cloud.

With this foundation, Rubrik has adopted what we call a zero trust approach to sensitive data in RAG systems. There are two main facets of this approach. Firstly, in the Annapurna framework, sensitive data is identified and filtered before the data is even embedded as a vector in a vector database. Secondly, even after the sensitive data has been filtered out, the actual contents of files are not stored anywhere outside of Rubrik.

This approach offers the highest level of security assurance. Sensitive data is not only prevented from being stored externally or appearing in retrieved results and LLM-generated responses, but it is also not reflected in embeddings within the vector database itself. This proactive filtering minimizes the attack surface and reduces the risk of sensitive data exposure at multiple layers of the RAG architecture.
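To make the second facet concrete, one way to picture it is a vector record that carries an embedding and a pointer back into the backup, but no document text. The sketch below is illustrative only: `VectorRecord`, `build_record`, and the field names are hypothetical and do not describe Annapurna's actual schema, and `toy_embed` is a stand-in for a real embedding model.

```python
from dataclasses import dataclass, asdict
from typing import List


@dataclass(frozen=True)
class VectorRecord:
    """Hypothetical record shape: an embedding plus a pointer, never the text."""
    embedding: List[float]
    file_id: str  # assumed identifier of the backed-up file
    start: int    # character offset where the chunk begins
    end: int      # character offset where the chunk ends (exclusive)


def toy_embed(text: str) -> List[float]:
    """Stand-in for a real embedding model; returns a tiny deterministic vector."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def build_record(file_id: str, text: str, start: int, end: int) -> VectorRecord:
    # The (already filtered) chunk is embedded, then the text itself is discarded.
    return VectorRecord(toy_embed(text[start:end]), file_id, start, end)


record = build_record("file-123", "Quarterly roadmap review notes", 0, 9)
print(asdict(record))
```

Under this assumed design, the chunk text would be re-read from the protected copy at retrieval time using `file_id`, `start`, and `end`, so nothing outside the backup holds the contents.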

Integrating Rubrik DSPM with Annapurna

Annapurna’s sensitive data detection reuses the same detection policies that already exist as part of Rubrik DSPM. This allows organizations to apply the consistent, granular policies they've already defined for their broader data security posture directly to their RAG data sources. No reconfiguration or complex custom setup is necessary.

This flexibility also means you can tailor exactly what constitutes must-filter sensitive data for each specific RAG application. When configuring a RAG system on Annapurna, simply select which policies you would like to have in effect, and the rest is taken care of for you.
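As a rough illustration of per-application policy selection, the sketch below models two RAG applications that enable different subsets of policies. The policy names, data types, and the `RagAppConfig` class are all hypothetical; in practice the policies are the ones already defined in Rubrik DSPM, and selection happens in the product rather than in code like this.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet

# Hypothetical policies, each mapped to the sensitive data types it covers.
POLICIES: Dict[str, FrozenSet[str]] = {
    "PCI": frozenset({"CREDIT_CARD"}),
    "US_PII": frozenset({"SSN", "EMAIL"}),
    "HIPAA": frozenset({"MEDICAL_RECORD_NUMBER"}),
}


@dataclass(frozen=True)
class RagAppConfig:
    """Assumed per-application config: just a name and the policies in effect."""
    name: str
    enabled_policies: FrozenSet[str]

    def must_filter(self, data_type: str) -> bool:
        # A match is handled only if some enabled policy covers its type.
        return any(data_type in POLICIES[p] for p in self.enabled_policies)


support_bot = RagAppConfig("support-bot", frozenset({"PCI", "US_PII"}))
wiki_bot = RagAppConfig("wiki-bot", frozenset({"US_PII"}))

print(support_bot.must_filter("CREDIT_CARD"))  # True
print(wiki_bot.must_filter("CREDIT_CARD"))     # False
```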

Once the policies for an Annapurna RAG system are set, the sensitive data filtering takes place during the vector embedding process. This is when the contents of your enterprise source documents are vectorized and embedded into a vector database for future retrieval. The following demonstrates this workflow at a high level:

Content Parsing: Document contents are extracted and parsed from the source files within your Rubrik backups.
Sensitive Data Detection: Annapurna integrates with Rubrik DSPM to scan the parsed document contents and identify sensitive data.
Metadata Capture: When sensitive data is detected, the system captures detailed metadata about the match. This includes information such as the exact start and end offsets of the sensitive data within the text and the specific type of sensitive data detected (e.g., credit card number, Social Security Number, email address).
Policy Configuration: Annapurna checks whether the policies configured for the RAG system require the sensitive data match to be handled or whether it can be safely ignored.
Filtering and Masking: If a detected match belongs to a configured policy, Annapurna takes action before the data is chunked and sent for embedding. Annapurna supports two methods of handling sensitive data matches:
- Dropping Information: The simplest approach is to drop the data containing the sensitive match. This could mean dropping the entire document if it contains sensitive data, or more granularly, dropping just the specific chunked section of the document that contains the match.
- Masking (available August 2025): Alternatively, the metadata of the match (like offsets and type) can be forwarded to a masking service. This service can then replace the instance of sensitive data with an appropriate masking string (e.g., replacing a credit card number with [CARD_NUMBER] or **REDACTED**). The masked data is then used for chunking and embedding, and the original sensitive value is not ingested into the vector database.
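The steps above can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not Annapurna's implementation: the regex detectors stand in for Rubrik DSPM, fixed-size character chunking is assumed, and `Match`, `detect`, `mask`, and `drop_chunks` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Match:
    """Metadata captured for one detection: offsets and data type."""
    start: int
    end: int  # exclusive
    data_type: str


# Toy detectors standing in for Rubrik DSPM (assumption, not the real classifiers).
DETECTORS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}


def detect(text: str) -> List[Match]:
    found = [Match(m.start(), m.end(), t)
             for t, rx in DETECTORS.items() for m in rx.finditer(text)]
    return sorted(found, key=lambda m: m.start)


def mask(text: str, matches: List[Match], enabled: Set[str]) -> str:
    """Replace each policy-relevant match with a type placeholder."""
    out, cursor = [], 0
    for m in matches:
        if m.data_type not in enabled or m.start < cursor:
            continue  # ignored by policy, or overlaps an already-masked span
        out.append(text[cursor:m.start])
        out.append(f"[{m.data_type}]")
        cursor = m.end
    out.append(text[cursor:])
    return "".join(out)


def chunk_spans(length: int, size: int) -> List[Tuple[int, int]]:
    return [(i, min(i + size, length)) for i in range(0, length, size)]


def drop_chunks(text: str, matches: List[Match], enabled: Set[str], size: int) -> List[str]:
    """Keep only chunks that do not overlap a policy-relevant match."""
    flagged = [m for m in matches if m.data_type in enabled]
    return [text[s:e] for s, e in chunk_spans(len(text), size)
            if not any(m.start < e and m.end > s for m in flagged)]


doc = "Contact jane@example.com. SSN 123-45-6789 on file. Renewal is due in March."
hits = detect(doc)
policy_types = {"SSN"}  # only SSNs are must-filter for this hypothetical app

print(mask(doc, hits, policy_types))
print(drop_chunks(doc, hits, policy_types, size=25))
```

In this sketch the email address is detected but left in place because no enabled policy covers it, which mirrors the policy check described above. Note that a match straddling a chunk boundary causes both chunks to be dropped, since the overlap test is applied per chunk.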

This integrated, policy-driven approach means that as your data is prepared for RAG, sensitive information is proactively identified and handled according to your organization's security and compliance requirements, right at the source: your Rubrik-protected data.

Building on a foundation of trust

Rubrik is building Annapurna on our core strength: securing the world's data. We understand the paramount importance of data security and governance, and Annapurna extends that trust to the burgeoning world of enterprise AI.

By leveraging the sophisticated sensitive data detection capabilities already present within Rubrik Security Cloud and implementing a zero trust approach, Rubrik Annapurna provides a secure and efficient way to unlock the value of your protected data with RAG, helping you climb towards that summit of AI-driven insights with confidence.

Check out this demo to see Rubrik Annapurna in action.

Any unreleased services or features referenced in this document are not currently available and may not be made generally available on time or at all, as may be determined in our sole discretion. Any such referenced services or features do not represent promises to deliver, commitments, or obligations of Rubrik, Inc. and may not be incorporated into any contract. Customers should make their purchase decisions based upon services and features that are currently generally available.
