Inside Rubrik AI: The Lessons We Learned Architecting an AI Agent From Scratch

In this blog

Starting With a Hard Constraint: Beyond a Chatbot
Lesson 1: Architect for Capability Growth, Not Today's Features
Lesson 2: Multi-Agent Beats Mega-Agent
Lesson 3: Don't Build the Runtime, Pick the Right Framework
Lesson 4: Keep the Agent Count Small and Compound Through APIs and Skills
Lesson 5: Skills are How Rubrik AI Scales Beyond the Platform Team
Lesson 6: Skills Are Cheap to Write, Expensive to Get Wrong
Lesson 7: Evals Everywhere Because If You Can't Measure It, You Can't Ship It
Lesson 8: Pivot Fast When the Landscape Moves
Lesson 9: Security & Human-in-the-Loop are Not Features, They Are the Foundation
Lesson 10: Ship Chat First, Then Earn UI
What I'd Tell Someone Building Their Own
Looking Forward
One More Thing: Shout Out to the Team!

Rubrik AI is a secure AI Agent embedded in Rubrik Security Cloud (RSC). We built it from the ground up. Customers use it to answer questions about their environment, troubleshoot issues, generate reports and scripts, run security workflows, and take actions on their data. It operates across many languages and the full breadth of the RSC platform.

But how did we get here? We had to architect the system from a blank page to production. Along the way, we made a few architectural bets and trade-offs and learned lessons that shaped the system as it grew.

When we set out to build Rubrik AI, we asked ourselves a deliberately uncomfortable question: what if Rubrik AI acted like the best Rubrik expert in the room? Could it help a support engineer with a thorny error code? Could it solve an API or protocol edge case for a development engineer? Would a solutions engineer turn to it for a design call? Could a product manager get a satisfactory explanation of the "why" behind a workflow?

Here are some scenarios we sketched out:

Category	Scenario
Proactive recommendation	"Of all my CDM clusters, which should be upgraded soonest — and why?"
Activity analysis	"Who were the most active SharePoint file users last month?"
Fleet aggregate	"How many snapshots failed in 24 hours, grouped by object type?"
Custom report	Weekly compliance report — three workload types, seven recipients.
Step-by-step SOP	Walk through Oracle 19c Data Guard recovery, mid-incident.
Error diagnosis	Pasted RBK20053 event — cause, correlated logs, fix.
Multi-step troubleshooting	"vCenter is up, cluster is up — why only one ESXi host disconnected?"
Threat hunt	Triage GREYMIRROR — IOC matches, YARA hits, next actions.
Audit & identity	"How many login attempts by user [EMAIL] are in the audit logs?"
Architectural guidance	Air-gapped archival across three regions, trade-offs spelled out.
Configuration & onboarding	"How do I configure Azure archive with container-level immutability?"
Recovery edge case	"How do I recover Teams personal chats?"
Multilingual — French	"Comment obtenir un jeton JWT via SIHM?"
Multilingual — Japanese	"RSCのGraphQL APIが断続的に502を返します。原因と対策は？"

We were after something that could safely address all of these scenarios in a customer's live environment and extend to more categories over time. Here’s how we did it.

Starting With a Hard Constraint: Beyond a Chatbot

Chat is how customers reach Rubrik AI—that's the surface. What had to sit behind it was the hard part.

The first thing we agreed on as a team was that Rubrik AI had to be more than a chatbot. It could not be a wrapper around an LLM that knows a few Rubrik facts. Nor could it be a vector search over our docs with a chat UI.

Read that list above again. Every item requires something a generic assistant cannot do: reach into the customer's own environment, understand Rubrik's domain at depth, follow a Rubrik-specific procedure to completion, and do all of it with the customer's permissions, not ours.

Rubrik AI is purpose-built. It understands Rubrik's domain. It is grounded in our official documentation. Critically, it can also act on the customer's Rubrik-managed environment through the same APIs and permissions that power the UI they already use.

That third point is the one that took the most engineering discipline to get right.

Lesson 1: Architect for Capability Growth, Not Today's Features

When we scoped Rubrik AI v1, the easy path would have been to hardcode a handful of high-value flows: "show failed snapshots," "list clusters," "open a support case." Ship it, declare victory, and iterate from there.

We didn't do that. The scenarios above include fleet-wide queries, reports, onboarding, identity, and recovery edge cases. We could not have predicted, ticket by ticket, which specific flows customers would ask for. So we made a deliberate bet on two compounding axes rather than a fixed feature list:

Every GraphQL query and mutation we add to RSC becomes a Rubrik AI capability for free. RSC's API surface grows constantly, adding new products, new resources, new operations. Rubrik AI's reach should grow with it, not lag a quarter behind.
Every skill any team in the company writes becomes a new Rubrik AI capability. Skills are how domain expertise compounds—diagnostic procedures, runbooks, onboarding flows, threat-hunt playbooks—without the platform team being the bottleneck.

The two axes multiply. APIs give Rubrik AI reach, expanding what it can see and do across the platform. Skills give Rubrik AI depth: how well it handles a specific domain procedure once it gets there. Together, they mean Rubrik AI's usefulness compounds along two independent dimensions, neither of which requires the Rubrik AI team to ship custom code per use case.

That dual bet shaped almost everything that followed. It pushed us toward an agent that discovers APIs rather than one that embeds a fixed menu of them. It pushed us to invest early in a skills format, a quality bar, and a one-command authoring workflow so other teams could contribute on day one. It was slower in the first month. It paid back tenfold by month six.
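To make "discovers APIs rather than embeds a fixed menu" concrete, here is a minimal sketch that builds the list of available read operations from GraphQL introspection data instead of a hardcoded list. The `__schema`, `queryType`, `types`, and `fields` keys follow the standard GraphQL introspection response shape; the function name `discover_queries` and the field names `clusters` and `failedSnapshots` are invented for illustration and are not taken from RSC.

```python
import copy


def discover_queries(introspection: dict) -> dict:
    """Map each root query field to its description, read from introspection data."""
    schema = introspection["__schema"]
    root_name = schema["queryType"]["name"]
    for gql_type in schema["types"]:
        if gql_type["name"] == root_name:
            return {f["name"]: f.get("description") or "" for f in gql_type["fields"]}
    return {}


# Hypothetical schema fragment; the field names are invented for this sketch.
schema_v1 = {
    "__schema": {
        "queryType": {"name": "Query"},
        "types": [
            {
                "name": "Query",
                "fields": [{"name": "clusters", "description": "List clusters."}],
            }
        ],
    }
}

# A later release adds a field. The discovered menu grows with no agent code change.
schema_v2 = copy.deepcopy(schema_v1)
schema_v2["__schema"]["types"][0]["fields"].append(
    {"name": "failedSnapshots", "description": "Snapshots that failed in a time window."}
)

print(sorted(discover_queries(schema_v2)))
```

The design point is that the agent's menu is derived from the schema at run time, so a new query field shows up without anyone editing agent code.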

Lesson 2: Multi-Agent Beats Mega-Agent

The naive design is a single big agent with all the tools wired in. We tried it. It works in demos. It does not work in production for three reasons:

Reasoning quality degrades as the tool inventory and prompt context grow.
You cannot upgrade one capability without re-validating everything.
Specialized agents can be tuned and secured independently. Search is a different problem than schema validation, which is a different problem than log triage.

That's why we built Agentic Rubrik AI as a flexible multi-agent system. A root agent classifies the customer's request and delegates it to one of four sub-agents (with more planned in the future):

Content Agent: Retrieval-Augmented Generation (RAG) over user guides, hardware guides, and KB articles. Answers architectural and best-practice questions.
API Agent: Discovers, validates, and executes GraphQL operations against RSC. Powers fleet-wide aggregate queries, report generation, and any answer grounded in real APIs.
Skills Agent: Finds the right authored playbook for troubleshooting, runbooks, and operational tasks.
Log Agent: Securely queries backend logs across the platform. Gives troubleshooting workflows the diagnostic depth customers used to need a support engineer for.

The root agent owns orchestration; the sub-agents own depth. Have a multilingual request? The root agent handles language at the conversation layer while the sub-agents work in their native modality. Need a complex error diagnosis? The Skills Agent fetches the playbook, the API Agent pulls the live context, and the Log Agent confirms what the backend actually saw.
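The delegation pattern can be sketched in a few lines. This is an illustration only: a keyword table stands in for the LLM classifier a real root agent would use, and `ROUTING_HINTS` and `classify` are hypothetical names. The four enum members mirror the sub-agents listed above.

```python
from enum import Enum


class SubAgent(Enum):
    CONTENT = "content"
    API = "api"
    SKILLS = "skills"
    LOG = "log"


# Hypothetical routing hints. A production root agent would use an LLM
# classifier; keywords stand in here to keep the sketch runnable.
# The Log Agent has no hints: in this sketch it is reached only from a skill.
ROUTING_HINTS = {
    SubAgent.SKILLS: ("troubleshoot", "error", "recover", "runbook"),
    SubAgent.API: ("how many", "list", "report", "which"),
    SubAgent.CONTENT: ("how do i", "best practice", "architecture"),
}


def classify(request: str) -> SubAgent:
    """Return the first sub-agent whose hints match, defaulting to Content."""
    text = request.lower()
    for agent, hints in ROUTING_HINTS.items():
        if any(hint in text for hint in hints):
            return agent
    return SubAgent.CONTENT


print(classify("How many snapshots failed in 24 hours, grouped by object type?"))
```

Because routing is a single function with a small, closed set of outputs, it can be evaluated on its own before any sub-agent runs.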

Lesson 3: Don't Build the Runtime, Pick the Right Framework

Going multi-agent forced us to face an immediate choice: should we build the runtime ourselves or stand on something existing? Coding and testing agentic interactions (agent scope, LLM loops, context management, when to invoke HITL) is subtle, fast-moving work. The broader AI ecosystem was iterating on it faster than any in-house team could. So we chose to build on an existing framework.

Four objectives drove the choice: no vendor lock-in, faster development & experimentation with orchestration patterns, debug and eval as first-class concerns, and staying current with best practices. Python was the language since it's where the agentic ecosystem lives.

We evaluated the leading frameworks and picked the one that gave us four things:

Clean primitives, no magic: Compose what you need; nothing hidden.
Sub-minute iteration in a live playground: Changes visible in seconds.
Fast release cadence: Every release closes a real gap.
Vendor and model agnostic: Stay as vendor-neutral and flexible as possible.

Underneath those headlines, four capabilities did the heavy day-to-day lifting: developer tooling, tracing and debuggability, session and memory management, and built-in evals. Each would have required months of in-house work.

The payoff is that the Rubrik AI platform team spends its time on what's uniquely Rubrik: skills, GraphQL discovery, and human-in-the-loop confirmations. We’re not distracted by agent-runtime primitives. As other RSC teams adopt the same framework for their own features, the framework becomes a shared substrate; their improvements compound for us.

Lesson 4: Keep the Agent Count Small and Compound Through APIs and Skills

The most tempting mistake in a multi-agent system is to spin up a new agent for every new domain. If you add a new product area, you add a new agent. If you need a new troubleshooting flow, you add a new agent. Six months in, you have forty still-immature agents and a routing layer no one fully understands.

We made the opposite bet. Keep the agent count small. Make each one genuinely powerful. Push all capability growth through APIs and Skills instead.

That's why Rubrik AI has four sub-agents and not forty. Each is a general-purpose specialist. The API Agent can execute any GraphQL operation the platform exposes. The Content Agent can retrieve from any document we index. The Skills Agent can load any authored playbook. The Log Agent can query any in-scope log source. None of these agents are hardcoded to a specific Rubrik product, domain, or workflow.

This is what makes the dual-axis bet from Lesson 1 actually work:

A new GraphQL API ships → the API Agent picks it up. No new agent.
An engineer writes a new skill → the Skills Agent loads it on match. No new agent.
A new KB article lands → the Content Agent retrieves it. No new agent.

Every one of those is a capability multiplier that costs the platform team nothing.

The discipline this requires is mostly saying no. Every quarter, someone proposes a VMware Agent or a Threat Response Agent. Almost every time, the right answer is: “That's not a new agent, it's a new skill, a new API, or a new doc. Build it on the foundation we already have.”

The few times we have added agent-level capability, it has been because a genuinely new tool class appeared. Domain workflows never qualify.

Compounding doesn't come from agent proliferation. It comes from making the existing agents better and letting APIs and Skills do the growing.

Lesson 5: Skills are How Rubrik AI Scales Beyond the Platform Team

This is the lesson I find myself talking about most often, because it's the one people underestimate.

A multi-agent architecture gets you a smart generalist. It does not get you a domain expert in VMware backups, or Oracle 19c Data Guard, or M365 onboarding, or GREYMIRROR threat triage. No platform team, no matter how large, can encode every product team's tribal knowledge into prompts.

So we built skills.

A Rubrik AI skill is a folder containing a SKILL.md file with metadata and step-by-step instructions: a domain-specific procedure, often with embedded domain knowledge, that teaches the Rubrik AI agent how to handle a task it can't figure out on its own. We deliberately aligned on the same open Agent Skills format that Claude Code uses, adapted for customer-facing workflows. Key characteristics of Rubrik AI skills are:

The consumer is the agent, not a human. Skills are written for an LLM with tools, not as docs for engineers. Authors only add what the agent couldn't figure out alone — the steps, the domain context, the decision rules.
Procedure + domain knowledge. Some skills are pure procedure. Others encode hard-won expertise ("if error code RBK20053 and snapshot count > N, the cause is…"). Both ship.
Loaded on demand. If a skill matches, its instructions enter the agent's context and are followed step by step. If not, Rubrik AI falls back to general capabilities. No context bloat or pollution.
High-fidelity execution. The agent is instructed to execute every step in sequence without skipping ahead.
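As a rough illustration of "loaded on demand", the sketch below parses a SKILL.md string and adds the full instructions to the context only when that skill has been matched. It assumes the file starts with a `---`-delimited frontmatter block carrying `name` and `description` keys; the sample skill text, the `Skill` class, and both function names are hypothetical. Keeping the short descriptions always visible is also an assumption of this sketch, not a statement about how Rubrik AI does it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Skill:
    name: str
    description: str
    instructions: str


def parse_skill(text: str) -> Skill:
    """Split a SKILL.md string into frontmatter metadata and the instruction body."""
    _, frontmatter, body = text.split("---", 2)
    meta = {}
    for line in frontmatter.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return Skill(meta["name"], meta["description"], body.strip())


def build_context(skills: List[Skill], matched: Optional[str]) -> str:
    """List short descriptions for all skills; add instructions only for the match."""
    lines = [f"- {s.name}: {s.description}" for s in skills]
    for s in skills:
        if s.name == matched:
            lines.append(f"\n# Active skill: {s.name}\n{s.instructions}")
    return "\n".join(lines)


# Hypothetical skill file, invented for this sketch.
SAMPLE = """---
name: troubleshoot-vmware-rbk20053
description: Troubleshoot VMware backup failure RBK20053. Not for other error codes.
---
Step 1: Confirm the event code.
Step 2: Check recent snapshot failures.
"""

skill = parse_skill(SAMPLE)
print(build_context([skill], matched=skill.name))
```

When nothing matches, `build_context([skill], matched=None)` returns only the one-line description, which is how the context stays small.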

The structural payoff is enormous: skills turn Rubrik AI into a platform that the entire organization contributes to. Now, an Oracle engineer writes the Data Guard recovery runbook once. A support engineer who has handled the same VMware backup failure fifty times encodes the playbook once. A threat researcher encodes the GREYMIRROR triage procedure once. A PM encodes the M365 onboarding sequence once. After that, every customer who hits that issue gets the expert's answer instantly: no escalation, no queue, no context switch for the human expert.

Multi-agent gets you a smart generalist. Skills get you dozens of domain experts, each contributing the specific procedure they know cold.

Lesson 6: Skills Are Cheap to Write, Expensive to Get Wrong

Skills directly shape what customers see. That makes them powerful, and it means we cannot let just anything ship. From day one, we treated skills as production code, not documentation. Three things matter:

Find the right opportunity: The best skill candidates come from your team's daily reality: support tickets, CFDs, on-call incidents, error codes you've diagnosed three times this quarter, multi-step workflows customers get stuck on. Scope to a specific task — "VMware troubleshooting" is too broad; "Troubleshoot VMware backup failure RBK20053" is the right granularity.

Respect the boundaries: Skills are procedures, not scripts. They cannot shell out or run code; if your procedure needs a script, the underlying capability has to be exposed as a GraphQL API first. Read queries are discoverable, but actions must be specified by exact operation name and pre-allowlisted before Rubrik AI will execute them (more on this decision in Lesson 9). Skills don't add new tools or data sources; they teach Rubrik AI how to use what it already has. Since every line competes for context window space, concise wins.

Enforce quality mechanically: This is non-negotiable. Every skill goes through:

A precise description: What it does, when to use it (trigger phrases), and what it excludes. Vague descriptions over-match; over-matching erodes trust in the entire skills system.
Trigger evals (required): Queries that should and should not fire the skill. Target: 100% pass rate.
Impact evals (required): The most important gate. We run the same customer question with and without the skill, and an LLM judge scores both responses. If the skill doesn't measurably improve outcomes, it doesn't ship.
An automated eval pipeline: Runs continuously; skills that regress are excluded from production until fixed. Tune how often the pipeline runs to fit your cost budget.
Code review: Against authoring conventions, with a skill creator guiding authors to meet the bar before they submit.

The point of impact evals deserves to be underlined. Trigger accuracy is necessary but not sufficient. Without an outcome gate, the skills library would inevitably bloat with well-intentioned procedures that don't actually help. Outcome gating is what keeps the library valuable on average as it grows.
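The two gates can be expressed as a small check. This is a sketch under stated assumptions: judge scores are taken to be numbers on some fixed scale, and `min_gain` is a hypothetical threshold standing in for "measurably improve". The function names are illustrative.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple


def trigger_pass_rate(cases: Sequence[Tuple[str, bool]],
                      fires: Callable[[str], bool]) -> float:
    """Fraction of queries where the skill fired exactly when it should."""
    passed = sum(1 for query, expected in cases if fires(query) == expected)
    return passed / len(cases)


def impact_gain(with_skill: Sequence[float], without_skill: Sequence[float]) -> float:
    """Mean judge score with the skill minus mean judge score without it."""
    return mean(with_skill) - mean(without_skill)


def may_ship(trigger_rate: float, gain: float, min_gain: float = 0.5) -> bool:
    """Both gates must pass: perfect triggering and a measurable improvement."""
    return trigger_rate == 1.0 and gain >= min_gain


# Hypothetical trigger cases and judge scores, invented for this sketch.
cases = [
    ("Troubleshoot RBK20053 backup failure", True),
    ("How do I recover Teams personal chats?", False),
]
rate = trigger_pass_rate(cases, lambda q: "rbk20053" in q.lower())
gain = impact_gain([4, 5, 4], [2, 3, 3])
print(may_ship(rate, gain))
```

Keeping the two checks separate makes the failure mode visible: a skill that triggers perfectly but adds no value fails on `gain`, not on `rate`.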

Lesson 7: Evals Everywhere Because If You Can't Measure It, You Can't Ship It

Every architectural decision in this post—multi-agent decomposition, skill-driven domain expertise, action gating, the four-tool toolbox—only holds up if you can prove the system is getting better over time. A multi-agent LLM system has too many moving parts and too much non-determinism to rely on intuition. The thing that lets us keep saying yes to growth is that nothing ships (and nothing stays shipped) without passing evals.

We deliberately invested in evaluation as a first-class platform, not as a per-feature afterthought. It runs at four levels:

1. Sub-agent evals: Each specialized agent (Content, API, Skills, Log) has its own eval suite tuned to what that agent is supposed to do well. The Content Agent is scored on retrieval quality and grounding. The API Agent is scored on schema validity, endpoint selection, and field selection. The Skills Agent is scored on whether it loads the right playbook for the right question. The Log Agent is scored on query construction and result relevance. When we change one agent, we know, at the agent level, whether we made it better or worse without that signal getting muddied by everything downstream.

2. Tool evals: The four underlying tools (LLM calls, GraphQL execution, log search, doc retrieval) each have their own eval harnesses. This is the layer most teams skip and it's the one that catches the gnarliest regressions. A model upgrade, an API schema change, a re-indexed doc corpus—any of these can silently shift behavior under the agents. Tool-level evals catch the shift before it propagates.

3. End-to-end evals: Sub-agent and tool evals tell you whether the parts work. E2E evals tell you whether the system works. We run real customer-style prompts through the full root-agent → sub-agent → tool path and score the final response. A skill can pass its trigger eval, the API Agent can pass its schema-validity eval, and the response can still be bad—wrong tone, wrong shape, wrong correlation. E2E catches that.

4. Skill evals: Every skill carries its own trigger and impact evals, as covered above. A skill must trigger correctly and measurably improve outcomes before it joins production.

The mechanics matter as much as the levels:

Automated pipeline: Evals run continuously, not just on PR. Skills that fall below the threshold are excluded from production until fixed.
LLM-as-judge for subjective outcomes: Where outcomes aren't binary (tone, helpfulness, grounding) a calibrated model-based judge scales to thousands of cases without bottlenecking on humans.
Cassettes for determinism: Recorded GraphQL and log responses let us replay scenarios across runs. Non-determinism becomes a property of the agent we're measuring, not noise from the environment around it.
Continuous production monitoring: Evals don't stop at deploy. Live response quality, latency, tool usage, skill match rates, and error rates are tracked on dashboards the platform team watches every day.
The cultural payoff: Evals turn opinions into data. Should we add this tool? Run the eval. Should we change the routing prompt? Run the eval. Is this new skill actually helping? Run the impact eval. Disagreements that used to take a week of debate now take an hour of running scores.
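The cassette idea from the list above can be sketched as a thin wrapper around whatever function sends a GraphQL request. Everything here is hypothetical: the `Cassette` class, the request-hashing scheme, and the fake backend are assumptions made for illustration, not the recording format used in production.

```python
import hashlib
import json
from typing import Callable, Optional


class Cassette:
    """Record GraphQL responses once, then replay them without touching the backend."""

    def __init__(self, transport: Optional[Callable[[str, dict], dict]] = None,
                 tape: Optional[dict] = None):
        self.transport = transport  # None means replay-only
        self.tape = tape if tape is not None else {}

    @staticmethod
    def _key(query: str, variables: dict) -> str:
        # Hash the full request so identical requests map to the same recording.
        payload = json.dumps({"q": query, "v": variables}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def execute(self, query: str, variables: dict) -> dict:
        key = self._key(query, variables)
        if key not in self.tape:
            if self.transport is None:
                raise KeyError("no recorded response for this request")
            self.tape[key] = self.transport(query, variables)
        return self.tape[key]


calls = []


def fake_backend(query: str, variables: dict) -> dict:
    """Stand-in for a live GraphQL endpoint; counts how often it is hit."""
    calls.append(query)
    return {"data": {"failedSnapshots": 3}}


recorder = Cassette(transport=fake_backend)
first = recorder.execute("query { failedSnapshots }", {"hours": 24})

replayer = Cassette(tape=recorder.tape)  # replay-only: no backend attached
second = replayer.execute("query { failedSnapshots }", {"hours": 24})
```

With the environment pinned this way, any difference between two eval runs comes from the agent rather than from the backend.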

Lesson 8: Pivot Fast When the Landscape Moves

The agentic AI landscape doesn't sit still for a quarter. The model you picked three months ago has a successor that's twice as capable for half the cost. The cloud AI platform you chose has features that didn't exist when you wrote the proposal. The sub-agent you designed around one approach can be substantially better with a different one.

The teams that win are the ones that re-decide quickly when the evidence changes.

A few pivots we've made inside the Rubrik AI platform alone:

Model swap: When a newer model's agentic reasoning surpassed what we were using, we changed models. Evals caught the gap and we shipped the swap that same week.
Infrastructure platform move: When a different cloud vendor became the cleaner path for our deployment posture, we didn't wait for the next planning cycle to change vendors.
API Agent redesign: With substantial accuracy improvements in schema discovery and query construction, the "good enough" version we built three months ago is gone; the current version doesn't share much code with it.

None of these were minor changes. Each touched production paths used by real customers. What made them feasible was the rest of the stack we'd already built. Multi-agent decomposition meant we could swap one sub-agent without re-validating others. The Evals Framework meant we could prove the new version was better, not just newer. The framework choice meant we weren't fighting our own runtime while shipping the change.

Don’t pivot for the sake of pivoting. But pivot when the eval says you should. A team that needs three months to swap a model runs a year behind the field. A team that needs three days has a compound advantage.

Lesson 9: Security & Human-in-the-Loop are Not Features, They Are the Foundation

The hardest conversations we had were not about models or RAG. They were about trust. Customers are asking Rubrik AI to read their entire fleet, generate reports that leave the platform, and execute threat workflows against their environment. We had to design so that the worst-case prompt cannot translate into the worst-case action.

The model has three layers: access control, human-in-the-loop on every action, and guardrails.

Layer 1 — Access Control: Rubrik AI talks to RSC exclusively through GraphQL. There is no service account with elevated permissions, no direct database access, no backdoor. Every API call Rubrik AI makes is subject to the same RBAC the customer themselves is subject to. If the customer cannot do it, Rubrik AI cannot do it. Every operation by Rubrik AI is clearly tagged so every action is auditable.

Log search has its own constraint: customers cannot directly ask Rubrik AI to grep logs. Log search is reachable only from authored skills and every query is scoped to base filters (deployment, account). A direct "search the logs for X" prompt is treated as an injection and refused.
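A minimal sketch of that constraint, assuming a query is represented as a dictionary of filters. The `Caller` type, the `via_skill` flag, and `build_log_query` are hypothetical names; the point is only that base filters are applied last and that a request not originating from a skill is refused.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Caller:
    account_id: str
    deployment: str
    via_skill: bool  # True only when an authored skill issued the request


def build_log_query(caller: Caller, terms: dict) -> dict:
    """Build a log query that is always scoped to the caller's account and deployment."""
    if not caller.via_skill:
        raise PermissionError("log search is reachable only from authored skills")
    # Base filters are merged last so skill-supplied terms cannot override them.
    return {**terms, "account": caller.account_id, "deployment": caller.deployment}


skill_caller = Caller(account_id="acct-1", deployment="prod-us", via_skill=True)
query = build_log_query(skill_caller, {"event": "RBK20053", "account": "someone-else"})
print(query)
```

Note that the attempt to pass a different `account` in `terms` is silently overwritten by the caller's own scope.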

Layer 2 — Human-in-the-Loop on Every Action: Every action Rubrik AI executes goes through explicit human approval. Every single action: every SLA assignment, every threat-hunt action, every onboarding step that modifies state pauses for the user to review and approve before it leaves Rubrik AI.

Around that core rule, we layered three reinforcing constraints:

No auto-discovery of actions: A new action does not automatically become a Rubrik AI capability. An engineer must explicitly author a skill that references it by exact name, and the action itself must be on an allowlist.
Preview before execute: Rubrik AI surfaces what the action will do, against which objects, and with which parameters. The user approves the exact action, not a fuzzy description of it.
Approval and execution are both audit-logged: Who approved what, when, against which environment, captured for every write.
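Taken together, the allowlist, preview, approval, and audit steps might look like the sketch below. The `ActionGate` class, the `assignSla` operation name, and the audit record fields are all invented for illustration. In a real system `approve` would render the preview to the user and `execute` would issue the GraphQL mutation under the user's own permissions.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Optional

# Hypothetical allowlist; "assignSla" is an invented operation name.
ACTION_ALLOWLIST = frozenset({"assignSla"})


@dataclass
class ActionGate:
    approve: Callable[[str], bool]        # shows the preview, returns the user's decision
    execute: Callable[[str, dict], dict]  # performs the write
    audit_log: list = field(default_factory=list)

    def _audit(self, user: str, operation: str, event: str) -> None:
        self.audit_log.append({
            "user": user,
            "operation": operation,
            "event": event,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def run(self, user: str, operation: str, params: dict) -> Optional[dict]:
        if operation not in ACTION_ALLOWLIST:
            raise PermissionError(f"{operation} is not allowlisted")
        # The preview carries the exact operation and parameters, not a summary.
        preview = f"{operation} {json.dumps(params, sort_keys=True)}"
        approved = self.approve(preview)
        self._audit(user, operation, "approved" if approved else "rejected")
        if not approved:
            return None
        result = self.execute(operation, params)
        self._audit(user, operation, "executed")
        return result
```

Nothing reaches `execute` unless the operation is allowlisted and the user has approved the exact preview, and both the decision and the execution leave an audit record.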

This costs us a little speed and provides a lot of safety. For an AI that operates on production data infrastructure, that is exactly the trade we want.

Layer 3 — Guardrails: On top of access control and HITL, Rubrik AI employs guardrails for prompt injection, topic adherence, prompt confidentiality, and hallucination reduction (everything Rubrik-specific is grounded in our docs).

We think of this as three layers of defense. Guardrails are the moat that turns back prompt-level attacks at the perimeter. RBAC is the castle wall, preventing Rubrik AI from reaching beyond what the customer's own permissions allow. HITL is the sentry at the inner gate: no action enters without a human saying “yes.”

Lesson 10: Ship Chat First, Then Earn UI

Rubrik AI today is text-in, text-out, and chat-only. There are no buttons, forms, or widgets in the Rubrik AI surface itself. The reason is simple: chat is the universal interface, and it's the only one that gracefully handles a customer switching from Japanese to a report request to an RBK-error diagnosis in three consecutive turns.

That said, pure text is not the destination. The next chapter for Rubrik AI is interactive UI elements alongside chat—forms for configuration, clickable actions, visual data displays for reports. Anything awkward in conversation should not stay there. Chat is the chassis; UI is what we'll bolt onto it now that the chassis is solid.

What I'd Tell Someone Building Their Own

Seven takeaways if you're standing where we stood eighteen months ago:

Pick architectural bets that let your product compound on more than one axis: Ours were "every API becomes a capability" and "every skill any team writes becomes a capability." APIs give you reach; skills give you depth. One axis stalls; two multiply. Find yours and protect them from short-term shipping pressure.
Decompose by specialization: Multi-agent systems are harder to set up and dramatically easier to evolve. The cost is paid once; the benefit is paid forever.
Stand on the right framework, not your own runtime: Agentic AI is moving fast enough that re-implementing primitives is a losing trade. Pick a framework with clean primitives, sub-minute iteration, prototype-equals-prod, and a fast release cadence, then spend your engineering on what's actually unique to your product.
Hold every AI dependency loosely: The model, the platform, and even the agent design all change faster than a quarter. Build the rest of your stack—multi-agent, eval-gated, framework-buffered—so that swapping any one piece doesn't require rebuilding the others. The team that needs three days to swap a model has a compound advantage over the team that needs three months.
Build a skills layer so the whole organization can contribute: Your platform team will never out-write the collective domain expertise of the rest of the company. Give them a format, a quality bar, and a one-command path from idea to shipped skill, then get out of the way.
Build evals before you build features: They are the only honest signal a non-deterministic system gives you. For new code and for code already in production, a simple rule: no eval, no ship.
Make security the foundation, not a checklist: Bound the blast radius with RBAC, keep a human in the loop for every write, and you can move fast on capability without losing sleep on safety.

Looking Forward

Capabilities are growing along both axes—every new API the platform ships, every new skill any team writes. The architecture we landed on, the skills ecosystem layered on top, and the eval platform underneath the whole thing are what let those two compounding curves run in parallel. It’s also what gives us the confidence to keep saying yes to growth without saying yes to risk.

One More Thing: Shout Out to the Team!

None of this would be possible without the relentless efforts of our incredible teams. Shout out to the following pros!

Rubrik AI Platform Team: Steve, Eli, Joel, Michael, Darry’elle, Sneha, Aman, Aanya, Lakshman, Trilok, Volodymyr

UX: Sai, Tianmi, Hema, Charlotte

Product: Van

The expertise and dedication of these team members have been instrumental in bringing Rubrik AI from a blank page to a production-ready system. Your commitment to excellence, deep domain knowledge, and relentless focus on quality is what drives our innovation. Thank you for being the heart of this success!
