When The Attacker's Clock And The Regulator's Clock Run At Different Speeds

What CPS 230 Asked Us to Do

CPS 230 came into force on 1 July 2025. At its core, it requires APRA-regulated entities to identify critical operations, set tolerance levels for maximum disruption, and demonstrate the ability to operate within those tolerances during severe but plausible scenarios. Maximum tolerable downtime, data loss, and degradation are no longer engineering targets buried in runbooks. They are board-attested numbers.

For New Zealand readers: CPS 230 is an APRA standard and does not directly bind NZ entities, but it reaches NZ in practice. Four of the five largest NZ banks are subsidiaries of APRA-regulated parents, and RBNZ's BS11 and the FMA's operational resilience work are converging on the same ground. The threat model argument that follows is regulator-agnostic.

The implicit model behind most tolerance statements looks roughly like this: an attacker gains initial access, spends days or weeks in reconnaissance, detection eventually occurs, and recovery follows. The tolerance window is sized to accommodate that sequence with a reasonable buffer.

That model has a problem. It assumes the attacker's clock runs at roughly the same speed as the defender's clock. AI-accelerated threat capability, now demonstrably available at frontier-model sophistication, compresses the attacker's reconnaissance, exploit development, and lateral movement into a single contiguous operation. The attacker arrives prepared in a way that previous threat models did not assume. The dwell-and-prepare phase that once gave defenders breathing room no longer behaves the way the tolerance maths assumed.

This is not an argument that CPS 230 is wrong. It is an argument that the inputs to CPS 230 tolerance-setting have shifted, and that Enterprise Architects, not just CISOs, are the people best positioned to recognise it. The CISO owns the threat picture. The EA owns the system that has to absorb it. The tolerance statements are operational commitments made to the board on behalf of the whole business.

Figure 1 below illustrates how the dwell-and-prepare phase has changed. What once required weeks of patient reconnaissance now collapses into a single AI-driven operation.

Where the Old Model Breaks Down

Picture a tabletop exercise at a financial services organisation. The team is sharp, the runbooks are current, the RTO targets are on the whiteboard. Then someone asks a single question.

"If a sophisticated attacker established access four months ago and has been waiting, which restore point do you use?"

The room goes quiet. In most resilience programmes, every backup in the retention window was taken after the attacker arrived. Restore from backup stops being a recovery strategy and starts being a re-infection strategy.

This is the scenario that a compressed dwell-and-prepare phase makes increasingly plausible. It exposes three foundational assumptions in most recovery programmes that are worth re-examining before the next attestation cycle.

Assumption 1: Your Retention Window Contains a Clean State

The traditional recovery model assumes that somewhere in the retention window there is a known-good restore point. Detection happens, the breach is dated, recovery works backward to a clean snapshot.

That model breaks when the attacker's presence predates the retention window, or when a patient attacker has deliberately avoided triggering detection thresholds long enough to poison the available snapshots. The architectural response is not simply longer retention, though that is part of it. The harder requirement is integrity assurance: the ability to interrogate backup content for indicators of compromise before restore, and to identify a known-clean state even when that state sits outside standard retention.

Practical questions worth stress-testing in any tabletop:

Can the recovery team identify the earliest possible date of attacker presence, not just the confirmed detection date?
Is there a process to verify backup integrity before restore begins, or does restore proceed on the assumption the backup is clean?
What is the actual time cost of that integrity verification, and is it accounted for in the tolerance number?
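The restore-point question can be sketched in a few lines of Python. This is a minimal illustration, not a description of any backup product: the `Snapshot` type, the `scan_clean` flag (standing in for the result of an indicator-of-compromise scan), the 90-day retention window, and the dates are all assumptions made for the example. It filters on the earliest possible date of attacker presence rather than the detection date.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    taken_at: datetime
    scan_clean: bool  # hypothetical result of an indicator-of-compromise scan


def pick_restore_point(
    snapshots: list[Snapshot], earliest_presence: datetime
) -> Optional[Snapshot]:
    """Return the newest snapshot that predates attacker presence and scanned clean."""
    candidates = [
        s for s in snapshots
        if s.taken_at < earliest_presence and s.scan_clean
    ]
    return max(candidates, key=lambda s: s.taken_at, default=None)


# Illustrative dates only.
detected = datetime(2025, 9, 1)
earliest_presence = detected - timedelta(days=120)  # the "four months ago" case

# Assumed 90-day retention window of daily snapshots ending at detection.
retained = [
    Snapshot(detected - timedelta(days=d), scan_clean=True) for d in range(90)
]

print(pick_restore_point(retained, earliest_presence))  # None
```

Every retained snapshot passes its scan, yet none qualifies, because all of them were taken after the attacker could have arrived. A `None` here is the tabletop finding: the known-clean state, if one exists, sits outside standard retention.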

Assumption 2: Recovery Speed and Recovery Confidence Are the Same Number

Most tolerance statements contain a number for maximum tolerable downtime. That number is almost always derived from how fast data can be moved, the mechanics of restore. What it rarely captures is the second, harder number: how long it takes to verify that the restored environment is untainted.

These are different metrics, and the gap between them is where real risk lives. A team that can restore a critical payments service in four hours, but requires eighteen additional hours to confirm no backdoor persists in the configuration and no malicious task survived the restore, has a true recovery time of twenty-two hours. The board paper says four. The operational reality says twenty-two.
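The arithmetic is simple enough to write down, which is part of the point. The sketch below uses the four-hour and eighteen-hour figures from the payments example above; the function and variable names are illustrative.

```python
from datetime import timedelta


def hours(duration: timedelta) -> float:
    """Express a duration in hours."""
    return duration.total_seconds() / 3600


def verified_recovery_time(restore: timedelta, verification: timedelta) -> timedelta:
    """Restore mechanics plus the time to confirm the environment is clean."""
    return restore + verification


restore = timedelta(hours=4)              # how fast the data can be moved
verification = timedelta(hours=18)        # confirming no backdoor or malicious task persists
attested_tolerance = timedelta(hours=4)   # the number on the board paper

actual = verified_recovery_time(restore, verification)
gap = actual - attested_tolerance

print(hours(actual))  # 22.0
print(hours(gap))     # 18.0
```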

Figure 2 makes this gap concrete. The attacker's clock now runs in seconds. The board-attested tolerance runs in hours. The actual recovery clock, accounting for detection, containment, investigation, eradication, and verified restoration, runs in days. The distance between those three clocks is the honest risk position.

The discipline is to establish both numbers explicitly and present the delta to the board before an incident forces the conversation. The choice of whether to restore a potentially tainted recent snapshot versus a verified-clean older one is a high-pressure decision that should be resolved on paper during a calm quarter, not during an active incident.

A useful reframe for tabletop success criteria: rather than measuring whether the service was restored within RTO, measure whether the service was restored within RTO and the environment was verified clean. The second criterion surfaces gaps the first one consistently hides.

Assumption 3: Identity Is a Security Problem, Not a Recovery Dependency

Of the three, this is the most consistently underweighted in recovery architecture reviews.

When the Tier 0 directory is compromised, the entire recovery path is invalidated. An attacker with persistent access to the identity layer can re-establish presence in a restored environment before the recovery team finishes the post-incident report. Recovery runbooks that do not account for identity integrity are runbooks for restoring infrastructure into a compromised trust fabric.

The architectural requirement is specific: an isolated, integrity-assured copy of identity infrastructure must exist outside the production identity plane. Not as a security project managed in a separate workstream, but as a first-class recovery dependency with its own RTO, its own test cadence, and its own attestation.

The audit that reveals this gap most quickly is to lay out the identity providers, control planes, and endpoint agents of the production and recovery environments side by side. Every shared component is a concentration risk. If a compromise in the wrong place takes out both environments simultaneously, the tolerance statement is not achievable regardless of what the runbook says. CPS 230's material service provider provisions are designed to surface exactly this, and the current threat model makes it sharper.
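As a rough sketch, the side-by-side comparison reduces to a set intersection per layer. The layer names follow the paragraph above; the component names (`corp-directory`, `cloud-tenant-a`, `edr-agent` and so on) are placeholders invented for the example, and a real inventory would come from a CMDB or equivalent.

```python
def shared_components(
    production: dict[str, set[str]], recovery: dict[str, set[str]]
) -> dict[str, set[str]]:
    """Return, per layer, the components that both environments depend on."""
    overlap = {}
    for layer in sorted(production.keys() & recovery.keys()):
        common = production[layer] & recovery[layer]
        if common:
            overlap[layer] = common
    return overlap


# Hypothetical inventories; every component name here is a placeholder.
production = {
    "identity_provider": {"corp-directory"},
    "control_plane": {"cloud-tenant-a"},
    "endpoint_agent": {"edr-agent"},
}
recovery = {
    "identity_provider": {"corp-directory"},
    "control_plane": {"cloud-tenant-b"},
    "endpoint_agent": {"edr-agent"},
}

for layer, components in shared_components(production, recovery).items():
    print(f"{layer}: shared {sorted(components)}")
```

In this made-up inventory the control planes are separate, but the identity provider and the endpoint agent are common to both environments. Each flagged layer is a candidate concentration risk to walk through in the review.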

A Different Kind of Tabletop

The scenario that surfaces all three assumptions together is worth running deliberately. Select one critical service, such as payments processing or claims settlement. Assume an attacker has been present for ninety days. Challenge the recovery team with four questions:

Can you identify a restore point that predates attacker presence, and how long does that determination take?
Once you have a candidate restore point, how long does integrity verification take before you would bring the service live?
Is the identity layer independently recoverable, or does service recovery depend on identity recovery first?
What is the total elapsed time from the decision to recover to service live in a verified-clean state, and how does that compare to the tolerance number on the board paper?

The gap between the answer to question four and the attested tolerance is the honest architectural finding. In most organisations running this exercise for the first time, that gap is larger than expected. It is also a more useful output than any formal audit, because it produces a specific number the board can act on.
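One way to turn the four answers into that number is sketched below. It rests on loudly stated assumptions: the phases of the service path run one after another, identity recovery runs in parallel only when the service does not depend on it, and the six-hour and twelve-hour durations are placeholders rather than measurements. The four-hour restore, eighteen-hour verification, and four-hour board figure are carried over from the earlier payments example.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class TabletopAnswers:
    restore_point_determination: timedelta  # question 1
    integrity_verification: timedelta       # question 2
    identity_recovery: timedelta            # question 3
    identity_is_prerequisite: bool          # question 3
    service_restore: timedelta              # the restore mechanics themselves

    def elapsed(self) -> timedelta:
        """Question 4: decision to recover through to verified-clean service."""
        service_path = (
            self.restore_point_determination
            + self.service_restore
            + self.integrity_verification
        )
        if self.identity_is_prerequisite:
            # Assumption: identity must be recovered before the service path starts.
            return self.identity_recovery + service_path
        # Assumption: independent identity recovery can run in parallel.
        return max(self.identity_recovery, service_path)


answers = TabletopAnswers(
    restore_point_determination=timedelta(hours=6),  # placeholder
    integrity_verification=timedelta(hours=18),
    identity_recovery=timedelta(hours=12),           # placeholder
    identity_is_prerequisite=True,
    service_restore=timedelta(hours=4),
)
attested_tolerance = timedelta(hours=4)

gap = answers.elapsed() - attested_tolerance
print(answers.elapsed().total_seconds() / 3600)  # 40.0
print(gap.total_seconds() / 3600)                # 36.0
```

Flipping `identity_is_prerequisite` to `False` in this sketch brings the elapsed time down to 28 hours, which is one way to show a board what an independently recoverable identity layer is worth.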

The Urgency Argument

Enterprise Architects in ANZ financial services have a finite window in which defensive architectural decisions are disproportionately valuable. That window closes as AI-accelerated offensive capability fully diffuses and the asymmetry that currently exists between well-resourced defenders and the broader threat landscape narrows.

The work is not exotic. Compressing recovery time, hardening identity, building integrity assurance into the restore path, and ensuring recovery paths are genuinely independent of production paths: CPS 230 already asks for most of this. What the current threat model changes is the urgency and the tolerance maths.

The conversation worth having with the board is not whether the organisation is AI-threat ready. That question is malformed and no one can answer it credibly. The right question is: given what we now know about how fast an exploit window can close, are the tolerance numbers attested to last cycle still the right ones, and if not, what changes in the architecture before the next attestation?

That is a question only the Enterprise Architect can credibly lead.

Contributed by

Niraj Naidu

Senior Manager, Sales Engineering ANZ, Rubrik

Niraj brings over 24 years in IT to building and scaling teams and Enterprise Architecture practices across the full technology lifecycle. He has a proven track record in architecture roadmapping, policy development and standards alignment, in lifting client satisfaction while lowering organisational risk, and in translating strategy into solutions that add real value and stability. Today his focus is cyber security and resilience, helping organisations protect their data and recover fast when attacks get through. A technologist, leader and strategist, he helps customers improve growth, scale and technology alignment so they can realise the full value of their current and future investments.
