When the backup tapes are in the boot of the car: How we cut DR from 48 hours to 30 minutes in a regional production environment

Our setup, and why it matters

Brown Family Wine Group runs a lot more than a winery. Across multiple sites in regional Victoria, Australia, the business spans agriculture, manufacturing, distribution, warehousing, and hospitality. Our primary production environment sits in Milawa, a small town three hours from Melbourne, deep in the King Valley wine country.

For us, keeping mission-critical systems on-premise is the right call. Manufacturing automation, ERP, distribution systems: these need to be local and reliable. Latency matters on a production floor. So our core infrastructure stays on-premise, and our DR site is three hours away, connected over a 100 mb WAN link.

What DR actually looked like before

On a bad day, a real incident, here’s what would happen: we’d retrieve half a dozen tapes from off-site storage, load them into the car with one or two members of the technical team, and drive three hours down the highway to our remote DR location. Once there, we’d power up the equipment and start cataloguing the tapes. That process alone could take several hours. You’d be working out: is this the right set? Is every tape readable? Once the catalogue was built, you could start your recovery: select the first server, run the job, wait for it to complete, stand it up, move to the next one. Iterate through a few dozen machines. At the end, you’d have an environment you could test and, eventually, return to production.

And that’s if everything went to plan. In a Windows environment, restoring a server isn’t just a single step. You’d restore the image, then work through a series of manual configuration checks. Active Directory recovery. Service dependencies. Sequencing. There was real room for error, because you were essentially following a script that relied on no one missing a step. We’d discover issues with a backup set after several hours of work and have to start again.

The result: we tested DR maybe twice a year. Not because we were cavalier about it, but because the process was genuinely painful and resource-intensive enough that doing it more often was hard to justify.

A hardware refresh and a harder question

There was no dramatic near-miss that forced our hand. No ransomware event, no audit finding, no crisis. It was a hardware refresh cycle, and we asked ourselves the question we’d been asking for years: is there now something out there that actually fits our reality?

We knew what we were looking for. Something that could replicate efficiently over constrained bandwidth, sending only what had changed rather than full data sets across a thin pipe. Something that made DR testing genuinely repeatable, not just theoretically possible. Something that didn’t require a team of specialists or significant wrapper automation to operate. And something that would give a small IT team the same quality of recovery capability that larger enterprises with parallel environments take for granted.

When we evaluated Rubrik, the answer became clear pretty quickly. The technology fit how we operate. Global deduplication and incremental-only replication meant it worked within our WAN link rather than fighting it, without requiring significant additional OPEX on bandwidth infrastructure that, in many cases, wasn’t even physically available from the telcos servicing our area. But beyond the bandwidth question, it gave a small IT team the kind of DR capability we’d always wanted and never quite had.

What we built and how it works

Our environment is primarily on-premise, virtualised on VMware. Rubrik protects every workload, there’s nothing it doesn’t cover. That includes mission-critical systems that need to remain on-premise: manufacturing automation, ERP, distribution and warehousing systems. Because we span agriculture, manufacturing, distribution, warehousing, and hospitality across a single business, those systems are deeply interdependent. When they go down, parts of the business stop. Full stop.

We’re running a Rubrik R7000 series appliance protecting 58 VMs across the environment. Replication to our DR site is configured with the DR site set as a replication partner to production, with bandwidth throttling scheduled to run high during evening hours and low during business hours. That scheduling alone made a meaningful difference in how much replication we could push across a constrained link without impacting production operations.

A few things made an immediate practical difference:

Orchestrated recovery removed the human error surface. The manual scripted process we used to follow has been replaced by Rubrik’s Orchestrated Recovery. Critically, the platform handles global deduplication and incremental replication across our constrained WAN link, meaning it’s only ever sending changed data across the pipe, not full images. The runbooks handle the sequencing, the configuration steps, the dependency ordering. We pre-configured boot orders and dependencies upfront, so what used to be a manual, error-prone script is now a repeatable workflow. We still do our own checks after recovery: logging into each machine, verifying stability, manually starting specific services in sequence to confirm application dependencies are met. But that’s a last-mile validation step, not the core process. Rubrik gets you 95 to 99 percent of the way there out of the box.

We have a single Orchestrated Recovery plan covering 34 core production servers, structured into five priority groups based on our previous tape restoration order, just considerably faster.

DR testing became something we actually do. We went from twice-yearly tests, because the process was too painful to do more often, to running DR exercises on a monthly cadence. That shift alone changes your relationship with your recovery capability. You’re not hoping your runbook still works when you need it. You know it does, because you just ran it.

The recovery window is now 30 minutes, not 48 hours. That number deserves context. We’re not spinning up in 30 minutes because we have exceptional hardware or unlimited bandwidth. We’re doing it because the architecture is designed to work with the bandwidth we have. That’s the part that matters for organisations in similar situations.

What makes the 30-minute figure possible is RTO optimization configured to keep a hot standby of each VM on the SAN datastores at the DR site. That means the recovery job isn’t waiting on network transfer time during the event itself. The data is already there. The 30 minutes covers spinning up all 34 VMs in sequence, after which the team runs the usual post-restoration checks: application service verification, AD replication confirmation, and database integrity checks against each server.

Patch testing without spinning up cloud infrastructure. This was an unexpected benefit. Our Orchestrated Recovery workflow restores all 34 VMs onto an isolated VLAN within the DR vSphere cluster, giving us a fully air-gapped sandbox before any changes touch the live environment. In larger organisations, teams have parallel development environments, spare hosts, or cloud subscriptions for this kind of testing. We have a subset of that. What this setup allows us to do is spin up and tear down environments in 30 minutes: grab the most recent backup of a production system, apply a critical patch, run user acceptance testing in isolation, and tear it down once we’re confident. No new cloud subscription. No idle host sitting around after the test. Just fast, on-premise environment management that a mid-market team can actually use.

Cloud Vault replaced the tapes entirely. While our core ERP systems remain on-premise for latency and reliability reasons, we now use Rubrik Cloud Vault as our immutable air-gap. It gives us a clean, immutable copy of our data offsite without the physical handling, the cataloguing, and the three-hour drive.

Three practical tips for IT teams in the same situation

If you’re running IT in a mid-market or regional organisation, constrained bandwidth, distributed sites, a small team, mission-critical systems you can’t afford to lose, here’s what we’ve learned:

1. Prioritise orchestrated recovery over manual sequencing. Move beyond scripts and custom dependency mapping. Aim for a system where the vast majority of recovery tasks execute automatically, and your team is doing last-mile checks rather than running the whole process. The difference in DR test frequency alone justifies the shift.

2. Architect for your real-world bandwidth, not the theoretical one. Build your replication strategy around your existing pipe. Global deduplication and incremental-only replication aren’t nice-to-haves when you’re working with 100mb. They’re what makes the whole thing possible.

3. Test more often than feels necessary. The reason we only tested twice a year wasn’t negligence. It was that testing was expensive and disruptive. When the test process becomes lightweight, you test more. When you test more, you find problems before they become incidents. Monthly DR exercises have changed our confidence in our recovery capability in a way that nothing else could.

Closing thoughts

Businesses operating in bandwidth-constrained, geographically distributed environments have historically had to accept a lower standard of DR capability because the tooling wasn’t designed for them. That’s changing. Rubrik has given a regional wine business with a small IT team the kind of recovery confidence that used to require a lot more infrastructure and a lot more budget.

If you’re in a similar situation and you’re still relying on a process that involves tapes and a long drive, or DR tests that happen twice a year because anything more frequent is just too painful, it’s worth taking a fresh look at what’s now possible.

Contributed by

Pedro Silva

Information Systems & Security Manager, Brown Family Wine Group

IT infrastructure and cyber security professional with 30+ years experience across business systems, networking, storage, servers, and enterprise infrastructure. Thrives in building and managing reliable environments, solving complex technical problems, and keeping critical systems running securely and efficiently. Enjoys getting hands-on with technology and finding practical solutions that actually work.

Damien O'Brien

Information Systems Administrator, Brown Family Wine Group

Information Systems Administrator with over 20 years’ experience in infrastructure, virtualization, and data protection. Focused on practical solutions that enhance backup, recovery, and overall system resilience.

Radhika Rangarajan

Director, Customer Advocacy, Rubrik

Radhika Rangarajan leads Practitioner Customer Advocacy at Rubrik, championing the technical customer and connecting practitioner insights to product direction. A co-founder of Women in Big Data, she has spent 20+ years building products, partnerships, and communities across cloud, data, and AI.