When a CPU Instruction Lies: Debugging Silent Data Corruption in the Cloud

In this blog

The Symptom: Undecryptable Data
The Hunt Begins: Auditing the Code
Turning Point: Assert the Impossible
Narrowing to a Node
Catching the Faulty Machine
Down to the Instruction
Why Didn't Standard CPU Tests Catch This?
Recovery: The Silver Lining
Lessons Learned

Imagine this: your encryption code is correct, your tests pass, your keys are valid—but decryption fails anyway.

For more than a year, we chased a ghost inside Rubrik's cloud-native storage engine, Colossus. The error was always the same: cipher: message authentication failed. But the cause eluded every audit, race detector, and stress test we threw at it.

When we finally pinned it down, the culprit wasn't a software bug at all. It was a single x86 CPU instruction, PCLMULQDQ, sometimes returning the wrong answer on one specific cloud virtual machine. It was a carry-less multiplication that was doing a ghost carry.

This is the story of how we found this corruption (caused by defective hardware, not Rubrik software), and how root-causing it down to a single CPU instruction, together with our architecture, helped us recover everything.

The Symptom: Undecryptable Data

Colossus encrypts data at rest using AES-GCM (Advanced Encryption Standard in Galois/Counter Mode). For compliance and security reasons, we use a standard Go crypto library that leverages hardware-accelerated CPU instructions.

Starting around April 2024, we began seeing sporadic cipher: message authentication failed errors. A small percentage of encrypted Data Encryption Keys (DEKs) and data packs would consistently fail to decrypt; in other words, some data had effectively been corrupted.

The failures also clustered in a small number of environments, which made the pattern even more confusing.

While most of the cases were indirectly salvageable, this was not something we could let continue.

The Hunt Begins: Auditing the Code

We started where any engineer would: Logs!

As we dug through logs and customer-filed defects, we discovered that the corruption was not limited to DEKs. Data packs were affected too. This was the first real clue: two independent code paths, both using AES-GCM, both producing the same cipher: message authentication failed error. Either there were two separate bugs producing identical symptoms or the problem was in the shared layer.

Then we tried to audit the shared layers of code. There were two primary suspects:

The storage layer: where every byte of data passes through serialization, compression, and network boundaries before being persisted
The key management layer: which handles encryption key hierarchies (encrypting keys with other keys, rotating them, caching them, etc.)

We went line by line through both, and more. We ran every test with Go's -race flag and wrote new fuzz tests, hunting for any race that might corrupt in-memory keys or ciphertext. We added new targeted unit tests around every transition where data could plausibly be mutated in flight. We found no smoking gun, but we did make improvements along the way:

We found a number of "benign" data races. No data race is truly benign, of course, so we fixed them. See: https://groups.google.com/g/golang-nuts/c/EHHMCdcenc8
We removed the double-checked locking pattern, which is subtly broken. See: https://go.dev/ref/mem
We added checksums in our in-memory caches to reduce the chance of an undetected in-memory corruption.
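As an illustration of the last item, a checksummed cache entry might look like the sketch below. The Cache type, its methods, and the choice of CRC-32C are assumptions made for this example, not Colossus code; the idea is only that every read re-verifies the bytes against a checksum taken at write time.

```go
package main

import (
	"errors"
	"fmt"
	"hash/crc32"
)

// castagnoli is the CRC-32C table; the choice of CRC-32C is an assumption.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// ErrCorrupt is returned when cached bytes no longer match their checksum.
var ErrCorrupt = errors.New("cache: checksum mismatch")

// entry pairs a private copy of the cached bytes with their checksum.
type entry struct {
	data []byte
	sum  uint32
}

// Cache is a hypothetical in-memory cache (not safe for concurrent use).
type Cache struct {
	items map[string]entry
}

func NewCache() *Cache { return &Cache{items: make(map[string]entry)} }

// Put stores a copy of value together with a checksum taken at write time.
func (c *Cache) Put(key string, value []byte) {
	cp := append([]byte(nil), value...)
	c.items[key] = entry{data: cp, sum: crc32.Checksum(cp, castagnoli)}
}

// Get re-verifies the checksum on every read and returns a copy of the data.
func (c *Cache) Get(key string) ([]byte, error) {
	e, ok := c.items[key]
	if !ok {
		return nil, fmt.Errorf("cache: %q not found", key)
	}
	if crc32.Checksum(e.data, castagnoli) != e.sum {
		return nil, ErrCorrupt
	}
	return append([]byte(nil), e.data...), nil
}

func main() {
	c := NewCache()
	c.Put("dek-1", []byte("wrapped key bytes"))

	e := c.items["dek-1"]
	e.data[0] ^= 0x01 // simulate a stray in-memory bit flip

	_, err := c.Get("dek-1")
	fmt.Println(err)
}
```

A check like this narrows the window in which a corrupted cached value can be used silently, but it does not remove it.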

None of those bugs, however, could explain the production failures under any sequence of events.

It was starting to look like the problem was in the lowest shared layer: the BoringSSL library itself.

Turning Point: Assert the Impossible

Despite all of the auditing and attempts to reproduce, nothing we found could explain the production failures. We needed a different approach: instead of trying to reason about what could go wrong, we would put assertions in production that would tell us, the moment something did go wrong, exactly where it happened.

We added two assertions to the DEK encryption hot path:

1. The first one would catch any non-determinism in encryption or decryption itself: immediately after encrypting a DEK, decrypt it back. If decryption fails, then either encryption produced a bad ciphertext, or decryption is broken on this machine, or both.

2. The second assertion would catch storage-layer corruption: after persisting the encrypted DEK, read it back from storage and try to decrypt that. If the first assertion passed but the second one failed, the problem was in the bytes that touched the disk.

In pseudo-code:

encrypted := Encrypt(plaintext, key)
_, err := Decrypt(encrypted, key)
Assert(err == nil) // (1) catches non-determinism in Encrypt/Decrypt

Persist(encrypted)
fetched := Read()
_, err = Decrypt(fetched, key)
Assert(err == nil) // (2) catches storage-layer corruption

The plaintext was copied (not referenced) before being passed to Encrypt, so by construction we could rule out an in-flight race on the input data.
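For readers who want something runnable, the two assertions can be sketched with Go's standard crypto/cipher AEAD interface. This is a minimal illustration, not the production code: Store, memStore, newGCM, encryptChecked, verify, and persistChecked are hypothetical names, the store is an in-memory map, and the sketch goes one step further than the pseudo-code by also comparing the decrypted bytes with the input.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// Store is a hypothetical stand-in for the persistence layer.
type Store interface {
	Persist(id string, b []byte) error
	Read(id string) ([]byte, error)
}

// memStore is a toy in-memory Store used only for this example.
type memStore map[string][]byte

func (m memStore) Persist(id string, b []byte) error {
	m[id] = append([]byte(nil), b...)
	return nil
}

func (m memStore) Read(id string) ([]byte, error) {
	b, ok := m[id]
	if !ok {
		return nil, errors.New("store: not found")
	}
	return b, nil
}

// newGCM builds an AES-GCM AEAD for a 16-, 24-, or 32-byte key.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// verify opens nonce||ciphertext||tag and compares the result with want.
func verify(aead cipher.AEAD, sealed, want []byte) error {
	n := aead.NonceSize()
	if len(sealed) < n {
		return errors.New("sealed message too short")
	}
	got, err := aead.Open(nil, sealed[:n], sealed[n:], nil)
	if err != nil {
		return err
	}
	if !bytes.Equal(got, want) {
		return errors.New("decrypted bytes differ from input")
	}
	return nil
}

// encryptChecked seals a private copy of plaintext and immediately opens the
// result again (assertion 1). The nonce is prepended to the returned bytes.
func encryptChecked(aead cipher.AEAD, plaintext []byte) ([]byte, error) {
	input := append([]byte(nil), plaintext...) // copy, so callers cannot race on it
	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	sealed := aead.Seal(nonce, nonce, input, nil)
	if err := verify(aead, sealed, input); err != nil {
		return nil, fmt.Errorf("assertion 1 (encrypt/decrypt): %w", err)
	}
	return sealed, nil
}

// persistChecked stores sealed, reads it back, and opens what was read
// (assertion 2).
func persistChecked(aead cipher.AEAD, s Store, id string, sealed, want []byte) error {
	if err := s.Persist(id, sealed); err != nil {
		return err
	}
	fetched, err := s.Read(id)
	if err != nil {
		return err
	}
	if err := verify(aead, fetched, want); err != nil {
		return fmt.Errorf("assertion 2 (storage read-back): %w", err)
	}
	return nil
}

func main() {
	aead, err := newGCM(make([]byte, 32)) // all-zero demo key, never use in production
	if err != nil {
		panic(err)
	}
	dek := []byte("example data encryption key")
	sealed, err := encryptChecked(aead, dek)
	if err != nil {
		panic(err)
	}
	if err := persistChecked(aead, memStore{}, "dek-1", sealed, dek); err != nil {
		panic(err)
	}
	fmt.Println("both assertions passed")
}
```

Wrapping each failure with the assertion number keeps the two cases distinguishable in logs, which is the property the investigation relied on.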

We deployed the assertions and waited.

About two months later, the errors struck again (several of them, not just one) and, to our surprise, both assertions failed. Decryption returned an authentication-failure error immediately after encryption, ruling out the storage layer entirely. We also logged the corrupted ciphertexts so we could analyze them after the fact, and we could see that the trailing 16 bytes (the AES-GCM authentication tag) were wrong, while the rest of the ciphertext looked sane, at least for the samples we saw.
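To make the layout concrete: in Go's crypto/cipher, Seal returns the ciphertext body followed by the 16-byte tag. The sketch below (splitSealed, sealWithBadTag, and newGCM are hypothetical helpers, and the bit flip is simulated in software) shows that a single wrong bit in the tag is enough to produce the same error while leaving the body byte-for-byte unchanged.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"fmt"
)

// newGCM returns an AES-256-GCM AEAD with an all-zero key (demo only).
func newGCM() (cipher.AEAD, error) {
	block, err := aes.NewCipher(make([]byte, 32))
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// splitSealed separates Seal's output into the ciphertext body and the
// trailing authentication tag (16 bytes for standard AES-GCM).
func splitSealed(aead cipher.AEAD, sealed []byte) (body, tag []byte) {
	cut := len(sealed) - aead.Overhead()
	return sealed[:cut], sealed[cut:]
}

// sealWithBadTag seals plaintext and then flips one bit of the tag,
// leaving the ciphertext body exactly as Seal produced it.
func sealWithBadTag(aead cipher.AEAD, nonce, plaintext []byte) []byte {
	sealed := aead.Seal(nil, nonce, plaintext, nil)
	_, tag := splitSealed(aead, sealed)
	tag[0] ^= 0x01
	return sealed
}

func main() {
	aead, err := newGCM()
	if err != nil {
		panic(err)
	}
	nonce := make([]byte, aead.NonceSize()) // fixed nonce, acceptable only in a demo
	msg := []byte("example DEK bytes")

	good := aead.Seal(nil, nonce, msg, nil)
	bad := sealWithBadTag(aead, nonce, msg)

	goodBody, _ := splitSealed(aead, good)
	badBody, badTag := splitSealed(aead, bad)
	_, err = aead.Open(nil, nonce, bad, nil)

	fmt.Println("tag bytes:", len(badTag))
	fmt.Println("body unchanged:", bytes.Equal(goodBody, badBody))
	fmt.Println("open error:", err)
}
```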

In other words: encryption itself was producing bad output. Something fundamental about AES-GCM was broken in some pathological cases.

Narrowing to a Node

The assertion logs gave us more than just a failure signal: they included the pod identifier where each failure happened. When we joined those pod IDs against the host names (node IDs), a striking pattern emerged: every failure within a given time window came from pods running on the same Kubernetes node. We saw failures across different workloads, different data, and different code (not even just Colossus), but the same-node pattern remained.

This finally explained why only a handful of environments were disproportionately affected. It wasn't about the data, the workload, or the customer. It was about which physical node the scheduler happened to place the pod on.

Unfortunately, by the time we made this connection, the suspect node had already been recycled out from under us by the cloud provider's normal lifecycle. We had to wait for the problem to recur. And it did recur, roughly every two months. After two false starts where the suspect node was decommissioned before we could investigate, we finally caught one alive. We immediately cordoned it (preventing Kubernetes from scheduling new pods onto it) and got to work.

Catching the Faulty Machine

With the node cordoned and ours to study, we deployed a debug pod on it and ran millions of encrypt/decrypt cycles using a minimal Go program. The issue finally reproduced consistently. After a year of chasing this in production, being able to trigger it on demand changed everything.
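A stress loop of this kind can be very small. The sketch below is an illustration under stated assumptions, not the exact program used in the investigation: roundTrip and stress are hypothetical names, and the key size (32 bytes), message size, and iteration count are arbitrary choices. On healthy hardware it should report zero failures.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// roundTrip encrypts and decrypts one random message under a random key and
// nonce, and reports whether the plaintext came back intact.
func roundTrip(size int) (bool, error) {
	buf := make([]byte, 32+12+size) // key || nonce || plaintext
	if _, err := rand.Read(buf); err != nil {
		return false, err
	}
	key, nonce, plaintext := buf[:32], buf[32:44], buf[44:]

	block, err := aes.NewCipher(key)
	if err != nil {
		return false, err
	}
	aead, err := cipher.NewGCM(block)
	if err != nil {
		return false, err
	}
	sealed := aead.Seal(nil, nonce, plaintext, nil)
	got, err := aead.Open(nil, nonce, sealed, nil)
	return err == nil && bytes.Equal(got, plaintext), nil
}

// stress runs n round trips and returns the number of failures.
func stress(n, size int) (int, error) {
	failures := 0
	for i := 0; i < n; i++ {
		ok, err := roundTrip(size)
		if err != nil {
			return failures, err
		}
		if !ok {
			failures++
			fmt.Printf("iteration %d: round trip failed\n", i)
		}
	}
	return failures, nil
}

func main() {
	const iterations = 200_000 // raise to many millions for a real soak test
	failures, err := stress(iterations, 256)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d failures in %d round trips\n", failures, iterations)
}
```

Fresh random inputs on every iteration matter here, because the fault described in this post depended on the operand values.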

Next, we ran the same program in three configurations to isolate which component was misbehaving:

Build	Hardware-accelerated?	Reproduced?
Our Production Crypto Lib	Yes	Yes
Pure Go (purego build tag)	No	No
Default Go stdlib	Yes	Yes

The issue reproduced only when hardware-accelerated crypto instructions were in use. But this still left a question: was the problem in how Go or BoringSSL invoked those instructions, or in the instructions themselves? To find out, we ran the same test in Python, which uses a completely different cryptographic implementation, OpenSSL. It reproduced there too.

The common denominator wasn't any software library. It was something below the library: the operating system, the hypervisor, or the CPU.

Down to the Instruction

To understand how we narrowed the suspect down further, a quick primer on AES-GCM. It is composed of two operations:

AES-CTR encryption, which transforms plaintext into ciphertext.
GHASH authentication, which computes a 16-byte authentication tag over the ciphertext for integrity verification. GHASH uses the PCLMULQDQ instruction for carry-less multiplication—a core operation in Galois field arithmetic.
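To make "carry-less" concrete: the operation is ordinary long multiplication with every addition replaced by XOR, so no carries ever propagate between bit positions. Go has no portable intrinsic for PCLMULQDQ, so the sketch below is a slow, bit-by-bit software reference (clmul64 is a hypothetical name) for multiplying two 64-bit values into a 128-bit result.

```go
package main

import "fmt"

// clmul64 returns the 128-bit carry-less product of a and b as (hi, lo).
// For every set bit i of b, a shifted left by i is XORed into the result.
func clmul64(a, b uint64) (hi, lo uint64) {
	for i := uint(0); i < 64; i++ {
		if (b>>i)&1 == 1 {
			lo ^= a << i
			if i > 0 {
				hi ^= a >> (64 - i) // bits of a that spill into the upper word
			}
		}
	}
	return hi, lo
}

func main() {
	// Ordinary multiplication gives 3 * 3 = 9; without carries the result is 5,
	// because (x + 1)^2 = x^2 + 1 when coefficients are added modulo 2.
	hi, lo := clmul64(3, 3)
	fmt.Printf("hi=%#x lo=%#x\n", hi, lo)
}
```

A reference like this is far too slow for production use, but it is useful as an independent answer to compare hardware results against.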

We isolated each operation in turn:

AES-CTR alone (encryption without authentication) in billions of iterations returned no errors.
GHASH alone (authentication path with crafted inputs) reproduced the error in a few million iterations.

The problem was in GHASH, which pointed to the one special instruction GHASH relies on: PCLMULQDQ.

To put the final nail in the coffin, we wrote a small C program that directly invoked the PCLMULQDQ instruction via inline assembly, bypassing every abstraction. We validated the program's correctness on a known-good machine, then ran it on the faulty node:


!!! GENUINE SILICON FAULT CONFIRMED !!!
Iteration: 19690169
Operand A     : 0xc57ded290fecde69b4c724259c54dcf0
Operand B     : 0xf7db6ff66057b68c9b60371f02762cc4
Expected (SW) : 0x556f27e77e03aa93a418d6d4672077c0
Actual (HW)   : 0x556f27e77e03ab93a418d6d4672077c0

Spot the difference? The expected result has aa in one byte; the hardware returned ab. A single bit flip—specifically, bit 8 of the upper 64-bit word was set to 1 instead of 0. A carry-less multiplication was producing what we termed a ghost carry.
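The flipped bit can be located mechanically by XOR-ing the two results. A minimal Go sketch, using the upper 64-bit words of the expected and actual values printed above (flippedBits is a hypothetical helper, not part of the original C program):

```go
package main

import (
	"fmt"
	"math/bits"
)

// flippedBits returns the positions (0 = least significant) of every bit
// that differs between expected and actual.
func flippedBits(expected, actual uint64) []int {
	var positions []int
	for diff := expected ^ actual; diff != 0; diff &= diff - 1 {
		positions = append(positions, bits.TrailingZeros64(diff))
	}
	return positions
}

func main() {
	// Upper 64-bit words of "Expected (SW)" and "Actual (HW)".
	const expectedHi, actualHi = 0x556f27e77e03aa93, 0x556f27e77e03ab93
	fmt.Println(flippedBits(expectedHi, actualHi)) // prints [8]
}
```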

This pattern repeated consistently across millions of iterations. It wasn't random bit rot or a cosmic ray. It was a deterministic fault on this specific piece of silicon or, more precisely, on this specific virtual machine running on this specific physical host behind this specific hypervisor.

Why Didn't Standard CPU Tests Catch This?

We checked the errata sheet for the CPU model (an Intel Xeon E5-2673 v4, an Azure-specific SKU) and found a few potentially relevant known issues, including one related to AES instructions and another about Hyper-Threading causing unpredictable behavior. We had no clean way to determine which patches had been applied, though: the VM reported its microcode revision as 0xffffffff, which is the default value exposed to guests on this hypervisor, not a deliberate concealment.

We also tried to reproduce the issue on a different VM with the same CPU model. It didn't reproduce, suggesting that this was localized faulty hardware on a specific physical host rather than a systemic CPU design flaw. Standard provisioning-time stress tests (memtest86, mprime, stress-ng, and similar) hadn't flagged anything either, because those tools don't exercise specialized instructions like PCLMULQDQ in the patterns required to trip this fault.

This kind of fault is particularly insidious. It doesn't cause crashes, kernel panics, or system instability. It doesn't affect general-purpose computation. It only manifests during one specific cryptographic operation and even then only for a vanishingly small fraction of inputs. Without authenticated encryption to catch the bad tag, this could have silently corrupted data for who knows how long before anyone noticed. We owe the visibility entirely to AES-GCM's authentication tag.

We reported the affected host to Azure. Their team was unable to determine a definitive root cause, but they applied remediations on the underlying hardware. We have not seen the bug recur since.

Recovery: The Silver Lining

The exact mechanism of the bug gave us an unexpected gift: the actual data was never corrupted. AES-GCM encrypts data using AES in CTR mode (which worked correctly on the faulty CPU) and then appends an authentication tag computed by GHASH (which was wrong). That means the ciphertext itself is intact; only the integrity tag is corrupt.

We could therefore recover affected data by performing unauthenticated AES-CTR decryption, bypassing the corrupted GHASH tag. Of course, this sacrifices the integrity guarantee that GCM provides. But Colossus had extra guarantees, such as SHA-256 and quickxor hashes and immutability of data, along with manual checks, which allowed us to verify integrity through alternative channels before trusting the decrypted output.

With this insight, we were able to build a recovery path for when all else fails.
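A minimal sketch of the idea in Go, under stated assumptions: a standard 12-byte nonce, a 16-byte tag, no dependence on associated data, and an independent SHA-256 digest of the plaintext available for verification. The function names are hypothetical and this is not the Colossus recovery code. With a 12-byte nonce, GCM reserves counter value 1 for the tag and starts encrypting data at counter value 2, so plain AES-CTR with that starting block reproduces the keystream.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"crypto/sha256"
	"errors"
	"fmt"
)

// sha256Check returns an integrity check that accepts only plaintext whose
// SHA-256 digest equals want (a digest stored independently of the tag).
func sha256Check(want [sha256.Size]byte) func([]byte) bool {
	return func(p []byte) bool { return sha256.Sum256(p) == want }
}

// recoverWithoutTag decrypts an AES-GCM message while ignoring its tag, and
// returns the plaintext only if the independent check accepts it.
func recoverWithoutTag(key, nonce, sealed []byte, check func([]byte) bool) ([]byte, error) {
	const tagSize = 16
	if len(nonce) != 12 || len(sealed) < tagSize {
		return nil, errors.New("recover: unexpected nonce or message length")
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	// Initial counter block: nonce || 0x00000002. Go's CTR increments all 128
	// bits while GCM wraps the low 32; they agree for messages under ~64 GiB.
	iv := make([]byte, aes.BlockSize)
	copy(iv, nonce)
	iv[15] = 2

	body := sealed[:len(sealed)-tagSize]
	plaintext := make([]byte, len(body))
	cipher.NewCTR(block, iv).XORKeyStream(plaintext, body)

	if !check(plaintext) {
		return nil, errors.New("recover: independent integrity check failed")
	}
	return plaintext, nil
}

// sealWithBadTag is a demo helper: it seals msg with AES-GCM, flips one bit
// of the tag, and confirms that normal authenticated decryption now fails.
func sealWithBadTag(key, nonce, msg []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	aead, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	sealed := aead.Seal(nil, nonce, msg, nil)
	sealed[len(sealed)-1] ^= 0x01
	if _, err := aead.Open(nil, nonce, sealed, nil); err == nil {
		return nil, errors.New("demo: tampered tag was unexpectedly accepted")
	}
	return sealed, nil
}

func main() {
	key := make([]byte, 32)   // all-zero demo key
	nonce := make([]byte, 12) // fixed demo nonce
	msg := []byte("a data pack that spans more than one sixteen-byte AES block")

	sealed, err := sealWithBadTag(key, nonce, msg)
	if err != nil {
		panic(err)
	}
	got, err := recoverWithoutTag(key, nonce, sealed, sha256Check(sha256.Sum256(msg)))
	fmt.Println("recovered:", err == nil && bytes.Equal(got, msg))
}
```

Passing the integrity check in as a function keeps the decryption step from ever returning unverified bytes, whichever independent signal is used.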

A word of caution for anyone tempted to borrow this idea: never run unauthenticated decryption without an independent integrity check. The whole point of GCM's authentication tag is to detect adversarial tampering. Bypassing it without a separate, trustworthy integrity signal is a security regression, not a recovery strategy.

Lessons Learned

So what did we take away from this experience? A few things:

Hardware isn't infallible. As software engineers, we tend to treat CPU instructions as axioms. If PCLMULQDQ says the answer is X, then the answer is X. Virtualized cloud environments add layers of abstraction (hypervisors, live migrations, aging silicon) that can quietly violate this assumption in ways that are genuinely hard to anticipate.

Authenticated encryption earns its keep. A single bit flip in raw AES-CTR ciphertext would have corrupted data with no detectable signal. The GCM authentication tag is exactly what turned a silent corruption into a loud, immediate, and easy-to-instrument error. The visibility that let us find this bug at all came from authenticated encryption doing its job.

Production assertions are underrated. The breakthrough in this investigation did not come from deeper code review or more sophisticated tooling. It came from a four-line "encrypt, then immediately decrypt" assertion placed in the hot path. It is cheap, it runs in production traffic, and it caught what weeks of audits and race detection could not.

Design for recovery, not just correctness. The same property that made AES-GCM detect the bug also made the data recoverable: encryption and authentication are separate operations, with separate failure modes. Storing independent checksums alongside encrypted data turned out to be the safety net that prevented data loss when GCM's own integrity signal could no longer be trusted.

Validate hardware on the operations you actually use. Standard CPU stress tests didn't catch this fault because they don't exercise specialized instructions like PCLMULQDQ. Teams running cryptographic workloads on cloud VMs should consider running targeted hardware validation—millions of encrypt/decrypt cycles using the actual instructions their workloads depend on—when new nodes are provisioned, before trusting them with production data.

"When you have eliminated the impossible, whatever remains, however improbable, must be the truth" (Sherlock Holmes, via Sir Arthur Conan Doyle). We spent months assuming the bug had to be in our code. It wasn't until we proved that the hardware itself was lying that the pieces fell into place. Sometimes the right debugging strategy is to write a 20-line C program that calls one CPU instruction in a loop.

This bug took over a year to fully diagnose, from the first reports in April 2024 to catching the faulty machine in mid-2025. It involved auditing thousands of lines of Go code, deploying production assertions, cordoning Kubernetes nodes, writing inline assembly tests, and reading Intel errata sheets. In the end, the root cause was a localized hardware fault on a single cloud VM.

The fix? Take the machine out of the pool, ship self-healing recovery for future occurrences, and never forget: your CPU can lie to you.
