TL;DR: Software tests often fail to catch bugs involving complex, unanticipated failure conditions. Using random inputs and fault injection in tests can trigger such bugs, but detecting them can be challenging. At Rubrik, we have used this approach to great success and have devised guidelines for applying such “fuzz testing” to new situations. A key observation is that a fuzz test can find many related bugs in different parts of the code.

A mission-critical product should perform reliably not only in typical conditions but also in exceptional circumstances such as unique workloads, flaky networks, and third-party API failures. This is especially true for Rubrik’s data management applications, which provide customers a last line of defense when adverse conditions may have already compromised their primary workloads. In this post, I discuss how Rubrik uses fuzz testing to harden our product by subjecting it to random inputs and errors. This helps us elicit rare, unintuitive bugs during development to avoid impacting a customer, possibly during an emergency.

A Simple Example

Suppose we implement the following simplified deduplicating backup system. We hash ingested content using SHA-256 and keep an index in a database that maps each hash to the location of a file containing the actual content that generated the hash. If two contents are identical, they may share the same row in the database, but a column in this row stores a list with a distinct reference for each. Two identical contents could also be stored in two different rows that have the same hash but are distinguished by a nonce that is appended to the hash. Deleting content is a two-step process. First, the reference is removed from the above list. Second, when a background garbage collection job notices the list of references is empty, the database row and corresponding file are deleted. (New references are not allowed to be added once the reference list becomes empty.)
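To make the moving parts concrete, here is a minimal sketch of such an index, with an in-memory dict standing in for the database and the class and method names invented for illustration (an assumption of the sketch, not an actual schema):

```python
import hashlib
import uuid

# Minimal in-memory sketch of the index described above. The "database" is a
# dict mapping (hash, nonce) -> {"refs": [...], "file": ...}.
class DedupIndex:
    def __init__(self):
        self.rows = {}  # (sha256_hex, nonce) -> {"refs": list of references, "file": str}

    def find_row(self, content_hash):
        """Return the key of a live row for this hash, if any (stands in for an index lookup)."""
        for key, row in self.rows.items():
            if key[0] == content_hash and row["refs"]:
                return key
        return None

    def add_row(self, content_hash, ref, file_path):
        """Insert a new row; the random nonce keeps rows with identical hashes distinct."""
        key = (content_hash, uuid.uuid4().hex)
        self.rows[key] = {"refs": [ref], "file": file_path}
        return key

    def add_ref(self, key, ref):
        """Attach another reference to an existing row (deduplication)."""
        if not self.rows[key]["refs"]:
            raise ValueError("new references may not be added once the list is empty")
        self.rows[key]["refs"].append(ref)

    def remove_ref(self, key, ref):
        """Step one of deletion: drop the caller's reference."""
        self.rows[key]["refs"].remove(ref)

    def garbage_collect(self):
        """Step two of deletion: reap rows whose reference list has become empty."""
        for key in [k for k, row in self.rows.items() if not row["refs"]]:
            del self.rows[key]  # the corresponding content file would also be deleted here

def sha256_hex(content: bytes) -> str:
    """Hash ingested content, as described above."""
    return hashlib.sha256(content).hexdigest()
```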

Now, suppose the backup job uses the following process (a code sketch follows the list):

  1. Pull data from a source.
  2. Hash the content.
  3. Query the database using the hash to see if the content already exists.
  4. If so, update the existing database entry to include an additional reference to the pre-existing content and discard the pulled data. If there is a database error, such as a timeout, fall back to step 5 below for adding a new database row.
  5. If the content did not exist or there was an error above, add a new database entry for the hash with a nonce appended to ensure uniqueness. If row insertion results in an error, delete the row with infinite retries to ensure it is not leaked, discard the pulled data, and fail the job. If row insertion succeeds, store the content in the file and report job success.
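Translated into code, the job might look roughly like the sketch below. The db_client and file_store interfaces are assumed for illustration; the error handling mirrors steps 4 and 5 above, including the fallback and the retried delete that the next paragraphs examine.

```python
import hashlib
import uuid

class DatabaseError(Exception):
    """Any failure observed by the database client, such as a connection timeout."""

def run_backup_job(db_client, file_store, source, ref):
    # Step 1: pull data from the source.
    data = source.pull()
    # Step 2: hash the content.
    content_hash = hashlib.sha256(data).hexdigest()

    # Steps 3 and 4: look up the hash and, if the content already exists, add a
    # reference to the pre-existing row and discard the pulled data. On a database
    # error (e.g., a timeout), fall back to step 5 and insert a new row instead.
    try:
        existing_key = db_client.find_row(content_hash)
        if existing_key is not None:
            db_client.add_ref(existing_key, ref)
            return "success"
    except DatabaseError:
        pass

    # Step 5: add a new row keyed by the hash plus a fresh nonce to ensure uniqueness.
    new_key = (content_hash, uuid.uuid4().hex)
    try:
        db_client.insert_row(new_key, ref)
    except DatabaseError:
        # Delete the row with infinite retries so it is not leaked, then fail the job.
        while True:
            try:
                db_client.delete_row(new_key)
                break
            except DatabaseError:
                continue
        return "failure"

    file_store.write(new_key, data)   # store the content in a file and report success
    return "success"
```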

An observant team member may notice that there is a bug in the above pseudocode. If the database query that adds the reference in step 4 fails, we assume that the reference was not added and instead add a new row to the database in step 5. But what if step 4’s error was a connection timeout and the database server actually did add the reference despite the failure observed by the client? The result of this oversight is not benign. The original index entry for the ingested object’s hash now contains a leaked reference. As a result, its reference list will never be emptied and garbage collection will never clear this copy of the data.

This clever developer may add a unit test that creates a fake database object and simulates this sequence of events to reproduce the bug, add a code fix to delete the reference with infinite retries in case of failure in step 4 above, and then high-five their teammates for avoiding support cases regarding excessive storage use. But there is still a problem.
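Such a test might look roughly like the sketch below, where a hypothetical fake database applies the reference on the “server” side before surfacing a timeout to the client -- precisely the sequence that triggers the leak.

```python
import unittest

class FakeDedupDb:
    """Fake database whose add_ref call succeeds server-side but then reports a
    timeout to the client -- the failure mode described for step 4."""
    def __init__(self):
        self.rows = {}                                  # (hash, nonce) -> list of references

    def insert_row(self, key, ref):
        self.rows[key] = [ref]

    def add_ref(self, key, ref):
        self.rows[key].append(ref)                      # the server applies the update...
        raise TimeoutError("simulated connection timeout")   # ...but the client sees an error

class LeakedReferenceTest(unittest.TestCase):
    def test_timeout_after_server_side_success_must_not_leak_a_reference(self):
        db = FakeDedupDb()
        db.insert_row(("hash-A", "nonce-1"), "ref-1")

        # The backup job's step 4: try to deduplicate; on error, fall back to
        # inserting a brand-new row (step 5).
        try:
            db.add_ref(("hash-A", "nonce-1"), "ref-2")
        except TimeoutError:
            db.insert_row(("hash-A", "nonce-2"), "ref-2")

        # This assertion fails against the buggy logic (reproducing the leak) and
        # passes once the fix deletes the possibly-added reference with retries.
        self.assertNotIn("ref-2", db.rows[("hash-A", "nonce-1")])

if __name__ == "__main__":
    unittest.main()
```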

While the developer was busy celebrating, they neglected to notice that there was a second, subtler bug in the above pseudocode. Do you see it? The bug in step 4 results from forgetting to clean up a reference if the database client fails but the server succeeds. In step 5, on the other hand, database cleanup was not forgotten, but a crucial point was missed. If step 5’s insert query succeeded on the server side despite failing on the client, a different backup job could already have ingested matching content and added a second reference to this row. The error-handling code in step 5 would then be deleting live content that is supposed to exist in the system, since that other backup job may already have reported success.

So, the clever developer found a relatively minor space leak bug that would at least have had recourse (e.g., clean up the leaked data), and missed a similar corruption bug that would have no recourse. A customer who was affected by the second bug would simply not be able to recover the affected data.

How Could This Have Been Avoided?

To prevent these types of bugs, one option would be to simply try harder. You could have wikis and tech talks discussing race conditions and dangerous coding patterns. You could promote more stringent code reviews to look out for pitfalls. You could ask engineers to be careful to implement more test cases that include exceptional inputs and failures. All of these would be valid actions to consider, but they do not get to the heart of the issue. They are all variations on making your engineering team more “clever.” While fostering engineering expertise and promoting a culture of rigor are laudable goals, it is impossible to even imagine all possible ways in which a system may fail catastrophically.

The key challenge with avoiding such failures is that they involve nondeterminism. Concurrency and error-handling combine to obfuscate program control flow so that it cannot be predicted by mere inspection. Bugs are concealed by the unfathomable number of possibilities for how races may be resolved and what possible states the server could be in after client failures. Learning relevant design patterns and idioms as well as cultivating keen intuition can help. But it is better to rely on a simpler process so product robustness does not depend on uniform superhuman levels of effort and skill.

To find out what that might be, let’s ponder the testing strategy taken by the clever developer above. They noticed a bug, and to reproduce it they wrote a precise unit test with a scripted database interaction, complete with failures. Imagine they had thought one step further and asked, “How could I have written a test that would catch this bug without the benefit of hindsight -- without knowing what the bug was or even whether there was a bug?” But how can we find a bug if we cannot even see it?

Step 1: Reproduce the bug organically.

In production, this kind of bug manifests when a specific combination of errors and race conditions occurs and results in unacceptable behavior of the system. Following this logic, if we simply run a system for a long time with lots of random data, random failures, and high concurrency, we may reproduce the bug. But what do we mean by “unacceptable behavior of the system”? And how do we know that it has occurred?

Step 2: Detect the bug.

Suppose a backup job fails. Should we fail the test? In a traditional test without randomly injected failures, this may be one valid assertion. However, the injection of random errors may cause this and many similarly rigid assertions to fail in cases where there are no actual bugs.

On the other hand, suppose a restore job successfully completes, but the content of the snapshot is corrupted. In this case, no amount of retries could correct the potential harm done, even if a future restore attempt returned an uncorrupted version of the snapshot. Therefore, we could halt the test at this point and report that it failed. We refer to this property of the system as correctness -- no operation in the system should ever successfully return an incorrect value.

Is it sufficient to require only the correctness property? Suppose 100 recurring backup jobs are snapshotting 100 objects. While some jobs are failing, most are succeeding, and no corruptions have been detected. At first glance, this doesn’t seem too bad. But what if 99 backup jobs are mostly succeeding, and one job is stuck failing every attempt? This pattern indicates a bug that will cause backup compliance failures in production.

So it appears it may be ok for all jobs to fail some of the time, but not ok for some jobs to fail all of the time. We refer to this property of the system as progress -- every operation in the system (jobs, retried API calls, etc.) may fail sporadically or hang temporarily, but should eventually succeed within a reasonable amount of time.
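One simple way to approximate “should eventually succeed within a reasonable amount of time” is to cap consecutive failures per operation. The hedged sketch below expresses the progress property as a predicate over per-job outcome histories; the threshold and names are illustrative.

```python
from typing import Dict, List

def satisfies_progress(outcomes: Dict[str, List[bool]], max_consecutive_failures: int = 5) -> bool:
    """Progress: every job may fail sporadically, but no job may be stuck failing.

    `outcomes` maps a job id to its chronological attempt results (True = success).
    A job violates progress if it fails more than `max_consecutive_failures` in a row.
    """
    for results in outcomes.values():
        streak = 0
        for ok in results:
            streak = 0 if ok else streak + 1
            if streak > max_consecutive_failures:
                return False    # this job is effectively stuck
    return True

# All jobs fail some of the time: acceptable.
assert satisfies_progress({"job-1": [True, False, True], "job-2": [False, True, True]})
# One job fails all of the time: a progress violation.
assert not satisfies_progress({"job-99": [True] * 10, "job-100": [False] * 10})
```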

A test that follows the above methodology of fault injection and random inputs is often called a “fuzz test,” which is the term I will use in the rest of this post.¹ We have used such tests at Rubrik to harden many core parts of our product, and they have found many subtle issues during development. Even though the systems being tested may be different, we have found that fuzz test development usually follows a common template. In the next sections, we will expand on the ideas above to provide guidelines for successfully implementing fuzz testing, and discuss some of the challenges encountered.

By the way, the above example in which a race involving deduplication caused data loss is not contrived. It is similar to an actual bug encountered by my team. However, we caught this bug with a fuzz test early in development. Instead of hurting our customers and causing ourselves to frantically troubleshoot the issue and lose lots of sleep, the bug was quickly detected, located, and fixed months before it would have had the opportunity to impact a customer.

The Elements of a Fuzz Test

Step 1: Start Unfuzzy

An essential first step to building the kind of fuzz test mentioned above is to start with an ordinary unfuzzy test that simulates the behavior of the system and checks for errors. One part of such a test is a source of truth that can be used to check the correctness property mentioned above. For example, a source of truth for the above backup system could be a real or fake data source that will retain a history of all object versions that have ever existed so that restored contents can be checked for integrity. In some cases, the source of truth may need to account for error injection. Consider the example of a fuzz test for a database. Here, the source of truth would include a record of transactions that the client attempted, whether they succeeded or failed, and their order. Due to injected failures, multiple outputs from the tested database may be considered correct because the database is sometimes left in an undetermined state after a transaction timeout or other failure.
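For the backup example, a fake data source that doubles as the source of truth might look like this sketch (class and method names are assumptions for illustration): it hands content to the system under test while remembering every version it has ever produced, so any restore can later be verified byte for byte.

```python
import hashlib
import random
from typing import Dict, Tuple

class FakeDataSource:
    """Fake data source that is also the source of truth: it retains every version
    of every object it ever served, so restored data can be checked for integrity."""

    def __init__(self, seed: int = 0):
        self._rng = random.Random(seed)
        self._history: Dict[Tuple[str, int], bytes] = {}   # (object_id, version) -> content
        self._latest_version: Dict[str, int] = {}

    def produce_version(self, object_id: str, size: int = 1024) -> Tuple[int, bytes]:
        """Generate a new random version of an object and record it forever."""
        version = self._latest_version.get(object_id, 0) + 1
        content = bytes(self._rng.getrandbits(8) for _ in range(size))
        self._history[(object_id, version)] = content
        self._latest_version[object_id] = version
        return version, content

    def expected_content(self, object_id: str, version: int) -> bytes:
        """What a correct restore of this snapshot must return."""
        return self._history[(object_id, version)]

    def verify_restore(self, object_id: str, version: int, restored: bytes) -> bool:
        """Correctness check: a successful restore must match the recorded truth."""
        expected = self._history[(object_id, version)]
        return hashlib.sha256(restored).digest() == hashlib.sha256(expected).digest()
```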

Step 2: Add Fuzz

We need a source of nondeterminism to trigger many unexpected code paths in the above simulation. Although the exact list of such sources will vary from test to test, some common examples include:

  • The presence of network timeouts or other network errors.
  • Unexpected clock behavior such as time jumps.
  • Injected memory leaks that trigger sporadic out-of-memory errors.
  • Random crashes of the process running the code that is being tested.
  • Hard resets of nodes.

Additionally, external API query results should have errors randomly injected into them, although injected results should remain consistent with the API’s specification unless violations of that specification have actually been observed. Input data should be randomized to exercise as many boundary cases as possible. One specific source of nondeterminism that is worth repeating is a client failure that leaves the server in an undetermined state. To simulate this, database dependencies should be wrapped so that transactions randomly fail both before and after they are processed inside the database. Nondeterminism is most useful when applied to external calls or environmental conditions the developer cannot control. Therefore, intercepting internal calls to a library and randomly throwing exceptions will likely provide less value.
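For the database case in particular, a wrapper along the following lines can inject a failure either before the transaction reaches the server or after the server has already applied it -- exactly the “client failed, server succeeded” state discussed above. The wrapped client’s execute interface is an assumption of this sketch.

```python
import random

class InjectedDbError(Exception):
    """Error surfaced to the client by the fault-injecting wrapper."""

class FaultInjectingDbClient:
    """Wraps a real (or fake) database client and randomly fails transactions
    either before or after they are processed inside the database."""

    def __init__(self, inner, failure_rate: float = 0.05, seed: int = 0):
        self._inner = inner
        self._failure_rate = failure_rate
        self._rng = random.Random(seed)

    def execute(self, transaction):
        # Fail before the server sees the transaction: nothing was applied.
        if self._rng.random() < self._failure_rate:
            raise InjectedDbError("injected failure before execution")

        result = self._inner.execute(transaction)   # the server-side effect happens here

        # Fail after the server applied the transaction: the client observes an
        # error even though the server-side state has already changed.
        if self._rng.random() < self._failure_rate:
            raise InjectedDbError("injected failure after execution")

        return result
```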

Step 3: Check Correctness

As suggested above, we need a correctness checker to validate all work done by the system. We can start with standard assertions from the original unfuzzy test. For example, to test a backup system, we may take a snapshot from a data source, then read the snapshot back to compare it to the source of truth. However, retries may be needed to handle injected failures. Some assertions may even need to be removed if failure injection makes them impossible to guarantee, even with retries.

Since some correctness assertions are weakened or removed when converting an unfuzzy test to a fuzz test, it is useful to be strict in other ways. White-box verifications of invariants are one way to catch latent bugs, and their violations are often easier to reproduce than API violations. For example, in the above deduplicating data store, we could perform a white-box check for leaked references. Without such an assertion, the test may only fail if storage runs out. In our experience, fuzz tests are especially likely to produce invariant violations. So even though white-box checks are harder to maintain than black-box ones, they are often worth the extra effort. In addition, asserting that injected failures occur as expected is a useful sanity check. For example, if clock jumps trigger error-handling code, we may want to assert that this code is executed during a fuzz test to ensure that we are exercising this failure mode.
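Concretely, the checks for the backup example might look like the sketch below: restores are retried to tolerate injected failures, but any restore that succeeds must match the source of truth, and a white-box scan looks for leaked references. The helper names and data shapes are assumptions for illustration.

```python
import time

def check_restore_correctness(backup_system, source_of_truth, object_id, version,
                              max_attempts=20, backoff_seconds=0.5):
    """A restore may fail transiently due to injected faults, so retry it.
    But any restore that *succeeds* must return exactly the right bytes."""
    for _ in range(max_attempts):
        try:
            restored = backup_system.restore(object_id, version)
        except Exception:
            time.sleep(backoff_seconds)        # transient failures are acceptable
            continue
        expected = source_of_truth.expected_content(object_id, version)
        assert restored == expected, (
            f"corruption: restore of {object_id} v{version} returned wrong content")
        return
    raise AssertionError(f"restore of {object_id} v{version} never succeeded")

def check_no_leaked_references(index_rows, live_references):
    """White-box invariant: every reference stored in the index must belong to an
    object the test still considers live; anything else has been leaked."""
    for key, row in index_rows.items():
        for ref in row["refs"]:
            assert ref in live_references, f"leaked reference {ref} in row {key}"
```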

Step 4: Check Progress

The fourth component of a fuzz test is a progress checker, which ensures that the system continues to make useful progress and nothing gets stuck. For example, in a backup system, we could assert that no backup job or background maintenance job is hanging or failing many times consecutively. One helpful sanity check at the end of a test is to assert that all recurring jobs succeed at least one additional time and that API calls targeting a random sample of objects all succeed. To ensure success, this end-of-test validation can be performed after all fault injection has been turned off.
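A hedged sketch of such a checker, with hypothetical names: it tracks consecutive failures per operation during the run, and the final validation pass runs with fault injection disabled.

```python
from collections import defaultdict

class ProgressChecker:
    """Tracks consecutive failures per operation (backup jobs, GC runs, API calls,
    ...) and flags anything that appears to be stuck."""

    def __init__(self, max_consecutive_failures: int = 10):
        self._max = max_consecutive_failures
        self._streaks = defaultdict(int)

    def record(self, operation_id: str, succeeded: bool):
        if succeeded:
            self._streaks[operation_id] = 0
        else:
            self._streaks[operation_id] += 1
            assert self._streaks[operation_id] <= self._max, (
                f"{operation_id} failed {self._streaks[operation_id]} times in a row; "
                "it may be stuck")

def end_of_test_validation(fault_injector, recurring_jobs, sampled_api_calls):
    """Final sanity pass: with fault injection disabled, every recurring job and
    every sampled API call must succeed at least once more."""
    fault_injector.disable()
    for job in recurring_jobs:
        assert job.run_once(), f"{job} could not succeed even without injected faults"
    for call in sampled_api_calls:
        assert call(), "API call failed after fault injection was turned off"
```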

This combination of a correctness checker and a progress checker has some properties in common with a monitoring and alerting system for production. Following this analogy, a fuzz test asserts that no alerts are triggered during the test. One advantage a correctness checker has over production monitoring systems is knowledge of the actual correct data since we can control the data source when such a test is run.
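Putting the four elements together, the top level of such a test is often just a loop that keeps the system busy with random work while the checkers run alongside it. The sketch below assumes the hypothetical helpers from the previous steps and an equally hypothetical system interface (backup, expire, and run_garbage_collection returning success flags).

```python
import random

def run_fuzz_test(system, source_of_truth, fault_injector,
                  correctness_checker, progress_checker,
                  iterations=10_000, seed=0):
    rng = random.Random(seed)                 # a fixed seed helps reproduce failures
    fault_injector.enable()
    snapshots = []                            # (object_id, version) pairs we can verify

    for _ in range(iterations):
        action = rng.choice(["backup", "restore_and_verify", "expire", "gc"])
        object_id = f"object-{rng.randrange(100)}"

        if action == "backup":
            version, content = source_of_truth.produce_version(object_id)
            ok = system.backup(object_id, version, content)
            progress_checker.record(f"backup:{object_id}", ok)
            if ok:
                snapshots.append((object_id, version))
        elif action == "restore_and_verify" and snapshots:
            obj, ver = rng.choice(snapshots)
            correctness_checker(system, source_of_truth, obj, ver)
        elif action == "expire" and snapshots:
            obj, ver = snapshots.pop(rng.randrange(len(snapshots)))
            progress_checker.record("expire", system.expire(obj, ver))
        else:
            progress_checker.record("gc", system.run_garbage_collection())

    fault_injector.disable()                  # end-of-test validation runs fault-free
```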

Challenges in Effective Implementation

We have found the above guidelines to be a useful framework for testing a system’s robustness, but they are not a panacea. As fuzz tests are implemented, obstacles will eventually arise.

The first may come right away once you start trying to run the tests. There will inevitably be failures and some will likely be tricky, rare, and hard to reproduce. Compared to ordinary tests, fuzz tests are especially prone to such difficulties. Because of the interplay of concurrency, injected faults, and adversarial inputs, fuzz test failures are more likely to involve a fiendish combination of races and unforeseen code paths. Sometimes a failure does not even indicate a bug because assertions may have been overly restrictive. It is tempting to ignore tricky failures if they are rare or only involve internal invariants. It is important to resist this temptation because a failure that occurs even once in a thousand runs could be a manifestation of a true bug that has the potential to harm both customers and engineering productivity as development time is lost to firefighting.

Because reproducing failures may be difficult, log messages may be the only clues to work with. Therefore, logging improvements should be made in response to any difficulty root-causing failures from logs. As a welcome side effect, troubleshooting production issues from log messages will likely be easier after going through the process of stabilizing fuzz tests and making such improvements.

But when diagnosing from logs is impossible, reproducing a failure may be the only option for understanding an error. So making this process as painless as possible merits attention. Fuzz tests that can run in an accelerated mode using fake dependencies yet still reproduce such issues are an important asset. They allow rapid debugging iteration using a broader suite of tools including an IDE with conditional breakpoints, extra logging, and full access to the internal state of the fake dependencies.

When fuzz tests are stabilized and new code rolls out into production, it is rewarding to see how stable even complex systems are when they have been thoroughly fuzz-tested. For example, Rubrik’s BlobStore manages the data lifecycle for much of our product while supporting deduplication and optimizing garbage collection and compaction. After completing fuzz testing during development, BlobStore was in production for over a year before significant work was needed to resolve an issue.

While fuzz testing can catch many bugs, it is important not to have a false sense of security because no test is perfect. We have seen two broad categories of bug escape that are especially relevant to fuzz tests, in addition to simply missing a relevant source of nondeterminism:

  1. External APIs may be broken. For example, my team depended on an API that returned both a file and its length. We assumed that the length should match that of the file’s data, but this turned out not to be true in rare cases due to a bug in the API. Though defensive programming is prudent, it is not practical to account for every possibility ahead of time. So be prepared to react quickly to inconsistencies observed in production. Besides fixing any relevant bug, add the inconsistency as a new source of nondeterminism in a fuzz test so related bugs can be found immediately and in the future.
  2. The input generator may miss special cases. For example, an error with a specific, long message may trigger buggy error-handling logic. Or a string representing a filename like “~tmp” may be handled differently in some cases. Stumbling across such cases in a fuzz test may be unlikely unless the source of randomness is aware of these special cases. One final example is that the range of sizes of randomly generated inputs may not cover problematic cases, which may prevent coverage of some overflows (e.g., file path length). To reduce such bug escape, strive to cover known special cases in the fuzz test ahead of time. If escape occurs nevertheless, enhance the input generator so it can catch related bugs (see the sketch after this list).
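To make such cases more likely to be exercised, the input generator can deliberately mix known corner cases into otherwise random output. A hedged sketch, with an illustrative (not exhaustive) corner-case list:

```python
import random

class InputGenerator:
    """Produces mostly random filenames and sizes, but deliberately mixes in known
    corner cases so the fuzz test does not rely on stumbling across them."""

    SPECIAL_FILENAMES = ["~tmp", "", ".", "a" * 255, "name with spaces", "名前"]
    SPECIAL_SIZES = [0, 1, 4095, 4096, 4097, 2**31 - 1]   # boundary-style sizes

    def __init__(self, seed: int = 0, special_case_rate: float = 0.1):
        self._rng = random.Random(seed)
        self._rate = special_case_rate

    def filename(self) -> str:
        if self._rng.random() < self._rate:
            return self._rng.choice(self.SPECIAL_FILENAMES)
        return "file-" + "".join(self._rng.choices("abcdefghij0123456789", k=12))

    def size(self) -> int:
        if self._rng.random() < self._rate:
            return self._rng.choice(self.SPECIAL_SIZES)
        return self._rng.randrange(1, 1_000_000)
```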

Due to the above, we still occasionally see problems in production in some rigorously fuzz-tested software components. This is not a reason to abandon fuzz testing in general. Rather, it motivates treating fuzz testing as just one important layer in a comprehensive quality assurance plan that includes active surveillance of production. Whenever a problem escapes testing, closed-loop analysis may suggest new test enhancements that can catch not only the escaped bugs but also many potential similar ones.

Take-Home Lessons: Proactive Versus Reactive Testing

Unfuzzy case-based testing is an inherently reactive approach to the quality assurance of a system. The cases tested are limited to those that developers think of ahead of time, and those that are observed later, often in production. Either way, the test suite is closely fitted to specific success and failure conditions, either imagined or observed. As a result, the product is made robust primarily in response to customer complaints.

Conversely, fuzz tests are proactive. They discover bugs by testing cases that have not been imagined by developers, nor seen in production as bugs. To make the most of them, when a production failure does occur, we must not simply fix the bug, add a regression test, and consider the problem solved. Instead, fuzz tests should be generalized to detect the bug as merely a special case of a broader class of failures. Then we can potentially discover other related bugs rather than waiting for them to occur in production as well.

Machine learning provides a relevant analogy in classification. A classifier for emails may label them as “spam” or “not spam” using previous emails as observations to train a model using the sender, subject, and message body as features. Similarly, a test suite acts as a “classifier” that labels each state of the code as “buggy” or “not buggy” using the result of each test case as a “feature.” Following this analogy, an elaborate set of test cases may “overfit” past observations, while a single fuzz test that covers all of these cases “generalizes.”

To illustrate, suppose we observed 10 production bugs and correspondingly added 10 unfuzzy custom regression test cases. If we did this, we would probably feel less confident about our ability to detect future regressions -- or even current bugs that have not yet manifested -- than if we had just one less-tailored test case that could catch all 10 bugs. Moreover, if each unfuzzy test case included an intricate script of inputs and failures, future code rearrangement would be likely to break these tests and falsely report nonexistent bugs. Diagnosing and fixing such “false positives” requires extra work and may desensitize a team to flakiness caused by true bugs. Worse, a test case may be disabled altogether if it breaks so often that it serves as what is sometimes pejoratively called a “change-detector.”

Fuzz tests by design do not suffer from this problem because their component interactions do not follow a carefully written script, and success checkers are written to be general. We would not recommend eliminating all custom test cases in favor of fuzz tests because there is still a lot of value in enumerating anticipated successes, failures, and boundary cases if they do not require too much maintenance. Nevertheless, the above observations demonstrate that well-designed fuzz tests may in many instances have better “sensitivity” and “specificity” for regressions than overelaborate scenario-based test cases.

By following this philosophy, we have found that we can find a variety of previously unimagined bugs early in the development process. Fuzz testing has helped us to harden many of our core systems, including our cloud-native deduplicating content store, our diff-chain-based BlobStore, and our distributed metadata store to name a few. A key lesson from our experience is that no matter the strength of your engineering team, a product quality strategy that relies on universal clairvoyance is bound to fail. A better approach is to accept that developers are fallible and employ a testing methodology that can find many tricky bugs without requiring omniscient knowledge of one’s codebase and all dependencies.

Fuzz testing cheat sheet:

  • Include an implementation of a “universe” that encapsulates the state of external dependencies of the system and includes a data source to help with end-to-end correctness testing.
  • Identify the space of all inputs and failures that you want to vary, and pay special attention to the possibility of timeouts that leave a server in an undetermined state.
  • Create a correctness checker that ensures every operation the system reports as successful returned a correct result.
  • Consider enhancing the correctness checker with white-box validations of internal invariants.
  • Create a progress checker to ensure that all user-visible and internal processes continue to perform all useful work.
  • Actively monitor production to look for issues that slip through testing, possibly due to violations of contracts with third-party software and services, and adjust fuzz tests so that they can catch failures that evaded detection.
  • Strive for generality as much as possible when enhancing fuzz tests so they catch not only escaped bugs but also other related bugs.
  • Prioritize lightweight versions of fuzz tests that can run fully on a developer’s laptop using fake dependencies, but additionally include heavyweight fuzz tests that use production dependencies and exercise more production code.

 


¹ Fuzz testing is often associated with randomizing inputs to a single process. Here, we are using a broader definition that includes randomized failure injection and multi-process systems. Chaos engineering is a closely related methodology that is more focused on testing a fully integrated distributed system in production with an emphasis on system-level failures.