In a fast-growing engineering team, it becomes less and less realistic for a single engineer to understand every behavior behind a product. In fact, relying on an individual to make the product consistent often leads to disastrous results, especially for features that get less attention. A well-designed framework alleviates these problems by shaping collective developer behavior and promoting better designs, and ultimately makes a better product for our customers.

Recently, I was commissioned to revamp the error message framework in Rubrik’s CDM product. Although this framework seems small compared to others, I learned a few lessons that can be applied to building frameworks of any size, including:

  • Understand your customers, which include both end users and developers
  • Automate as much as possible, leaving fewer chances for human errors
  • Rules do not matter if they are not enforced by code
  • A framework should help scale the engineering organization
  • Building a framework means changing a culture

What Went Wrong

At Rubrik, we pride ourselves on prioritizing our customers’ feedback and using their insights to shape our product roadmap. This happened recently when we received feedback on the quality of our error messages. For reference, here is a real-life message shown to our customers:

Internal server error ‘createSpec is already defined: InternalShardedCreateSpec(b2446ef5-3ef7-4cae-91f4-29cb7c000aea,7169792a-2046-4a11-923f-412a05d3c8f0,NoSharding,112589990684262400,Map(5dd95827-b918-4a0f-941c-18db1292edf2/2b58e37e-5e21-45f3-81d7-0334b06eb1e1/f56bbc1b-c74e-4766-85ee-1fb6f186e4f8_0_112589990684262400 -> ContentId(4b49df71-533e-4cb4-9986-574cd98c09f2,676ef5c1-2afa-41f7-a06b-cfac6adfb2fb’

To us software engineers, this message made perfect sense. It described the exact error and had necessary debugging information. However, it was not easy to decipher for our typical end-user, which made it difficult for them to troubleshoot. As a result, the only real solution for this message was to contact Rubrik support.

Before taking any action to rectify this situation, we first needed to look at how we got here. Since I had been at Rubrik since the early days of its engineering organization, I had some historical perspective. It all started when we needed a way to post activity messages (we call these messages events) to users. At the time, we engineers came up with our own requirements, one of which was that all events should be localizable. Thus, we made each event message a parameterized string. The idea was that these parameterized strings could be handed over to a localization contractor who would then translate them into other languages while keeping the parameters unchanged. For example, a localization contractor may translate the following message string:

Cannot connect to host ‘${hostName}’ at port ${port}.

into the following Simplified Chinese message string:

无法连接到主机‘${hostName}’的端口${port}。

Obviously, if anything inside a parameter’s value needs to be localized, the localization contractor has no way to translate it, because all they ever see are the parameter names.

Unfortunately, the purpose of message parameters was lost as the engineering organization grew. Engineers started to treat these messages as regular string interpolations like the ones in programming languages such as Scala, which is used heavily in our product. 

Engineers first added parameters that take phrases, such as VMware virtual machine and Microsoft SQL Server database. These phrases needed to be localized but didn’t look too bad if they were not. Then, some creative engineers determined there was a need to explain how errors occurred in event messages, so they added a ${reason} parameter to some events. This parameter can take a full sentence created at runtime and has no chance of being localized. It quickly became the biggest loophole in this error message debacle: because it was so convenient, developers could simply catch any exception thrown by lower-level routines and put the exception message into the ${reason} parameter.
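To make the loophole concrete, here is a rough sketch of what the anti-pattern looked like in Scala. The Event class, postEvent function, and mountSnapshot call are hypothetical stand-ins for illustration, not our actual event API:

// Hypothetical stand-ins for the event framework, for illustration only.
case class Event(name: String, params: Map[String, String])
def postEvent(event: Event): Unit = println(event)

def mountSnapshot(snapshotId: String): Unit =
  throw new IllegalStateException(s"createSpec is already defined: $snapshotId")

try {
  mountSnapshot("b2446ef5-3ef7-4cae-91f4-29cb7c000aea")
} catch {
  case e: Exception =>
    // Anti-pattern: the raw exception message, full of internal jargon, lands
    // verbatim in the user-visible event text and can never be localized.
    postEvent(Event("MountSnapshotFailed", Map("reason" -> e.getMessage)))
}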

The need to add the ${reason} parameter was real. It reflected a confusion in our original design: events and error messages were treated as the same thing. However, they are different. Events state that something happened, while error messages describe why it happened. After realizing this, we determined that adding a consistent error message framework was the right solution to this problem.

Know Your Customers

To build this error message framework, just as with any software development, we first needed to look at the requirements. What is unusual in this case is that the framework has two kinds of customers: end users and Rubrik internal developers. Understanding both of them is crucial for the success of this framework.

End Users
To understand end users, we worked together with our wonderful product manager, Siddharth Venkatesh. Our high-level goals were to maximize the users’ ability to help themselves while making our product user-friendly. Our consensus was to give users the minimum information required to understand the problem. That is, every error message we put out there should be purposeful, no more and no less. It should be clear what happened and what the user should do next. Following these principles, we came up with some rules for user-visible error messages:

  • Each error message has an error code that reliably refers to a particular error
  • Each error message has a cause section and an action section, so that users can act on it
  • All error messages are clear and unambiguous, without internal jargon
  • All error messages can be localized into languages other than English

What about the Internal Server Error mentioned earlier? These errors are code bugs. Although we don’t want bugs, unfortunately, they do happen. One principle for anticipating bugs is to error out early. When these bugs happen in the field, only Rubrik support can help fix them. But what should we tell users in these situations?

We decided each internal error would state:

  • One error code, which identifies the place in the code where the error is thrown
  • One incident ID, a UUID that identifies this particular occurrence of the bug
  • A debug comment, which is written only to the debug log and is never visible to users

Users can report the error to Rubrik support with the error code and incident ID. Rubrik support then uses the incident ID to find the details of the bug along with the debug comment.
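As a rough sketch of how such an internal error might be represented in Scala (the class, field, and method names below are illustrative, not our actual API):

import java.util.UUID

// Illustrative sketch: an internal error carries an error code identifying
// where it was thrown, a freshly generated incident ID, and a debug comment
// that is written only to the debug log.
case class InternalServerError(
    errorCode: String,    // identifies the throw site in the code
    debugComment: String  // engineer-facing detail, never shown to users
) extends RuntimeException(debugComment) {
  val incidentId: UUID = UUID.randomUUID()

  // What the user sees: just enough to report the problem to Rubrik support.
  def userMessage: String =
    s"An internal error occurred (error code $errorCode, incident ID $incidentId). " +
      "Please contact Rubrik support."
}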

Internal Developers
What would developers want from an error message framework? It was easier for me to put myself in their shoes since I was one of them.

Developers want simplicity and efficiency so that they can focus on creative projects and avoid repetitive work. They want the framework to handle all common functionality, such as how error messages are sent from one service to another and how they show up in the UI and API.

Developers also want multi-language integration so the same error message can be used natively in different programming languages.

This framework should also close the loopholes that created this mess in the first place, while encouraging good design patterns and preventing anti-patterns. In the end, we hope this framework will promote a customer-oriented culture, where developers view problems more from the user’s angle.

The Framework Design

Anyone who has worked on multi-part systems understands how important it is for different parts to “speak the same language.” The centerpiece of the error message framework is the “language” that different services speak: when an error message generated by one service is sent to another, it should look no different from an error message generated by the receiving service itself. This centerpiece has two simple parts:

  • A YAML format definition for individual error messages
  • A Thrift definition for communication between different services

This is an example of an error message definition in YAML:

- name: AddressUnreachable
  error_code: RBK20700005
  text: Could not reach host '${hostName}'.
  cause: Unable to reach host '${hostName}' using its network address.
  remedy: Make sure the correct network address is used and the
          network is configured properly so that there is a viable
          network route to the host.

In this definition, remedy is mandatory. In addition to satisfying product requirements, this field pushes developers to think from the user’s point of view. Developers need to have a diagnosis and give a prescription instead of just describing a symptom. The Thrift definition has a similar structure.

Together, these two pieces of the design provide a programming-language-neutral mechanism to generate and exchange error messages. The rest of the framework materializes these two pieces into constructs that can be easily used in each programming language and service. This is the heavy lifting of the framework, which includes the following (a sketch of a generated class follows the list):

  • Convert error message definitions in YAML to corresponding classes in Scala, C++, Python, and Golang.
  • Convert error message instances to Thrift structures and vice versa.
  • Serialize error message instances to persist in metadata and vice versa.
  • Show error messages in UI, API, emails, and syslog messages.
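For example, the AddressUnreachable definition above might be turned into a Scala class roughly like the following. This is a hand-written sketch of the idea, not the actual generated code, and the RubrikError trait is an illustrative name:

// Illustrative base trait that all generated error classes could share.
trait RubrikError {
  def errorCode: String
  def params: Map[String, String]
  def text: String
  def cause: String
  def remedy: String
}

// Parameters become typed constructor arguments, so a developer cannot add an
// ad-hoc ${reason}-style parameter without changing the YAML definition.
case class AddressUnreachable(hostName: String) extends RubrikError {
  val errorCode = "RBK20700005"
  val params = Map("hostName" -> hostName)
  val text = s"Could not reach host '$hostName'."
  val cause = s"Unable to reach host '$hostName' using its network address."
  val remedy = "Make sure the correct network address is used and the " +
    "network is configured properly so that there is a viable network " +
    "route to the host."
}

A developer then reports this error simply by constructing AddressUnreachable(hostName), and the framework takes care of Thrift conversion, serialization, and display.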

However, we haven’t addressed the elephant in the room: we already had an extensive event framework. What would be the relationship between the old and new frameworks? Would we replace the event framework with the new error message framework, or should they coexist?

We looked into the event framework. It had some flaws, but it was still effective. Most importantly, events serve a different purpose than error messages. So we decided to keep both the event framework and the error message framework, with the following distinctions:

  • Events are system activities
  • Error messages are reasons why some of the activities happen
  • An event may contain zero or more error messages
  • Events that contain error messages are usually failure, warning and cancellation events
  • An error message must be attached to an event to be shown to users

Solving the Scaling Challenge

The error message framework still uses parameters, which is exactly what went wrong with event parameters. How can we prevent the same problem from happening again?

One approach is to make the error message framework team the reviewer for all error message changes. But as the engineering organization grows, the number of reviews explodes, so this method doesn’t scale.

To address this, we decided to whitelist specific parameters; the error message framework team only needs to review changes to the whitelist itself. Any parameter whose value could be a localizable string is kept off the whitelist.

For example, ingestedBytes, vmName, and restorePath are allowed, but reason, message, and string are not. This process relies on the fact that developers are trained to match parameter content with parameter names: a developer wouldn’t assign a serialized Java exception to a parameter named restorePath.

This solves the scaling problem because, although the number of error messages will keep growing, the number of parameters won’t, since new error messages mostly reuse parameters from existing ones.
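A minimal sketch of how the generation tooling might enforce such a whitelist (the function name and the whitelist contents below are illustrative):

// Illustrative parameter whitelist; in practice the list lives in a file
// whose changes are reviewed by the error message framework team.
val allowedParams: Set[String] =
  Set("ingestedBytes", "vmName", "restorePath", "hostName", "port")

// Fails code generation if an error message definition uses a parameter that
// is not on the whitelist (for example, 'reason', 'message', or 'string').
def checkParams(messageName: String, params: Seq[String]): Unit = {
  val disallowed = params.filterNot(allowedParams.contains)
  require(
    disallowed.isEmpty,
    s"$messageName uses non-whitelisted parameters: ${disallowed.mkString(", ")}"
  )
}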

The Execution Plan

We had now conceived a design. But as we all know, the devil is in the details of execution. We needed to answer how to:

  • prevent the framework from lapsing into chaos
  • make steady progress in adopting the framework
  • cultivate a user-centric culture to ensure the quality of error messages

When In Doubt, Use Code
We are technologists. There is nothing more fulfilling than using code to express our ideas, and in this case, to enforce the rules of the framework.

For example, in the scripts that generate classes from error message definitions, we perform various validation and consistency checks (sketched after the list below), such as:

  • Each message has all required fields
  • No duplicate names or error codes
  • Each error code conforms to the defined format
  • No parameters in the remedy field (so that it can be used as an independent knowledge base article)
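A simplified sketch of what these checks might look like in the generation script (the ErrorDefinition class and the assumed error code format are illustrative):

// Illustrative in-memory form of a parsed YAML definition; parsing into this
// class already guarantees that all required fields are present.
case class ErrorDefinition(
    name: String, errorCode: String, text: String, cause: String, remedy: String)

def validate(definitions: Seq[ErrorDefinition]): Unit = {
  require(definitions.map(_.name).distinct.size == definitions.size,
    "duplicate error message names")
  require(definitions.map(_.errorCode).distinct.size == definitions.size,
    "duplicate error codes")
  definitions.foreach { d =>
    // Assumed format for illustration: 'RBK' followed by eight digits.
    require(d.errorCode.matches("RBK\\d{8}"), s"${d.name}: malformed error code")
    // The remedy field must not contain parameters.
    require(!d.remedy.contains("${"), s"${d.name}: remedy must not contain parameters")
  }
}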

In the different languages that use the framework, we added various restrictions on using the base constructs, such as the generic classes, while adding a set of convenience functions that cater to most use cases. The idea is that developers usually follow the path of least resistance. By making the encouraged patterns convenient, we incentivize developers to use them, and anti-patterns become easier to spot during code review.

For example, when attaching an error message to an event, we need the serializable form of an error message called errorInfo. We provided various functions to convert exceptions and error message instances into errorInfo. While developers can still construct errorInfo from scratch themselves, it is cumbersome to do so and easy to spot. We are also adding lint rules or automatic review comments to flag this case.
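Continuing the AddressUnreachable sketch from earlier, the encouraged path might look roughly like this (ErrorInfo, fromError, and postFailureEvent are hypothetical names, not our exact API):

// Illustrative serializable form of an error message, plus the convenience
// helpers the framework provides around it.
case class ErrorInfo(errorCode: String, params: Map[String, String])

object ErrorInfo {
  def fromError(e: RubrikError): ErrorInfo = ErrorInfo(e.errorCode, e.params)
}

def postFailureEvent(name: String, errorInfo: Option[ErrorInfo]): Unit =
  println(s"event=$name errorInfo=$errorInfo")

// Encouraged pattern: one call converts the typed error into its serializable
// form and attaches it to the failure event.
val info = ErrorInfo.fromError(AddressUnreachable(hostName = "nas01.example.com"))
postFailureEvent("HostConnectionFailed", Some(info))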

Making Progress
Our ultimate goal is to improve the supportability of our product by giving users clear and actionable error messages. The framework’s job is to enable component developers to reach this goal, and that is where the hard work lies. We had an engineering organization with hundreds of engineers, and widespread adoption required a combination of socialization and management skills.

To do this, we first focused on raising internal awareness by introducing the forthcoming framework at an engineering all-hands meeting and later in a tech talk. This allowed us to dive into its design, use cases, and more. We also wrote wikis on how to report user-visible errors, report internal errors, and convert general exceptions. We understood engineers were always under a time crunch but still wanted to understand everything, so we tried to maximize the information density of these talks and wikis.

Each engineering team has a lot on its plate. We knew adoption relied on us defining tangible goals, scope of work, and priorities. We used our intelligence data warehouse plus scripting to generate concrete lists of items for each team to work on, assigned priorities to those items based on available customer data, and measured each team’s progress against these lists.

Constant Attention
A framework needs constant attention; otherwise, it will fall into disrepair. Besides fixing bugs, we need to keep a watchful eye on how people are using it. Did we address all use cases? How can we improve the usability of the framework? When developers struggle to adopt the framework, is it due to the framework’s design? If not, how can we help change their design to better use the framework? Is it still possible for developers to abuse the framework? If so, how do we plug the loopholes?

We do all of this mostly through code reviews. Oftentimes, we coach developers on how to write error messages, not just on how to adopt the framework. Our hope is that these comments will help the engineers who receive them create better error messages, and that the practice will spread virally across the whole engineering organization and cultivate a user-centric culture. If this cultural shift takes hold, we will have gotten much more out of this framework than we initially planned.

Summary

The experience of building a new framework within an established product was not like anything I’ve done before. It required not only technical skills but also management and socialization skills, as its success largely relied on other developers’ cooperation. It would have been much better to build the right framework at the beginning (which we did for many frameworks when building our product). But just as no one is perfect, makeovers are inevitable.

We can apply our experience and learnings to future framework overhauls. But more importantly, we should try our best to build the right framework from the beginning, with all the lessons we have learned over time.

This post is part of our new engineering deep dive series! Check out Understanding Cloud Costs for more under-the-hood engineering stories.