Motivation
Rubrik’s fileset team is responsible for developing backup solutions for NAS filesystems (NFS and SMB) and filesystems on Windows and various Unix flavors. The fileset backup solution is one of the most popular Snappables—Rubrik’s term for data source types for backup—used by Rubrik customers. However, as the install base increases, there is an increased need for customer support. This could potentially lead to engineers spending more time troubleshooting and fixing issues raised by customers than advancing product development. To increase engineering productivity, we looked into how to improve troubleshooting efficiency.
When an engineer encounters an issue flagged by a customer, the first step is to figure out what exactly is wrong. An issue described by a customer usually contains superficial symptoms—e.g. this does not work, or this is slow—but an engineer may have to investigate many internal features and components to find out where the root problem is. In reality, the investigation process is usually not as efficient as we hope. Here are several common reasons:
As a product grows over time, engineers specialize in particular areas of it. As a result, the engineer who picks up an issue may not be equipped to troubleshoot the component involved.
Documentation of features and components is not always accessible. As businesses grow over time, much of this information is tribal knowledge, so engineers have to ask around, which increases time to resolution.
Even with documentation, engineers still need to manually carry out documented steps to complete an investigation.
Even if certain components have scripts that automate some investigation steps, these scripts often focus on a narrow area or feature. As systems grow more complex, memorizing dozens of such small scripts and their usage becomes cumbersome.
How we addressed this at Rubrik
Once we decided that we needed an easy-to-use tool suite to help troubleshoot, we started with a thorough analysis of issues raised by customers. This helped us uncover which areas are most vulnerable and which to prioritize.
Two important requirements for the tool suite are that it works across many releases, since different customers may run different releases, and that it does not modify the state of the system. For example, the scripts perform only read-only operations against the Rubrik system, so it is safe to run them at any time. We also keep the tool’s resource usage in mind so that it does not take up too much memory or CPU.
Developing this was an iterative process: we first developed some necessary libraries, then a handful of scripts that analyze a specific fileset job or a specific file in a fileset. Eventually, development became an ongoing process in which new requirements are created from new issues and addressed as needed.
Here is an ever-growing list of what the tool suite can query and analyze:
All database tables related to the fileset jobs
System level information, like cluster/node/process level information
Fileset job information, including job progress, failures, etc.
Configuration related to fileset jobs, with non-standard settings flagged
Logs
Index for specific fileset backups
Snapshots of specific fileset backups
For all of the analysis above, the tool suite generates an easy-to-read printout in which potential problems are printed differently from normal information. Because the output has a fixed format, we can also use the tool suite to generate a report when a job fails and upload that report to the cloud for further automated analysis.
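To make the idea of a fixed-format printout concrete, here is a minimal sketch in Python. It is not Rubrik's implementation: the `Finding` and `Severity` types, the `check_config` helper, the `!!` marker, and the setting names in the usage note are all hypothetical, chosen only to show how one list of findings can drive both a human-readable printout and a machine-readable report.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    INFO = "INFO"  # normal information
    WARN = "WARN"  # potential problem, printed differently


@dataclass(frozen=True)
class Finding:
    section: str  # e.g. "config", "job", "logs" (hypothetical section names)
    key: str
    value: str
    severity: Severity = Severity.INFO


def check_config(actual: dict, defaults: dict) -> list[Finding]:
    """Read-only check: flag settings that differ from their defaults."""
    findings = []
    for key in sorted(actual):
        value = actual[key]
        if key in defaults and value != defaults[key]:
            detail = f"{value} (default {defaults[key]})"
            findings.append(Finding("config", key, detail, Severity.WARN))
        else:
            findings.append(Finding("config", key, str(value)))
    return findings


def render_line(finding: Finding) -> str:
    """One fixed-width line per finding; warnings get a leading '!!'."""
    marker = "!!" if finding.severity is Severity.WARN else "  "
    return (
        f"{marker} [{finding.severity.value:<4}] "
        f"{finding.section:<8} {finding.key:<24} {finding.value}"
    )


def render_report(findings: list[Finding]) -> str:
    """Human-readable printout for the terminal."""
    return "\n".join(render_line(f) for f in findings)


def to_json(findings: list[Finding]) -> str:
    """Same findings, serialized for upload and automated analysis."""
    rows = [
        {
            "section": f.section,
            "key": f.key,
            "value": f.value,
            "severity": f.severity.value,
        }
        for f in findings
    ]
    return json.dumps(rows, sort_keys=True)
```

Calling `render_report(check_config({"max_threads": 16}, {"max_threads": 8}))` would print a single line starting with `!! [WARN]`. Because every check returns the same `Finding` shape, a new check can be added without touching the rendering or upload code.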
In practice, we encourage support engineers to always run the tool suite before they officially file issues for dev engineers to investigate. By doing so, some problems can be addressed by support engineers earlier on. Then, when the issue is transferred to dev engineers, dev engineers have all necessary information to start the investigation.
Summary
Here are the main principles we learned for effectively developing such a support tool suite:
Good visibility into the whole system and its related components is usually the first and most important step towards discovering the root cause of an issue, so the tool suite should first focus on making that visibility easy to get.
Good documentation for troubleshooting common issues is necessary to avoid repeatedly wasting engineers’ time. Keep the documentation up to date. Doing so saves time not only for other engineers but also for feature owners, who no longer have to answer the same questions again and again.
Even better than good documentation are easy-to-use scripts that automate troubleshooting. Once good step-by-step documentation is available, make an effort to write new scripts or enhance existing ones to carry out these steps. This reduces the time new team members spend reviewing the documents.
Make an effort to develop a limited number of scripts whose usage is easy to remember. Too many small scripts created in the previous step can become a learning burden for engineers.
Create a process that periodically examines all issues raised by customers and identifies which documents and scripts can be improved to help troubleshoot similar issues in the future. Spend time improving those documents and scripts. This consistent effort will eventually pay off. Assign dedicated resources to this effort if needed.
Make documentation and troubleshooting-script work part of the official development process.
Always keep improving the tool suite to add missing pieces and support for new features. While analyzing issues raised by customers, we discovered that many are repeat occurrences of known issues. However, identifying the original issue by hand is time consuming, so try using automated scripts that examine system and job states and match them against known issue signatures. As one final piece of advice, extract the common libraries and encourage other Snappable teams to write their own scripts on top of them.
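The signature matching suggested in the last principle could be sketched as follows. This is an illustration under assumptions, not the tool suite's actual design: it assumes a signature is a job status plus a log pattern, and the `IssueSignature` type, the `match_known_issues` function, and the issue IDs in the usage note are hypothetical.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class IssueSignature:
    issue_id: str                # hypothetical tracker ID for the original issue
    job_status: str              # job state in which this issue shows up
    log_pattern: re.Pattern[str]  # compiled regex to look for in the job's logs


def match_known_issues(
    job_status: str,
    log_lines: Iterable[str],
    signatures: Iterable[IssueSignature],
) -> list[str]:
    """Return IDs of known issues whose status and log pattern both match."""
    lines = list(log_lines)  # materialize once so every signature sees all lines
    matched = []
    for sig in signatures:
        if sig.job_status != job_status:
            continue
        if any(sig.log_pattern.search(line) for line in lines):
            matched.append(sig.issue_id)
    return matched
```

For example, given a signature `IssueSignature("KNOWN-001", "FAILED", re.compile(r"stale file handle", re.IGNORECASE))`, a failed job whose logs contain `Stale file handle` would be reported as a likely repeat of `KNOWN-001`. Keeping signatures as plain data means support engineers could add new ones without changing the matching code.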