The Linux/UNIX process model creates a new process by cloning the currently running one with the fork() system call. Subsequently, exec(), or one of its variants, loads a new program image into the newly cloned child process. This approach causes a number of problems in modern systems. Several widely used techniques and APIs try to alleviate them, each with its own quirks and degree of success.
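As a point of reference, the basic fork()/exec() sequence can be sketched in a few lines. This is only an illustration of the model described above; the helper name `run_with_fork_exec` is made up for this example and is not part of any Rubrik API.

```python
import os
import sys


def run_with_fork_exec(argv):
    """Clone the current process, exec argv in the child, return its exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: replace the cloned program image with the requested program.
        try:
            os.execvp(argv[0], argv)
        finally:
            # Reached only if exec failed. Exit immediately so the child never
            # continues running the parent's code.
            os._exit(127)
    # Parent: wait for the child and translate the wait status.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)
```

Everything between fork() and exec() runs in a clone of the parent, which is where the problems discussed in this article come from.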
One of the least quirky techniques is to keep a helper process whose only purpose in life is to give birth to new processes: a surrogate that takes requests from the main process and launches child processes on its behalf. Since this surrogate does nothing else, it has very little complexity and avoids a series of problems.
The Cloud Data Management (CDM) software stack at Rubrik runs on Linux and hosts a large amount of functionality. Many parts of the system have to execute external commands and therefore call fork()/exec() quite often. As a result, CDM increasingly ran afoul of fork()/exec() problems as its functionality and codebase grew. Let's look at a summary of the problems and how they impacted CDM.
Address space copying latency: Traditional fork() creates a copy of the memory image of the calling process, potentially copying large amounts of memory to the child. Linux alleviates the issue by marking memory pages copy-on-write during fork(). Initially, the child and parent share all memory pages, and only the pages that are written to are copied into separate physical pages. However, writes by many concurrent threads in the parent can still trigger page copying until the child calls exec(). The vfork() call was invented as an attempt at a different approach.
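The visible effect of copy-on-write is that each side sees its own private copy of memory once it writes to it. The sketch below only demonstrates that isolation; it does not measure copying cost, and the function name is hypothetical.

```python
import os


def child_write_is_private(size=1 << 20):
    """Fork, let the child overwrite a buffer, and report what each side sees."""
    buf = bytearray(b"p" * size)
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(read_end)
            # The child's writes land in its own copies of the pages.
            buf[:] = b"c" * size
            os.write(write_end, bytes(buf[:1]))
        finally:
            os._exit(0)
    os.close(write_end)
    child_saw = os.read(read_end, 1)
    os.close(read_end)
    os.waitpid(pid, 0)
    # The parent's buffer is untouched by the child's writes.
    return child_saw, bytes(buf[:1])
```

The larger the region written between fork() and exec(), the more pages have to be duplicated.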
Vfork - a quirky fork in the road: With vfork(), the parent and child share the same address space, and the calling thread in the parent is suspended until the child calls exec() or exits. Because the parent's address space is not duplicated, this avoids both the copying latency and the potential out-of-memory problems. However, the child can modify the parent's memory, so it must strictly limit what it does and which library functions it calls before exec() to avoid corrupting the parent. The JVM uses vfork() internally, but that code path is written very carefully and carries dire warnings about races and deadlocks. POSIX added the posix_spawn() API, which glibc implements internally with vfork() semantics. For some fun reading, see here and here.
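For comparison, here is a minimal sketch of launching a program through posix_spawn(), using Python's `os.posix_spawn` wrapper. Whether the call ends up using vfork()-style process creation depends on the C library underneath, so treat that part as an assumption. The helper name `spawn_and_wait` is made up for this example.

```python
import os
import sys


def spawn_and_wait(argv, env=None):
    """Start argv[0] with posix_spawn() and return the child's exit code."""
    environment = dict(os.environ) if env is None else env
    # os.posix_spawn does not search PATH, so argv[0] must be a full path.
    pid = os.posix_spawn(argv[0], argv, environment)
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)
```

The caller never runs its own code in the child here, which removes the need to reason about what is safe to do between vfork() and exec().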
Closing file descriptors causing high latency: File descriptors (FDs) opened in the parent are inherited by the child, and any that the child does not use must be closed so that the new program does not hold on to resources it never needs. These FDs can be marked with the close-on-exec flag so that they are closed automatically when exec() is called, which is easier and more efficient than closing them in a loop. However, the closing itself can cause high latency in some cases, especially when a process has thousands of open FDs pointing to a variety of resources. In Rubrik CDM, a data ingest handler can have a very large number of open FDs pointing to resources on FUSE, NFS, iSCSI, files, loop devices, and so on. In some cases, closing an FD in the child triggers a cache flush in the underlying resource, resulting in exec() latencies many orders of magnitude higher than normal.
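The close-on-exec flag mentioned above is set per descriptor with fcntl(). The sketch below toggles the flag and checks whether a descriptor is still open in a freshly exec'd program. The helper names are hypothetical. Note that Python already marks descriptors it creates as non-inheritable by default; C and C++ programs have to ask for this explicitly.

```python
import fcntl
import os
import sys

# Small program run after exec(): exits 0 if the given FD number is open.
PROBE = (
    "import os, sys\n"
    "try:\n"
    "    os.fstat(int(sys.argv[1]))\n"
    "except OSError:\n"
    "    sys.exit(1)\n"
)


def set_cloexec(fd, enabled=True):
    """Set or clear the FD_CLOEXEC flag on fd."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    if enabled:
        flags |= fcntl.FD_CLOEXEC
    else:
        flags &= ~fcntl.FD_CLOEXEC
    fcntl.fcntl(fd, fcntl.F_SETFD, flags)


def fd_survives_exec(fd):
    """Return True if fd is still open in a program started via fork()+exec()."""
    pid = os.fork()
    if pid == 0:
        try:
            os.execv(sys.executable, [sys.executable, "-c", PROBE, str(fd)])
        finally:
            os._exit(127)  # only reached if exec failed
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
```

The flag only moves the work into exec(); it does not make an expensive close any cheaper.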
Out-of-memory kill issues: The copy-on-write behavior of fork(), coupled with the fd close latency described above, can cause a significant amount of memory to be copied for the child, resulting in out-of-memory problems under extreme load. This would typically cause crashes that result in a broken backup.
All of these problems pointed to the need for a surrogate fork() helper, so that large and complex programs no longer have to deal with the quirks of fork(), vfork(), and exec(). This approach is used or discussed in a few places, such as the Xen API project, the Chromium fork helper, Bugzilla, and Red Hat’s Developer Forum.
These implementations are good, but each is tailored to a single, specific use case. Rubrik CDM needed a more general solution that a wide variety of services could use: a master offload service that handles all external command execution requests. It also needed to be transparent to the various services and apps. This was possible because all Rubrik services, whether written in C++, Scala, Java, or Python, use common executor APIs to execute external commands, which keeps all process handling and output handling in a single place. The master executor, Execd, should also impose minimal overhead and latency, and it must be secure. In particular, privilege escalation must be avoided: the master executor itself runs with superuser privileges, but it must not allow unprivileged processes to run commands as the superuser.
A careful approach was taken to meet all the requirements for a high-performance execution offload service with a combination of techniques not found in any of the aforementioned implementations.
Some aspects of the design are listed below.
This Execd service is a forking service. It forks off a child to handle each client connection received over local UNIX domain sockets.
The client opens UNIX pipes for stdin, stdout, and stderr as required and passes them to the Execd service in the initial request, using the descriptor passing feature of sendmsg()/recvmsg() over UNIX sockets.
It is also possible to duplicate the existing stdout or stderr of the calling client process and pass them to the service so that the child’s output is directed to the same place as the parent like a log file.
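Descriptor passing works by attaching an SCM_RIGHTS ancillary message to an ordinary sendmsg() call on a UNIX domain socket. The following is a minimal sketch of that mechanism, not Execd's actual code; the function names and the limit of three descriptors (stdin, stdout, stderr) are assumptions made for the example.

```python
import array
import os
import socket


def send_with_fds(sock, payload, fds):
    """Send payload plus open file descriptors over a UNIX domain socket."""
    ancillary = [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", fds))]
    return sock.sendmsg([payload], ancillary)


def recv_with_fds(sock, bufsize=4096, max_fds=3):
    """Receive a payload and up to max_fds descriptors sent with it."""
    received = array.array("i")
    ancillary_size = socket.CMSG_LEN(max_fds * received.itemsize)
    data, ancdata, _flags, _addr = sock.recvmsg(bufsize, ancillary_size)
    for level, kind, cdata in ancdata:
        if level == socket.SOL_SOCKET and kind == socket.SCM_RIGHTS:
            # Ignore a trailing partial integer, if any.
            usable = len(cdata) - (len(cdata) % received.itemsize)
            received.frombytes(cdata[:usable])
    return data, list(received)
```

The receiver gets new descriptor numbers that refer to the same open pipes or files as the sender's, so it can hand them to the command it launches and must close its own copies afterwards.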
The messages are simple Type-Length-Value strings with simple parsing functions. To avoid overhead, no RPC mechanism is used. For safety, a bounds-checked buffer abstraction prevents direct access to the message buffer memory.
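To make the message format concrete, here is a small sketch of Type-Length-Value encoding with a bounds-checked reader. The wire layout (1-byte type, 4-byte big-endian length), the tag values, and the class name are assumptions for illustration; the article does not specify Execd's actual format.

```python
import struct

# Hypothetical field tags, chosen only for this example.
TAG_ARG = 1
TAG_CWD = 2

_HEADER = struct.Struct("!BI")  # 1-byte type, 4-byte big-endian length


def encode_tlv(fields):
    """Encode (tag, bytes) pairs as consecutive Type-Length-Value records."""
    return b"".join(_HEADER.pack(tag, len(value)) + value for tag, value in fields)


class BoundedReader:
    """Cursor over a message that refuses to read past the end of the buffer."""

    def __init__(self, data):
        self._view = memoryview(data)
        self._pos = 0

    def remaining(self):
        return len(self._view) - self._pos

    def take(self, count):
        if count < 0 or count > self.remaining():
            raise ValueError("truncated or malformed message")
        chunk = bytes(self._view[self._pos:self._pos + count])
        self._pos += count
        return chunk


def decode_tlv(data):
    """Parse a buffer of TLV records into a list of (tag, value) pairs."""
    reader = BoundedReader(data)
    fields = []
    while reader.remaining():
        tag, length = _HEADER.unpack(reader.take(_HEADER.size))
        fields.append((tag, reader.take(length)))
    return fields
```

Because every read goes through `take()`, a length field that lies about the size of its value is rejected instead of causing an out-of-bounds read.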
There are some interesting techniques used in the Execd service, which parses the exec request message, vfork()s the child, waits for it to complete, and returns the exit status to the client over the control connection. The diagram below illustrates those.
For UNIX sockets, Linux allows a server to query the user credentials of the calling client process with the SO_PEERCRED option of the getsockopt() call. This allows Execd to drop superuser privileges and assume the same user context as the caller before executing the command. If the caller's credentials cannot be determined, the exec request is denied.
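The credential lookup is a single getsockopt() call that returns a `struct ucred` holding the peer's pid, uid, and gid. Here is a minimal Linux-only sketch; the function name is hypothetical, and the privilege drop that would follow is left out.

```python
import os
import socket
import struct

# struct ucred on Linux: pid_t pid; uid_t uid; gid_t gid;
_UCRED = struct.Struct("iII")


def peer_credentials(sock):
    """Return (pid, uid, gid) of the process at the other end of a UNIX socket."""
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, _UCRED.size)
    return _UCRED.unpack(raw)
```

The kernel fills in these values when the connection is established, so the client cannot forge them in the request payload.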
Execd then uses vfork() and exec() to execute the given program. Using vfork() here is safe because Execd is small, simple, and single-threaded. We found that vfork() provides a material performance improvement over fork() even in a small, simple program.
Combining descriptor passing, SO_PEERCRED, and a reasonably safe use of vfork() allowed us to build an exec service that safely handles the exec() requirements of all services with high performance and scalability. A performance comparison graph is shown below. The measurements were taken on two Linux VMs running on the same hypervisor, each allocated 4 cores and 24 GB of RAM and running Ubuntu 16.04 (Xenial). One of the VMs acted as the NFS server, and the other mounted a share from it. The underlying hardware was a 12-core Xeon with 64 GB of RAM running ESXi with SSD storage.
Further improvements
Notice from the graph that the latency of Execd increases rapidly at 16 threads. Our exec helper uses a single thread to accept connections and fork. A pre-forked server design, which keeps a pool of worker processes ready, could handle higher concurrency. Our use cases do not require such high concurrency, so we chose the simpler implementation.
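For readers curious about the pre-forked alternative, the idea can be sketched as follows: the listening socket is created once, and a fixed number of workers are forked up front, each blocking in accept() on the shared socket. This is a generic illustration under assumed names (`start_prefork_pool`, `stop_pool`), not a design that Execd implements.

```python
import os
import signal
import socket


def start_prefork_pool(path, workers, handler):
    """Bind a UNIX socket at path and fork workers that all accept() on it."""
    listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    listener.bind(path)
    listener.listen(64)
    pids = []
    for _ in range(workers):
        pid = os.fork()
        if pid == 0:
            # Worker: make sure SIGTERM terminates us, then serve forever.
            signal.signal(signal.SIGTERM, signal.SIG_DFL)
            try:
                while True:
                    conn, _addr = listener.accept()
                    with conn:
                        handler(conn)
            finally:
                os._exit(1)  # never fall back into the parent's code
        pids.append(pid)
    # The parent does not accept connections itself.
    listener.close()
    return pids


def stop_pool(pids):
    """Terminate and reap all workers."""
    for pid in pids:
        os.kill(pid, signal.SIGTERM)
    for pid in pids:
        os.waitpid(pid, 0)
```

The kernel hands each incoming connection to one of the waiting workers, so no fork() is needed on the request path.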