Disillusioning the Magic of the fork System Call

How the kernels implement the fork system call

Nov 26, 2024

Unix-like operating systems famously use the fork system call to create a new process. The way this system call behaves in user code can be quite bewildering when you first learn about it. It creates a child process that is a copy of the parent (with a few internal differences). When the system call returns to user space, you may be inside either the child process or the parent, and you need to check the return value of fork to tell which. A typical example of using fork is shown below. The child uses the execve system call to run the ls program, while the parent prints a message on stderr, waits for the child, and exits.

#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ; // Use the existing environment

int main(void) {
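    pid_t pid = fork();

    if (pid < 0) {
        // fork failed: no child process was created
        perror("fork failed");
        exit(EXIT_FAILURE);
    } else if (pid == 0) {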
        // Child process: execute "ls"
        char *ls_args[] = {"ls", NULL};
        execve("/usr/bin/ls", ls_args, environ);

        // If execve fails
        perror("execve failed");
        exit(EXIT_FAILURE);
    } else {
        // Parent process
        fprintf(stderr, "%s\n", "Parent process");

        // Wait for the child process to finish
        if (wait(NULL) == -1) {
            perror("wait failed");
        }

        exit(EXIT_SUCCESS);
    }
}

The confusing part here is: how can we simultaneously end up in both branches of an if condition? And how does a single system call return two values at the same time? Many books don’t explain this in detail, but we can uncover the explanation by peeling back the layers of abstraction, and that’s exactly what we’ll do in this post.

TL;DR: In reality, after fork we have two processes running identical code, and each takes a different branch of the if block, i.e. they fork their paths after the fork system call. How fork hands a different return value to each of them, however, requires digging deeper.

System Call Calling Convention

To understand how things work at the lowest level requires getting rid of the abstractions in between. Even though the above example calls fork() like an ordinary C function, fork() is not the system call itself; it is a wrapper around the fork system call, provided by libc.

System calls are implemented inside the kernel and the user space code needs to use specific instructions to ask the kernel to execute a system call. To avoid having to write assembly by hand to invoke system calls, libc provides these wrappers which do the same thing under the hood.

To understand what exactly is happening and how fork returns two different return values to the parent and child process we have to go below the libc wrapper and understand how system calls are executed at the level of assembly instructions.

Just as there are different calling conventions for ordinary function calls, there is a calling convention for invoking system calls as well. The calling convention dictates how to tell the kernel which system call to execute, how to pass the arguments, and how to provide the return value back to user space.

For x86-64, the Linux kernel has this simple convention:

Put the system call number in the RAX register. The kernel internally has a syscall table which maps each syscall number to the function that implements that system call.
In Linux, a system call can have up to six arguments, which are passed in the registers RDI, RSI, RDX, R10, R8, and R9 (in that order).
The return value of the system call is stored in RAX.

The following table lists the calling convention for a few other architectures as well.

Figure-1: Linux’s system call calling convention for arm, arm64, x86 and x86_64. Source: https://chromium.googlesource.com/chromiumos/docs/+/master/constants/syscalls.md

After setting the registers, you execute the syscall instruction, which traps into the kernel so that it can carry out the system call.

An example of invoking the exit system call is shown below. It places the syscall number for exit (0x3c) into the rax register. The exit system call takes only one argument which is the exit status that we place into the rdi register as the value 0, and then we invoke the syscall instruction.

    movq $0x3c, %rax # exit's syscall number is 0x3c
    movq $0, %rdi # exit with status 0
    syscall

Refresher on the AT&T Assembly Syntax

I am using the AT&T syntax here. If you are not familiar with this syntax or are hazy on it, let me give a quick refresher.

Most assembly instructions operate on one or more operands. These operands are provided as one of: registers, memory addresses, or immediate values (constants).
In x86 assembly, most instructions have two operands; one acts as the source and the other as the destination. In the AT&T syntax, the source operand appears first and the destination operand appears after it.
For instance, consider the instruction movq $0x3c, %rax from the above example. Here, $0x3c is the source operand, an immediate value representing the syscall number, and %rax is the destination register where this value is written.
You don’t need to know more x86 assembly than this for this article, but for the sake of completeness let me add one more point. Often the second operand acts as a source as well as the destination. For instance, in addq %rdi, %rax the value in rdi is added to the value in rax, and the result is written back to rax. There are a lot more nuances to x86 assembly that we can’t cover here.

An Illustrative Example of Using Fork and Exec

Now, let’s take the same example as we saw in the beginning of this article but rewrite it in assembly.

I’m going to show assembly code using the GAS AT&T syntax. The start of the program is marked with the _start label. A label is declared using the label name, followed by a colon. We start the program by executing the fork system call; to do that, we place its syscall number into the rax register and execute the syscall instruction.

_start:
    movq $0x39, %rax # fork syscall number
    syscall

After the syscall, the return value is placed in the rax register. To determine whether we are inside the child or the parent, we need to compare rax against 0. We do that using the cmpq instruction, which updates the flags register to indicate the result of the comparison. In this case we are only interested in whether the zero flag was set, which we can check using the jz instruction (short for jump-if-zero). We provide the jump target label as an argument to jz.

    cmpq $0, %rax # Compare return value against 0
    jz _handle_child

The code at the _handle_child label calls the _exec_ls function, which invokes the execve system call to replace the current program image with that of the ls program. (I’m not showing the code for the _exec_ls function for brevity.)

    _handle_child:
        callq _exec_ls # call _exec_ls function

On the other hand, if rax is not 0, then we are inside the parent. In that case the above jump is not taken; instead, control falls through to the next instruction and continues executing the code for the parent. In this program, we want the parent to print a message on stderr and exit. To do this, we call the _print function to print the message and then jump to the _exit label to exit. Again, I’m not showing the code for the _print function and the _exit label to save space, but you can see the full code in figure-2 below.

    movq $_parent, %rsi # buffer address
    movq $15, %rdx # length of buffer
    callq _print
    jmp _exit # exit

The figure below shows the full program.

Figure-2: A simple X86-64 assembly program to show the usage of fork and execve system calls. The child process uses execve system call to execute ls, while the parent process prints a message on stderr and exits.

If you wish to assemble and execute this program, you can find it here. Just run make to build the program.

Takeaway: The key thing to focus on is that the return value of the system call is written into the rax register by the kernel. So right after the syscall instruction, we need to check the rax register to decide what to do. In the case of fork, we compare the rax register against 0. In the parent process, rax contains the child process’s pid, while in the child process, rax is 0. (If fork fails, no child is created and rax holds a negative error code.)

The Implementation of the fork System Call

Now that we have unwrapped the user space layer magic of syscalls, it’s time to look inside the kernel. We will take the assembly program from the previous section and understand how the kernel will fork it when it executes the fork syscall. We will do it one step at a time.