Disillusioning the Magic of the fork System Call
How the kernels implement the fork system call
Unix-like operating systems famously use the fork system call to create a new process. The way this system call behaves in user code can be quite bewildering when you first learn about it. It creates a child process that is a copy of the parent (with a few internal differences). When the system call returns to user space, you may be inside either the child or the parent, and you need to check the return value of fork to tell which. A typical example of using fork is shown below. The child uses the execve system call to run the ls program, while the parent prints a message on stderr, waits for the child, and exits.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ; // Use the existing environment

int main(void) {
    pid_t pid = fork();
    if (pid == -1) {
        // fork failed: no child was created
        perror("fork failed");
        exit(EXIT_FAILURE);
    } else if (pid == 0) {
        // Child process: execute "ls"
        char *ls_args[] = {"ls", NULL};
        execve("/usr/bin/ls", ls_args, environ);
        // execve only returns if it fails
        perror("execve failed");
        exit(EXIT_FAILURE);
    } else {
        // Parent process: pid holds the child's process ID
        fprintf(stderr, "%s\n", "Parent process");
        // Wait for the child process to finish
        if (wait(NULL) == -1) {
            perror("wait failed");
        }
        exit(EXIT_SUCCESS);
    }
}
The confusing part here is: how can we simultaneously end up in both branches of an if condition? And how does a single system call return two values at the same time? Many books don’t go into the details to explain this. But we can easily uncover the explanation by peeling back the layers of abstraction, and that’s exactly what we’ll do in this post.
TL;DR: After fork, there are two processes running identical code, but each takes a different branch of the if block, i.e. their paths fork after the fork system call. How fork manages to hand a different return value to each of them requires digging deeper.
System Call Calling Convention
Understanding how things work at the lowest level requires getting rid of the abstractions in between. Even though the example above calls fork() like an ordinary function, that function is not the system call itself; it is a wrapper around the fork system call provided by libc.
System calls are implemented inside the kernel and the user space code needs to use specific instructions to ask the kernel to execute a system call. To avoid having to write assembly by hand to invoke system calls, libc provides these wrappers which do the same thing under the hood.
To understand what exactly is happening and how fork returns two different return values to the parent and child process we have to go below the libc wrapper and understand how system calls are executed at the level of assembly instructions.
Just as there are different calling conventions for ordinary function calls, there is a calling convention for invoking system calls as well. The calling convention dictates how to tell the kernel which system call to execute, how to pass the arguments, and how to provide the return value back to user space.
For x86-64, the Linux kernel uses this simple convention:
- Put the system call number in the RAX register. The kernel internally has a syscall table where it maps each syscall number to the function which implements that system call.
- In Linux, a system call can have up to six arguments, which are passed in the registers RDI, RSI, RDX, R10, R8, R9 (in that order).
- The return value of the system call is stored in RAX.
The following table lists the calling convention for a few other architectures as well.

After setting the registers, you need to invoke the syscall instruction, which causes a trap into the kernel, bringing the kernel into action to execute the system call.
An example of invoking the exit system call is shown below. It places the syscall number for exit (0x3c) into the rax register. The exit system call takes only one argument, the exit status, so we place the value 0 into the rdi register and then invoke the syscall instruction.
movq $0x3c, %rax # exit's syscall number is 0x3c
movq $0, %rdi # exit with status 0
syscall
Refresher on the AT&T Assembly Syntax
I am using the AT&T syntax here. If you are not familiar with this syntax or are hazy on it, let me give a quick refresher.
Most assembly instructions operate on one or more operands. Each operand is one of: a register, a memory address, or an immediate value (a constant).
In X86 assembly, most instructions have two operands, one acts as the source and the other as the destination. In the AT&T syntax, the source operand appears first and destination operand appears after that.
For instance, consider the instruction movq $0x3c, %rax from the above example. Here, $0x3c is the source operand, an immediate value representing the syscall number, and %rax is the destination register where this value is written.
You don’t need to know more x86 assembly than this for this article, but for the sake of completeness let me add one more point. Often the second operand acts as a second source as well as the destination. For instance, in addq %rdi, %rax the value in rdi is added to the value in rax, and rax is overwritten with the result. There are many more nuances to x86 assembly that we can’t cover here.
An Illustrative Example of Using Fork and Exec
Now, let’s take the same example as we saw in the beginning of this article but rewrite it in assembly.
I’m going to show assembly code using the GAS AT&T syntax. The start of the program is marked with the _start label. A label is declared by writing its name followed by a colon. We start the program by executing the fork system call: we place its syscall number into the rax register and execute the syscall instruction.
_start:
movq $0x39, %rax # fork syscall number
syscall
After the syscall, the return value is placed in the rax register. To determine whether we are inside the child or the parent, we need to compare rax against 0. We do that using the cmpq instruction, which updates the flags register to indicate the result of the comparison. In this case we are only interested in whether the zero flag was set. We can check that using the jz instruction (short for jump-if-zero), passing the jump target label as its argument.
cmpq $0, %rax # Compare return value against 0
jz _handle_child
The code at the _handle_child label calls the _exec_ls function, which invokes the execve system call to replace the current program image with that of the ls program. (I’m not showing the code for the _exec_ls function, for brevity.)
_handle_child:
callq _exec_ls # call _exec_ls function
On the other hand, if rax is not 0, then we are inside the parent. In that case the jump is not taken; instead, control falls through to the next instruction and continues executing the code for the parent. In this program, we want the parent to print a message on stderr and exit. To do this we call the _print function to print a message on stderr, and after that we jump to the _exit label to exit. Again, I’m not showing the code for the _print function and the _exit label to save space, but you can see the full code in figure 2 below.
movq $_parent, %rsi # buffer address
movq $15, %rdx # length of buffer
callq _print
jmp _exit # exit
The below figure shows the full program.

If you wish to assemble and execute this program, you can find it here. Just run make to build the program.
Takeaway: The key thing to focus on is that the return value of the system call is written into the rax register by the kernel. So right after the syscall instruction, we need to check the rax register to decide what to do. In the case of fork, we compare the rax register against 0. In the parent process, the rax register contains the child process’s pid, while in the child process, the rax register is 0.
The Implementation of the fork System Call
Now that we have unwrapped the user space layer magic of syscalls, it’s time to look inside the kernel. We will take the assembly program from the previous section and understand how the kernel will fork it when it executes the fork syscall. We will do it one step at a time.