Disillusioning the Magic of the fork System Call
How the kernels implement the fork system call
Unix-like operating systems famously use the fork system call to create a new process. The way this system call behaves in user code can be quite bewildering when you first learn about it. It creates a child process that is a copy of the parent (with a few internal differences). When the system call returns to user space, you may be inside either the child or the parent, and you need to check the return value of fork to find out which. A typical example of using fork is shown below. The child uses the execve system call to run the ls program, while the parent prints a message on stderr, waits for the child, and exits.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ; // Use the existing environment

int main(void) {
    pid_t pid = fork();
    if (pid == -1) {
        // fork failed: no child process was created
        perror("fork failed");
        exit(EXIT_FAILURE);
    } else if (pid == 0) {
        // Child process: execute "ls"
        char *ls_args[] = {"ls", NULL};
        execve("/usr/bin/ls", ls_args, environ);
        // execve only returns if it failed
        perror("execve failed");
        exit(EXIT_FAILURE);
    } else {
        // Parent process: pid holds the child's process id
        fprintf(stderr, "%s\n", "Parent process");
        // Wait for the child process to finish
        if (wait(NULL) == -1) {
            perror("wait failed");
        }
        exit(EXIT_SUCCESS);
    }
}
The confusing part here is: how can we end up in both branches of an if condition at once? And how does a single system call return two values at the same time? Many books don’t go into the details, but it isn’t black magic. We can uncover the explanation by peeling back the layers of abstraction that hide the implementation details, and that’s exactly what we’ll do in this post.
System Call Calling Convention
To understand how things work at the lowest level, we need to get rid of the abstractions in between. Even though the above example calls fork() like an ordinary C function, that function is not the system call itself; it is a wrapper over the fork system call provided by libc.
System calls are implemented inside the kernel and the user space code needs to use specific instructions to ask the kernel to execute a system call. To avoid having to write assembly by hand to invoke system calls, libc provides these wrappers which do the same thing under the hood.
To understand what exactly is happening, and how fork returns two different values to the parent and the child, we have to go below the libc wrapper and look at how system calls are executed at the level of assembly instructions.
Just as there are different calling conventions for ordinary function calls, there is a calling convention for invoking system calls as well. The calling convention dictates how to tell the kernel which system call to execute, how to pass the arguments, and how to provide the return value back to user space.
For x86-64, the Linux kernel has this simple convention:
Put the system call number in the RAX register. The kernel internally has a syscall table that maps each syscall number to the function which implements that system call.
A system call can have up to six arguments, which are passed in the registers RDI, RSI, RDX, R10, R8, and R9 (in that order).
The return value of the system call is stored in RAX.
The following table lists the calling convention for a few other architectures as well.
After setting up the registers, you invoke the syscall instruction, which traps into the kernel so that it can execute the system call.
An example of invoking the exit system call is shown below. It places the syscall number for exit (0x3c) into the rax register. The exit system call takes only one argument, the exit status, which we place into the rdi register as the value 0. Then we invoke the syscall instruction.
movq $0x3c, %rax # exit's syscall number is 0x3c
movq $0, %rdi # exit with status 0
syscall
Refresher on the AT&T Assembly Syntax
I am using the AT&T syntax here. If you are not familiar with this syntax or are hazy on it, let me give a quick refresher.
Most assembly instructions operate on one or more operands. Each operand is one of: a register, a memory address, or an immediate value (a constant).
In x86 assembly, most instructions have two operands: one acts as the source and the other as the destination. In the AT&T syntax, the source operand appears first and the destination operand appears after it.
For instance, consider the instruction movq $0x3c, %rax from the above example. Here, $0x3c is the source operand, an immediate value representing the syscall number, and %rax is the destination register where this value is written.
You don’t need to know more x86 assembly than this for this article, but for the sake of completeness let me add one more point. Often the second operand acts as a source as well as the destination. For instance, in addq %rdi, %rax the value in rdi is added to the value in rax, and the result is written back to rax. There are many more nuances to x86 assembly that we can’t cover here.
An Illustrative Example of Using Fork and Exec
Now, let’s take the same example as we saw in the beginning of this article but rewrite it in assembly.
I’m going to show assembly code using the GAS AT&T syntax. The start of the program is marked with the _start label. A label is declared using the label name followed by a colon. We start the program by executing the fork system call: we place its syscall number into the rax register and execute the syscall instruction.
_start:
movq $0x39, %rax # fork syscall number
syscall
After the syscall, the return value is in the rax register. To determine whether we are inside the child or the parent, we compare rax against 0 using the cmpq instruction. The cmpq instruction updates the flags register to indicate the result of the comparison. Here we only care whether the zero flag was set, which we check using the jz instruction (jump-if-zero). We provide the jump target label as an argument to jz.
cmpq $0, %rax # Compare return value against 0
jz _handle_child
The code at the _handle_child label calls the _exec_ls function, which invokes the execve system call to replace the current program image with that of the ls program. (I’m not showing the code for the _exec_ls function for brevity.)
_handle_child:
callq _exec_ls # call _exec_ls function
On the other hand, if rax is not 0, then we are inside the parent. In that case the jump is not taken; control falls through to the next instruction and continues executing the code for the parent. In this program, we want the parent to print a message on stderr and exit. To do this we call the _print function and then jump to the _exit label. Again, I’m not showing the code for the _print function and the _exit label to save space, but you can see the full code in figure 2 below.
movq $_parent, %rsi # buffer address
movq $15, %rdx # length of buffer
callq _print
jmp _exit # exit
The below figure shows the full program.
If you wish to assemble and execute this program, you can find it here. Just run make to build the program.
Takeaway: The key thing to focus on is that the kernel writes the return value of the system call into the rax register. So right after the syscall instruction, we need to check rax to decide what to do. In the case of fork, we compare rax against 0. In the parent process, rax contains the child process’s pid, while in the child process, rax is 0.
The Implementation of the fork System Call
Now that we have unwrapped the user space layer magic of syscalls, it’s time to look inside the kernel. We will take the assembly program from the previous section and understand how the kernel will fork it when it executes the fork syscall. We will do it one step at a time.
State of the Parent Process Before the fork syscall
Because the fork syscall makes a copy of the parent, it’s necessary that we come up with a picture of the parent’s state right before entering the syscall.
A few things to note in this picture:
I’ve only shown the details of the process that matter for understanding the mechanics of fork’s implementation. A process holds a lot more state that I’ve not shown, such as open file handles, signal handlers, and locks.
I’ve shown the memory of the process, which includes the executable code segment and other data segments such as the stack and heap.
Most importantly, the process has an associated context, which is the current state of the register file. At the time of executing the fork syscall in this particular program, only a few registers are in use. The rax register holds the syscall number, while the rest of the general-purpose registers are unused and I’ve shown their value as 0 (in reality they might not be 0). The rip register is the instruction pointer and points to the next instruction to be executed in program order, which in this case is the syscall instruction.
Now, let’s see what happens inside the kernel while executing the fork syscall.
Duplicating the Parent Process
Upon entering the fork system call, the kernel makes a copy of the parent process. This includes making a copy of the process object, as well as copying the context, the address space, the open file descriptor table, memory mappings, etc.
At this point the two processes are almost identical (a few things are not duplicated for the child process; see the fork man page for details). The state would look something like the following picture after the fork call:
As a result of copying the memory of the parent, the child has the same executable code as the parent. It also receives the same set of register values as the parent. The child’s rip (instruction pointer) points to the same next instruction as the parent’s. This means that upon returning to user space, both will execute the same instruction next: cmpq $0, %rax.
But the kernel is not done yet. It has a few more things to do before it returns to user space. For instance, the kernel allocates a new process id for the child. I will skip the bookkeeping parts. Let’s focus on how the kernel returns from the syscall after all the setup is done. After all, that is the mysterious part about fork: how does it return two values?
Setting up the Return Values and Returning Back to the User Space
As per the standards, the fork system call returns the child’s process id to the parent, and it returns 0 to the child. On x86-64 Linux, the rax register holds the return value of a syscall, so the kernel needs to set the value of the rax register for both processes before returning.
The kernel doesn’t need to do anything special to set up the return value for the parent, because the parent invoked the syscall. The normal syscall exit path copies the return value into the rax register while returning to user space.
However, some work is required for the child, which cannot return via the syscall exit route because it did not initiate the syscall. The kernel explicitly sets the child’s rax register to 0 and then puts the child on the scheduler’s run queue so that the scheduler can schedule it for execution.
Whenever the child process is first scheduled, it starts executing right after the fork syscall. Finding 0 in its rax register, it concludes that it is the child process and proceeds to do child-specific work.
The below figure shows the state of the two processes at this point. I’ve highlighted the few things that have changed by this time in green text.
So, this is how the fork system call works under the hood and gives us the illusion of returning two different values to the same program. In reality, we have two processes with the exact same code and the same rip values but different rax values, which is why the if condition ends up in a different branch depending on which process is executing it.
fork Implementation inside the XV6 Kernel
With that high level mental model of fork, we can check out the implementation of real kernels and see how it fits the above description. We will first look at the implementation of fork inside the xv6 kernel, which is an educational operating system from MIT designed for teaching.
The xv6 kernel implements fork in the file proc.c. The below figure shows and annotates all the code.
This is about as simple as an implementation can be. I’ve annotated all the parts to explain what is happening. You can see how it maps one-to-one to the high-level discussion from the previous section.
Linux’s Implementation of fork()
Let’s get adventurous and also see how fork is implemented in a real-world kernel like Linux. In Linux, the fork syscall is implemented in the file fork.c. The real work is done by the function kernel_clone, which I’ve reproduced in the below image. To keep things simple, I’ve removed a lot of details here; you can read the full code for kernel_clone here.
You can see that this function calls copy_process to make a copy of the current process. The copy_process function is where all the copying happens, as we will see shortly.
After copy_process returns, the child process is fully ready to execute. The pid of the child is captured in the nr variable so that it can be returned to user space. The scheduler is asked to schedule the child for execution via the wake_up_new_task function call, and then the syscall returns. This return is for the parent process. The child starts executing whenever the scheduler assigns it to a CPU.
Now, let’s look at the copy_process function, where the parent process is copied. Again, this function is huge and I cannot reproduce all of it here. Instead, I have trimmed it to show the interesting parts; you can read the full code here.
The crucial part in copy_process is the call to the copy_thread function. This copies the underlying process context from the parent into the child.
As different hardware architectures have different registers and flags to represent the context of a process, the copy_thread function has architecture-specific implementations. We will look at the implementation of copy_thread for x86-64 from the file arch/x86/kernel/process.c.
I’ve highlighted the parts relevant for our discussion here.
In the Linux kernel, current is a macro which gets a handle to the current process.
The struct pt_regs holds the set of register values for a process, which are saved when the process enters the kernel. The function copies the parent’s register values into the child.
After copying the register values, it sets the value of the ax register to 0 for the child, so that when the child is finally scheduled it sees 0 as the return value of fork and knows it is the child.
This is how the fork syscall is implemented in the Linux kernel. The complete fork implementation in Linux is much more complicated than the one in xv6, but that is only because Linux has to support many configuration options and multiple hardware architectures. The crux of the fork implementation remains the same.
Resources
Sample programs shown in this article are present in this GitHub repo with a Makefile for you to easily build and run them.
Summary
In this article, we took a deep dive into the mysteries of the fork system call. We unpacked how fork manages to continue from the same code and yet make both parent and child go their separate ways. Starting from a high-level view with a practical example, we drilled down into the assembly and kernel details to make sense of it all.
We explored the role of registers and system call conventions, showing how the kernel uses them to return distinct values to the parent and the child. By looking at both the xv6 and Linux implementations, we saw real-life examples of how this magic happens inside the kernel.
Now, when you see the fork call and its two different paths, it’s clear how it all works out. Understanding this process makes the behavior far less puzzling.