<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Confessions of a Code Addict]]></title><description><![CDATA[Deep dives into compilers, performance optimization, Linux internals, and low-level programming. For engineers who love understanding systems at a fundamental level.]]></description><link>https://blog.codingconfessions.com</link><image><url>https://substackcdn.com/image/fetch/$s_!lstI!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png</url><title>Confessions of a Code Addict</title><link>https://blog.codingconfessions.com</link></image><generator>Substack</generator><lastBuildDate>Thu, 14 May 2026 05:33:42 GMT</lastBuildDate><atom:link href="https://blog.codingconfessions.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Abhinav Upadhyay]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[codeconfessions@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[codeconfessions@substack.com]]></itunes:email><itunes:name><![CDATA[Abhinav Upadhyay]]></itunes:name></itunes:owner><itunes:author><![CDATA[Abhinav Upadhyay]]></itunes:author><googleplay:owner><![CDATA[codeconfessions@substack.com]]></googleplay:owner><googleplay:email><![CDATA[codeconfessions@substack.com]]></googleplay:email><googleplay:author><![CDATA[Abhinav Upadhyay]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Virtual Memory: A Deep Dive into Page Tables, TLBs, and Linux Internals]]></title><description><![CDATA[From page faults to NUMA topology: how the Linux kernel manages memory, and what that 
means for the performance of data-intensive systems.]]></description><link>https://blog.codingconfessions.com/p/virtual-memory</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/virtual-memory</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Sun, 10 May 2026 17:51:01 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/0edbe6ee-f157-4f4b-8491-a338b1528fdf_1672x941.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="pullquote"><p>A quick note before we begin: I&#8217;ve been absent here for a while. Life happened, and I had to step away from publishing for longer than I expected.</p><p>This article is my way of getting back into rhythm. It is much larger than my usual pieces: roughly 25,000 words, compared to the 4,000&#8211;6,000 words I normally publish. I have been working on it for the last couple of months, and it is closer to a short book than a regular article.</p><p>Since this is a book-length deep dive, I have also prepared a beautifully typeset 60-page PDF version for readers who want to read it offline, highlight it, or keep it as a reference. Buying the PDF is also a direct way to support the work that went into this piece.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://codingconfessions.gumroad.com/l/iuqsqm&quot;,&quot;text&quot;:&quot;Get PDF&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://codingconfessions.gumroad.com/l/iuqsqm"><span>Get PDF</span></a></p><p>Thanks for sticking around. 
Now let&#8217;s get into virtual memory.</p></div><p>Virtual memory is one of those fundamental components of modern-day computing that is crucial to master for building and debugging high-performance data-intensive systems.</p><p>Normally, we think of virtual memory as a system that provides memory-level isolation to processes, which means that the operating system (OS) can run multiple processes concurrently without those processes interfering with or corrupting each other&#8217;s data in memory. But virtual memory does much more than that. It also enables:</p><ul><li><p>lazy allocation of memory through demand paging</p></li><li><p>copy-on-write for shared memory between processes, and fast process creation via fork</p></li><li><p>file I/O that avoids the page-cache-to-user-buffer copy using mmap</p></li><li><p>page reclaim, swap, and the page cache</p></li><li><p>performance effects from access patterns, huge pages, TLB shootdowns, and NUMA placement.</p></li></ul><p>This article is a broad, practical coverage of what virtual memory is, how it works, and how it affects the performance of data-intensive systems.
By the end of the article you will have a mental model and understanding of the following key ideas:</p><ul><li><p><strong>Why virtual memory exists</strong>: Process isolation, memory protection, and the illusion of abundant memory.</p></li><li><p><strong>The virtual address space</strong>: How a process&#8217;s memory is organized into segments (code, data, heap, stack, and memory-mapped regions).</p></li><li><p><strong>Address translation</strong>: How virtual addresses are converted to physical addresses using hierarchical page tables, and why the page table hierarchy avoids wasting memory.</p></li><li><p><strong>The role of hardware</strong>: How the MMU and TLB accelerate address translation, and why TLB hit rates matter for performance.</p></li><li><p><strong>Demand paging</strong>: How the kernel delays physical memory allocation until pages are actually accessed, and how page faults drive this lazy allocation.</p></li><li><p><strong>Memory types and reclaim</strong>: How anonymous, file-backed, shared, and tmpfs-backed pages differ, and why the kernel reclaims them differently.</p></li><li><p><strong>Copy-on-write</strong>: How processes share memory efficiently and how fork creates new processes almost instantly.</p></li><li><p><strong>Memory-mapped I/O</strong>: How <code>mmap</code> maps file data into a process address space, avoids an extra user-buffer copy, and enables shared memory between processes.</p></li><li><p><strong>Performance implications</strong>: How page size, TLB reach, and memory access patterns affect the performance of data-intensive workloads.</p></li><li><p><strong>Observability</strong>: How to inspect VMAs, RSS/PSS, page faults, TLB behavior, and NUMA placement on Linux.</p></li></ul><div><hr></div><h2>How to Read This Article</h2><p>This article takes a different approach to teaching virtual memory.
Instead of presenting a collection of facts and definitions, we explain concepts through a narrative: a series of dialogues between a newly created process named <strong>Alloca</strong> and the <strong>Kernel</strong>. Alloca encounters challenges as she executes her code, and the Kernel explains how things work in response to her questions. This dialogue-based format allows us to build understanding incrementally, introducing complexity gradually as natural questions arise.</p><p><strong>Structure</strong>: Each section follows the same pattern: a dialogue that explores a concept in depth, followed by a <strong>Key Takeaway</strong> box that provides a formal summary, definitions, and technical details. If you prefer a quick overview, you can read just the Key Takeaway sections. If you want deep understanding, read the full dialogues.</p><p><strong>Length and Pacing</strong>: This article is comprehensive: approximately 25,000 words, covering everything from basic address translation to demand paging, page reclaim, copy-on-write, observability, and performance implications. Don&#8217;t feel obligated to read it in one sitting. Virtual memory is a complex topic with many interconnected pieces. Take your time, read it in multiple sessions, and let the concepts sink in. Each section builds on previous ones, so it&#8217;s designed to be read sequentially. Also, if you have taken a course in operating systems, the early parts of the article may seem a bit too basic to you. I encourage you to jump forward and directly read the parts that interest you; there is quite a lot of advanced content as well.</p><p><strong>Implementation Details</strong>: Virtual memory concepts are largely universal across operating systems, but when we discuss specific implementation details, such as huge pages, TLB behavior, or page fault handling, those details are based on the Linux kernel and x86-64 architecture.
Also, throughout the article we will talk about 4-level page tables, which are still the prevalent configuration in most kernels. The latest Linux kernels also support 5-level page tables, but once you understand how 4-level page tables work, extending the model to 5 levels is trivial.</p><p><strong>Asides</strong>: While most of the article follows a narrative style of a dialogue between Alloca and the Kernel, there are certain additional details that I&#8217;ve sprinkled throughout the article in the form of asides.</p><p>Now, let&#8217;s meet Alloca and follow her journey through the virtual memory system.</p><div><hr></div><h2>The Need for Virtual Memory</h2><p><em>As Alloca starts to execute her code, she encounters her first challenge. She needs to read some data from memory. The instruction contains the address of the data and Alloca thinks, &#8220;well, this shouldn&#8217;t be too difficult. I just need to go to this address and read the value.&#8221; But she is in for a huge surprise.</em></p><p><em>As she goes to that address, she finds that there is nothing there. It&#8217;s all just a facade. She stands there puzzled, wondering what she should do now. Then she sees a tall figure moving towards her from the shadows.</em></p><blockquote><p><strong>Alloca</strong>: &#8220;Who are you?&#8221;</p><p><strong>Kernel</strong>: &#8220;I&#8217;m the Kernel. I&#8217;m in charge of this entire world; I make sure that all processes do their job smoothly. What are you doing here? There is nothing at this place!&#8221;</p><p><strong>Alloca</strong>: &#8220;I think I&#8217;m lost. I was supposed to read data from this address but it looks like it is all a facade, and I don&#8217;t know what to do now.&#8221;</p><p><strong>Kernel</strong> (smiling): &#8220;I can understand the confusion. The address that you have is not a real address, it&#8217;s a virtual address.&#8221;</p><p><strong>Alloca</strong>: &#8220;Virtual address?
What does that mean?&#8221;</p><p><strong>Kernel</strong>: &#8220;Well, what you think of as memory is not the real physical memory, it is virtual memory. And, the address that you hold is a <em>virtual address</em>. What you need is the physical address to get the data from physical memory.&#8221;</p><p><strong>Alloca</strong>: &#8220;What is virtual memory? Why not just give me direct access to physical memory?&#8221;</p><p><strong>Kernel</strong>: &#8220;Let&#8217;s think about it from first principles. I am responsible for the concurrent execution of not just you but hundreds of other processes. You might not notice, but right now there are many other processes executing alongside you. If each one of you had direct access to physical memory, how would you coordinate who accesses which addresses in memory?&#8221;</p><p><strong>Alloca</strong>: &#8220;That would be difficult because I don&#8217;t even know who else is executing, and I imagine processes come and go, so this would be impossible.&#8221;</p><p><strong>Kernel</strong>: &#8220;Yes, that&#8217;s one problem. Even if you could talk to other processes, it would make the system extremely slow, because then on every memory access you would have to ask every process which addresses are available to use. And, it would also be a safety nightmare. A trivial bug in one process might corrupt another process&#8217;s data.&#8221;</p><p><strong>Alloca</strong>: &#8220;I can see the problem. So how do you solve this?&#8221;</p><p><strong>Kernel</strong>: &#8220;Through virtual memory! Basically, we have two problems to solve. First, every process should be able to access memory without needing to worry if an address is in use by another process.
Second, memory access should be safe without sacrificing performance.&#8221;</p><p><strong>Alloca</strong>: &#8220;So, how does virtual memory solve these problems?&#8221;</p><p><strong>Kernel</strong>: &#8220;Virtual memory is a software construct: it looks and feels like real memory, and it consists of addresses that you can read and write. I give every process its own private virtual memory space that it can freely navigate and manipulate without worrying about anyone else using that memory. This solves the first problem: it isolates memory for each process.&#8221;</p><p><strong>Alloca</strong>: &#8220;But if these addresses aren&#8217;t real, then where do the reads and writes go? And, how is safety ensured?&#8221;</p><p><strong>Kernel</strong>: &#8220;That part requires going into the weeds of how virtual memory works, but I will simplify for now. Because virtual memory is an abstraction, it can be controlled by me. So, I map the set of virtual addresses used by a process to a corresponding set of physical addresses. And, because I know which other processes are using which parts of physical memory, I can ensure that no two processes end up sharing the same physical addresses.&#8221;</p></blockquote><div class="callout-block" data-callout="true"><h3><strong>Key Takeaway</strong></h3><p>The fundamental reason for virtual memory to exist is to provide memory-level isolation to processes. In a multitasking system where multiple processes can be running in parallel or in a time-shared manner, it is important that they don&#8217;t read or write each other&#8217;s data. By giving each process its own private virtual memory, the kernel ensures this never happens. Each process believes that it has full access to the entire physical memory, but in reality, it&#8217;s just virtual memory. Behind the scenes, the virtual memory is mapped to physical memory, and every process has a different mapping.
Let&#8217;s learn how this mapping works in the next part.</p><h5><strong>A note on narrative accuracy</strong>: </h5><h5>In the scene above, Alloca consciously walks to an address and notices it&#8217;s a facade. That&#8217;s not literally how a process experiences memory. In reality, memory accesses are intercepted transparently by dedicated hardware (the MMU) and the Kernel; the process never notices any of this. But explaining that accurately requires understanding the MMU, page tables, and how the Kernel handles memory events, none of which we&#8217;ve covered yet. Starting there would be like defining a word by using the word itself. This is why we started with a simplified model. As we progress through the sections, we will gradually make our mental model more precise and accurate.</h5></div><div><hr></div><h2>Size of Virtual Memory</h2><p><em>Alloca now understands why virtual memory exists, but she still doesn&#8217;t understand how it works and what it looks like. Her conversation with the Kernel continues.</em></p><blockquote><p><strong>Alloca</strong>: &#8220;If this memory that I see is virtual, does it mean that it is infinite?&#8221;</p><p><strong>Kernel</strong>: &#8220;Not quite infinite, but very large. Tell me, what do you know about how addresses are represented in the CPU?&#8221;</p><p><strong>Alloca</strong>: &#8220;Well, I know that on x86-64 systems, addresses are stored in 64-bit registers. So I suppose that means I can address 2<sup>64</sup> bytes?&#8221;</p><p><strong>Kernel</strong>: &#8220;That&#8217;s what you&#8217;d expect, right? But there is a twist: while your addresses are indeed stored in 64-bit registers, not all those bits are actually used for addressing. Only 48 bits participate in the address translation.&#8221;</p><p><strong>Alloca</strong>: &#8220;Why only 48 bits?&#8221;</p><p><strong>Kernel</strong>: &#8220;It&#8217;s a pragmatic decision.
Think about it: 48 bits gives you 2<sup>48</sup> bytes of addressable space, which is 256 TiB. That&#8217;s enormous! No application today needs anywhere close to that. The hardware designers decided that this was plenty for the foreseeable future, so they kept the address translation logic simpler by using 48 bits instead of the full 64. They left room to expand to 52 or 56 bits later if needed.&#8221;</p><p><strong>Alloca</strong>: &#8220;So I have 256 TiB of virtual address space? That is huge! Can I use all of it?&#8221;</p><p><strong>Kernel</strong>: &#8220;Ah, not quite. You can use only half of that, which is 128 TiB. I use the upper 128 TiB of that address space to map my own code and data into every process&#8217;s memory.&#8221;</p><p><strong>Alloca</strong>: &#8220;You&#8217;re in my address space?&#8221;</p><p><strong>Kernel</strong>: &#8220;I have to be! When you make a system call or when an interrupt happens, execution switches to kernel mode and starts running my code. If my code wasn&#8217;t already mapped in your address space, the CPU wouldn&#8217;t know where to jump to. So yes, I live in the upper half of every process&#8217;s address space. You can&#8217;t access my memory directly, but it&#8217;s there, ready for when execution needs to enter kernel mode.&#8221;</p><p><strong>Alloca</strong>: &#8220;Okay, but how does such a huge virtual address space work when most machines have a very small amount of memory installed, like 16 or 32 GB?&#8221;</p><p><strong>Kernel</strong>: &#8220;That&#8217;s the beauty of virtual memory. Your virtual address space is completely independent of how much physical RAM is installed. Even if this machine has only 16 GB of RAM, your virtual address space still spans 256 TiB. The mapping from virtual to physical is where the two worlds connect, and that is managed by me.
I take great care that these mappings remain within the limits of the installed physical memory.&#8221;</p></blockquote><div><hr></div><div class="callout-block" data-callout="true"><h3><strong>Key Takeaway</strong></h3><p>Because the address space is virtual, its size can be much larger than the installed RAM. In the common 48-bit x86-64 virtual-address mode, the canonical virtual address range spans 256 TiB. Linux typically splits this into a lower user-space half and an upper kernel-space half. The lower 128 TiB is available to user processes, while the upper half is reserved for kernel mappings used when execution enters kernel mode. Physical address capacity is separate from virtual address capacity and depends on the CPU and platform.</p></div><div><hr></div><h2>The Virtual Memory Address Space Layout</h2><blockquote><p><strong>Alloca</strong>: &#8220;You mentioned that you map your code and data in the upper half of my address space. What is mapped in my half of the address space?&#8221;</p><p><strong>Kernel</strong>: &#8220;Your half of the address space maps your code and your data.&#8221;</p><p><strong>Alloca</strong>: &#8220;What does it look like? Is there a specific structure?&#8221;</p><p><strong>Kernel</strong>: &#8220;Yes, there is a specific layout to your address space. It is organized in the form of segments, each designated to map a certain kind of data.
Let me show you how it looks.&#8221;</p></blockquote><p><em>Kernel gestures, and Alloca can suddenly see a vertical map of her virtual memory</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!InhV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2833505-2450-46a4-b342-a7431eb29083_1360x2240.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!InhV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2833505-2450-46a4-b342-a7431eb29083_1360x2240.png 424w, https://substackcdn.com/image/fetch/$s_!InhV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2833505-2450-46a4-b342-a7431eb29083_1360x2240.png 848w, https://substackcdn.com/image/fetch/$s_!InhV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2833505-2450-46a4-b342-a7431eb29083_1360x2240.png 1272w, https://substackcdn.com/image/fetch/$s_!InhV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2833505-2450-46a4-b342-a7431eb29083_1360x2240.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!InhV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2833505-2450-46a4-b342-a7431eb29083_1360x2240.png" width="1360" height="2240" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d2833505-2450-46a4-b342-a7431eb29083_1360x2240.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2240,&quot;width&quot;:1360,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:256256,&quot;alt&quot;:&quot;Figure 1: The canonical virtual address space layout on x86-64 Linux. The text, data, and BSS segments have sizes determined at compile time. The heap grows upward from the data region; the stack grows downward from near the top of user space. Between them, shared libraries and file mappings float in the large middle region. The kernel occupies the upper half of the full canonical range (not shown to scale).&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/190799146?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2833505-2450-46a4-b342-a7431eb29083_1360x2240.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 1: The canonical virtual address space layout on x86-64 Linux. The text, data, and BSS segments have sizes determined at compile time. The heap grows upward from the data region; the stack grows downward from near the top of user space. Between them, shared libraries and file mappings float in the large middle region. The kernel occupies the upper half of the full canonical range (not shown to scale)." title="Figure 1: The canonical virtual address space layout on x86-64 Linux. The text, data, and BSS segments have sizes determined at compile time. The heap grows upward from the data region; the stack grows downward from near the top of user space. Between them, shared libraries and file mappings float in the large middle region. 
The kernel occupies the upper half of the full canonical range (not shown to scale)." srcset="https://substackcdn.com/image/fetch/$s_!InhV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2833505-2450-46a4-b342-a7431eb29083_1360x2240.png 424w, https://substackcdn.com/image/fetch/$s_!InhV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2833505-2450-46a4-b342-a7431eb29083_1360x2240.png 848w, https://substackcdn.com/image/fetch/$s_!InhV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2833505-2450-46a4-b342-a7431eb29083_1360x2240.png 1272w, https://substackcdn.com/image/fetch/$s_!InhV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2833505-2450-46a4-b342-a7431eb29083_1360x2240.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Figure 1: The canonical virtual address space layout on x86-64 Linux. The text, data, and BSS segments have sizes determined at compile time. The heap grows upward from the data region; the stack grows downward from near the top of user space. Between them, shared libraries and file mappings float in the large middle region. The kernel occupies the upper half of the full canonical range (not shown to scale).</em></figcaption></figure></div><blockquote><p><strong>Kernel</strong>: &#8220;Down at the bottom, at low addresses, is your code. These are the instructions that you execute. This region is loaded when I created you. We call this the <em>text segment</em>.&#8221;</p><p><strong>Alloca</strong>: &#8220;Makes sense. Above that I see there is the <em>data segment</em>; I assume it maps all the other data?&#8221;</p><p><strong>Kernel</strong>: &#8220;Not all the data, but a specific kind of data. Any global and static variables in your code that were initialized to non-zero values are loaded here. For example, if you created a constant <code>pi</code> with value <code>3.14</code>, it will be in the data segment.&#8221;</p><p><strong>Alloca</strong>: &#8220;What about uninitialized global data? Where does that go?&#8221;</p><p><strong>Kernel</strong>: &#8220;The <em>bss segment</em>.&#8221;</p><p><strong>Alloca</strong>: &#8220;Why a separate segment for that?&#8221;</p><p><strong>Kernel</strong>: &#8220;Ah, it&#8217;s a clever trick for efficiency.
Think about it: if you have a global variable that&#8217;s uninitialized, what value should it have when your program starts?&#8221;</p><p><strong>Alloca</strong>: &#8220;Zero, I suppose.&#8221;</p><p><strong>Kernel</strong>: &#8220;Exactly! Now imagine you have thousands of these zero-initialized globals. If we stored all those zeros in your compiled binary, the file would be bloated with zeros. That&#8217;s wasteful. So instead of doing that, the compiler and linker just make a note saying &#8216;hey, this program needs, say, 50 kilobytes of zero-initialized memory.&#8217; They don&#8217;t actually put those zeros in the binary file. Then, when I load your program, I allocate that 50 KB, fill it with zeros, and map it into your address space as the BSS segment. Your binary stays small, loads faster, and you still get all your zero-initialized variables. Everyone wins.&#8221;</p><p><strong>Alloca</strong>: &#8220;That&#8217;s clever! So the data and the bss segments are where all the static data goes. What about dynamic data? For example, when I add a new node to a linked list at runtime, does that memory get allocated in one of these segments?&#8221;</p><p><strong>Kernel</strong>: &#8220;No, it can&#8217;t be. Think about it: can the data or BSS segments grow after your program starts?&#8221;</p><p><strong>Alloca</strong>: &#8220;I guess not? You said their sizes are determined at compile and link time.&#8221;</p><p><strong>Kernel</strong>: &#8220;Correct! They map your program&#8217;s static memory footprint based on everything the compiler knew from the code when it built your binary. But at runtime, you need to allocate memory dynamically. You might read a file and build a tree from its contents. The compiler had no way to know how much memory you&#8217;d need for that.&#8221;</p><p><strong>Alloca</strong>: &#8220;So where does that memory come from?&#8221;</p><p><strong>Kernel</strong>: &#8220;That&#8217;s what the heap is for. 
It sits right above BSS, and as you can see from the diagram, there&#8217;s a large stretch of empty address space above it.&#8221;</p><p><strong>Alloca</strong>: &#8220;So the heap can grow into that empty space?&#8221;</p><p><strong>Kernel</strong>: &#8220;Precisely! When you call <a href="https://man7.org/linux/man-pages/man3/malloc.3.html">malloc()</a>, the allocator typically grows the heap upward by adjusting its upper boundary. We call that boundary the <em>program break</em>, or just <code>brk</code> for short. Each time you need more memory, the heap can expand upward into that unused region.&#8221;</p><p><strong>Alloca</strong>: &#8220;I see. But looking at the diagram, that empty region above the heap is enormous compared to everything else. The heap, stack, and all the segments look tiny by comparison. What is all that space?&#8221;</p><p><strong>Kernel</strong>: &#8220;That space is basically the unmapped part of your address space.&#8221;</p><p><strong>Alloca</strong>: &#8220;Unmapped? Why are there unmapped addresses?&#8221;</p><p><strong>Kernel</strong>: &#8220;Glad you asked; it&#8217;s really important to understand this part. Remember when we talked about the size of your virtual address space being 128 TiB?&#8221;</p><p><strong>Alloca</strong>: &#8220;Yeah, you said that&#8217;s way bigger than the actual physical RAM in the machine.&#8221;</p><p><strong>Kernel</strong>: &#8220;Yeah. A typical machine might have 16 or 32 GB of physical RAM. Even a beefy server with 256 GB of RAM is nowhere close to 128 TiB. So, it is not practically possible to map all of your virtual addresses to physical memory because there is simply not enough of it.
And, even if there is a machine with 128 TiB of RAM installed, it doesn&#8217;t make sense to map all of it.&#8221;</p><p><strong>Alloca</strong>: &#8220;Why not?&#8221;</p><p><strong>Kernel</strong>: &#8220;Because most programs probably use a few hundred megabytes at most, so the clever thing to do is to allocate and map only the required amount of memory to the process, leave the rest unmapped, and map it lazily based on demand.&#8221;</p><p><strong>Alloca</strong>: &#8220;So what happens if I try to access one of those unmapped addresses?&#8221;</p><p><strong>Kernel</strong>: &#8220;Well, if it&#8217;s an address I gave you, say from a successful <code>malloc()</code> or <a href="https://man7.org/linux/man-pages/man2/mmap.2.html">mmap()</a> call, then it&#8217;s yours to use. But if you just pick a random address in that unmapped region and try to read or write it, you&#8217;ll get a <a href="https://en.wikipedia.org/wiki/Segmentation_fault">segmentation fault</a>. The hardware will refuse the access because there&#8217;s no valid mapping.&#8221;</p><p><strong>Alloca</strong>: &#8220;Got it. So the unmapped region isn&#8217;t just empty space, it&#8217;s reserved space that can become mapped as needed?&#8221;</p><p><strong>Kernel</strong>: &#8220;Exactly! And it gets mapped for several purposes. When you load a shared library, like <code>libc.so</code>, I need to map its code and data somewhere in your address space. That middle region is where those libraries go. Same with file mappings: when you use <code>mmap()</code> to map a file into memory, it gets mapped here. Large allocations from <code>malloc()</code> also often come from this region rather than growing the heap.&#8221;</p><p><strong>Alloca</strong>: &#8220;So it&#8217;s a flexible region for all kinds of dynamic mappings?&#8221;</p><p><strong>Kernel</strong>: &#8220;Precisely!
It&#8217;s the largest part of your address space, and it&#8217;s there to accommodate whatever dynamic memory needs arise during your execution.&#8221;</p><p><strong>Alloca</strong>: &#8220;That leaves the stack at the top. What is that?&#8221;</p><p><strong>Kernel</strong>: &#8220;It is a dedicated region for managing function calls. Every time you call a function, the stack is involved.&#8221;</p><p><strong>Alloca</strong>: &#8220;Why does calling a function need its own memory region? Why not use one of the other segments?&#8221;</p><p><strong>Kernel</strong>: &#8220;Let&#8217;s think about what needs to happen when you call a function. What kind of data does a function need?&#8221;</p><p><strong>Alloca</strong>: &#8220;Well, its local variables, I suppose. And probably the return address so it knows where to jump back to when it&#8217;s done?&#8221;</p><p><strong>Kernel</strong>: &#8220;Exactly! And also the CPU register values that need to be saved and later restored when the function returns. Now, all of this needs to be allocated when a function is called and cleaned up automatically when it returns. Which of the segments we&#8217;ve discussed could handle something like this?&#8221;</p><p><strong>Alloca</strong>: &#8220;Not the data or BSS segments, those are fixed in size. They can&#8217;t grow and shrink.&#8221;</p><p><strong>Kernel</strong>: &#8220;What about the heap?&#8221;</p><p><strong>Alloca</strong>: &#8220;The heap can grow, but I&#8217;d have to explicitly <code>malloc</code> and <code>free</code>, right? That would be tedious, slow, and error-prone for every function call.&#8221;</p><p><strong>Kernel</strong>: &#8220;Yeah, what you need is a region that grows and shrinks <em>automatically</em> as functions are called and return. It needs to follow a very specific pattern: the last function you called is the first one that returns. Does that sound familiar?&#8221;</p><p><strong>Alloca</strong>: &#8220;That&#8217;s&#8230; last-in-first-out. 
Like a stack data structure!&#8221;</p><p><strong>Kernel</strong>: &#8220;Precisely! That&#8217;s why we call it the stack. The processor even has dedicated instructions, <code>push</code> and <code>pop</code>, that work with a special register called the stack pointer. This register tracks the current top of the stack. When you call a function, all its data (local variables, saved registers, return address) ends up on the stack. When you return, that block gets popped off. All automatic, no manual memory management needed.&#8221;</p><p><strong>Alloca</strong>: &#8220;So it&#8217;s about automatic lifetime management for function-local data. But what happens if there is a very deep chain of function calls? Can the stack grow indefinitely?&#8221;</p><p><strong>Kernel</strong>: &#8220;Not quite. As one function calls another, space needs to be made on the stack to accommodate the local variables of the called function. But there is a limit to how much the stack can grow. For example, on Linux x86-64, the default maximum stack size is 8 MiB.&#8221;</p><p><strong>Alloca</strong>: &#8220;But from what I can see, the stack is right at the top of the address space. Where does it have room to grow?&#8221;</p><p><strong>Kernel</strong>: &#8220;Good observation! The stack is usually mapped at the higher address range and it grows by moving towards the lower address ranges. So, for example, if the stack pointer is currently 0x120008 and you push an 8-byte value on the stack, the stack pointer becomes 0x120000.&#8221;</p><p><strong>Alloca</strong>: &#8220;So the heap grows upward and the stack grows downward?&#8221;</p><p><strong>Kernel</strong>: &#8220;Yes. The empty space between them is the buffer that lets both grow without colliding. In practice, a process runs out of one or the other long before they meet.&#8221;</p><p><strong>Alloca</strong>: &#8220;Okay, I understand the layout now. But I have one final question about it: what is the need for such a layout? 
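The downward growth is observable from C. A sketch assuming a typical x86-64 Linux toolchain (the `noinline` attribute is GCC/Clang-specific, and comparing addresses of different locals is only meaningful after converting them to integers); both helpers are illustrative:

```c
#include <stdint.h>

/* A callee's locals live in a frame pushed below the caller's frame.
 * noinline keeps the compiler from flattening the call away. */
__attribute__((noinline))
static uintptr_t callee_local_addr(void) {
    volatile int local = 0;
    return (uintptr_t)&local;
}

/* Returns 1 if a local one call deeper sits at a lower address,
 * i.e. the stack grows toward lower addresses (true on x86-64 Linux). */
static int stack_grows_down(void) {
    volatile int local = 0;
    return callee_local_addr() < (uintptr_t)&local;
}
```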
Why not simply store data anywhere you find space?&#8221;</p><p><strong>Kernel</strong>: &#8220;Great question! There are two big reasons: performance and security. Which one would you like to hear about first?&#8221;</p><p><strong>Alloca</strong>: &#8220;Let&#8217;s start with performance.&#8221;</p><p><strong>Kernel</strong>: &#8220;Alright. Tell me, if you are reading a value from an array at index 5, what do you do after that?&#8221;</p><p><strong>Alloca</strong>: &#8220;Well, I probably would read index 6, then 7, and so on? Most array processing is sequential like that.&#8221;</p><p><strong>Kernel</strong>: &#8220;Exactly! And when you&#8217;re executing instructions in your code, you typically run them one after another, right? You&#8217;re not randomly jumping all over the place.&#8221;</p><p><strong>Alloca</strong>: &#8220;Right, except for loops and function calls, it&#8217;s mostly sequential.&#8221;</p><p><strong>Kernel</strong>: &#8220;Yes! This pattern of accessing nearby memory locations is so common that the hardware is designed around it. But, fetching data from physical memory is slow. Really slow. It can take hundreds of CPU cycles.&#8221;</p><p><strong>Alloca</strong>: &#8220;That sounds terrible!&#8221;</p><p><strong>Kernel</strong>: &#8220;It would be, if the CPU actually went to main memory for every single read. But it doesn&#8217;t. The CPU has a fast cache, smaller but much faster storage right on the chip. And this is the clever bit: when you read a value from memory, the hardware doesn&#8217;t just fetch that one value. It fetches an entire block around it, typically 64 bytes, called a cache line.&#8221;</p><p><strong>Alloca</strong>: &#8220;So it&#8217;s betting that I&#8217;ll need the nearby data too?&#8221;</p><p><strong>Kernel</strong>: &#8220;Precisely! And because of how you traverse arrays or execute sequential instructions, that bet pays off most of the time. 
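The 64-byte granularity can be made concrete: two addresses land in the same cache line exactly when they fall in the same 64-byte-aligned block, so one fetch brings both in. A sketch (real line sizes vary by CPU; 64 bytes is the common x86-64 value):

```c
#include <stdint.h>

#define CACHE_LINE 64   /* typical x86-64 cache line size */

/* Two addresses hit the same cache line iff they fall in the same
 * 64-byte-aligned block of memory. */
static inline int same_cache_line(uintptr_t a, uintptr_t b) {
    return (a / CACHE_LINE) == (b / CACHE_LINE);
}
```

For an array of 8-byte elements, `array[5]` and `array[6]` usually satisfy this, which is why sequential traversal is so cache-friendly.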
The next value you need is already sitting in the cache, ready instantly. This is called spatial locality.&#8221;</p><p><strong>Alloca</strong>: &#8220;Ah, so that&#8217;s why the organized layout helps! If my heap has all my data structures, and I&#8217;m traversing a linked list, the nodes are likely to be near each other in memory?&#8221;</p><p><strong>Kernel</strong>: &#8220;Well, linked lists are actually a bad example, their nodes can be scattered all over the heap. But arrays, yes! And more importantly, think about your stack. When you&#8217;re executing a function, you&#8217;re constantly accessing its local variables. Because they&#8217;re all packed together in one stack frame, most of those accesses hit the cache.&#8221;</p><p><strong>Alloca</strong>: &#8220;And the same applies to code in the text segment?&#8221;</p><p><strong>Kernel</strong>: &#8220;Exactly. Your instructions execute sequentially, so the processor can even prefetch the next cache line before you ask for it. By keeping code separate from data, and keeping different types of data in their own regions, we maximize these cache-friendly access patterns.&#8221;</p><p><strong>Alloca</strong>: &#8220;That makes sense! What about security? How does the layout help there?&#8221;</p><p><strong>Kernel</strong>: &#8220;Let me ask you this: if an attacker managed to write arbitrary bytes into your heap, say through a buffer overflow bug, what&#8217;s the worst thing they could do?&#8221;</p><p><strong>Alloca</strong>: &#8220;Um, corrupt my data structures? Make my program crash?&#8221;</p><p><strong>Kernel</strong>: &#8220;That&#8217;s bad, but there&#8217;s something worse. What if those bytes they wrote were actually machine instructions? 
And what if they then tricked your program into jumping to that address?&#8221;</p><p><strong>Alloca</strong>: &#8220;Oh no&#8230; then the CPU would execute their malicious code as if it were part of my program!&#8221;</p><p><strong>Kernel</strong>: &#8220;Exactly. And without protection, they could also try to overwrite your actual code in the text segment, inserting a backdoor directly into your program.&#8221;</p><p><strong>Alloca</strong>: &#8220;So how do we prevent that?&#8221;</p><p><strong>Kernel</strong>: &#8220;By giving each segment permission bits. Think about what should be allowed for each segment. Should you be able to write to your code segment?&#8221;</p><p><strong>Alloca</strong>: &#8220;No, the code is fixed! It shouldn&#8217;t change while the program runs.&#8221;</p><p><strong>Kernel</strong>: &#8220;Right. So the text segment is marked read-only and executable: you can run code from it, but you cannot write to it. Now, what about your heap and stack?&#8221;</p><p><strong>Alloca</strong>: &#8220;I need to read and write data there all the time. But I should never execute code from there, right?&#8221;</p><p><strong>Kernel</strong>: &#8220;Perfect! The heap and stack are marked read-write but <em>not</em> executable. You can modify your data, but if someone tries to jump to an address in the heap and execute it, the processor will refuse and kill your process.&#8221;</p><p><strong>Alloca</strong>: &#8220;So by separating code from data, we can enforce different permissions on each?&#8221;</p><p><strong>Kernel</strong>: &#8220;Precisely. This is often called W^X protection (write XOR execute). Memory can be writable or executable, but not both. 
By organizing memory into distinct segments, we make this protection model clean and enforceable.&#8221;</p></blockquote><div class="callout-block" data-callout="true"><h3><strong>Key Takeaway</strong></h3><p>The virtual address space is organized into several distinct <em>segments</em>:</p><ul><li><p><strong>Text (code) segment</strong>: The compiled instructions of the program. Loaded at startup, mapped read-only and executable. The process cannot write to its own code pages.</p></li><li><p><strong>Data segment</strong>: Global and static variables that have been explicitly initialized. Size is fixed at link time.</p></li><li><p><strong>BSS segment</strong>: Global and static variables that are zero-initialized. The binary stores no data for this region; the loader provides zero-initialized memory for it at startup.</p></li><li><p><strong>Heap</strong>: The region for dynamic memory allocation (<code>malloc</code>/<code>new</code>). Starts just above the data/BSS segments and grows upward for small allocations; its upper boundary is called the <em>program break</em> (<code>brk</code>). Many allocators also use <code>mmap</code> directly for large allocations rather than growing the heap via <code>brk</code>.</p></li><li><p><strong>Memory-mapped region</strong>: A large, flexible area in the middle of the address space used for shared libraries, file mappings, and anonymous large allocations. Libraries like <code>libc</code> are loaded here.</p></li><li><p><strong>Stack</strong>: Holds the call frames of all currently executing functions. Starts near the top of the address space and grows downward. 
Each function call pushes a frame containing local variables, saved registers, and the return address; each return pops it.</p></li></ul></div><div><hr></div><div class="callout-block" data-callout="true"><p style="text-align: center;"><strong>Aside: Anonymous memory</strong></p><p>Throughout the article, we will come across the term &#8220;<em>anonymous memory</em>&#8221;, so it is important to understand what it means.</p><p>The kernel manages two kinds of memory:</p><ul><li><p><strong>Anonymous memory</strong>: this is memory allocated via <code>malloc</code>, or via <code>mmap</code> with the <code>MAP_ANONYMOUS</code> flag. This is also the memory backing a process&#8217;s heap, stack and similar segments.</p></li><li><p><strong>File-backed memory</strong>: this is memory backed by a file. You normally create it by calling <code>mmap</code> with a file descriptor.</p></li></ul><p>We will cover both of these in detail as we progress through the article, but having this common vocabulary will help us move faster.</p></div><div><hr></div><h2>How are Virtual Addresses Translated to Physical Addresses?</h2><blockquote><p><strong>Alloca</strong>: &#8220;I understand the layout. Code down here, stack up there. But these are all <em>virtual</em> addresses. How does a virtual address ever become real? I&#8217;m imagining you keep a table, virtual byte 0 maps to physical byte X, virtual byte 1 maps to physical byte Y, one entry for every address. Is that how it works?&#8221;</p><p><strong>Kernel</strong>: &#8220;That&#8217;s the natural first thought. Let&#8217;s see what it costs. Your address space (the user-space half) is 128 TiB, that&#8217;s roughly 140 trillion bytes. At 8 bytes per table entry, a per-byte mapping table would take 1 PiB of storage per process. That&#8217;s impractical.&#8221;</p><p><strong>Alloca</strong>: &#8220;So a per-byte table is out. 
But you do need a lookup of some kind.&#8221;</p><p><strong>Kernel</strong>: &#8220;Yes, we do. But, instead of mapping individual bytes, we map fixed-size chunks. I divide your virtual address space into fixed-size chunks called <em>pages</em>, and I divide physical memory into same-sized chunks called <em>frames</em>. Each virtual page maps to one physical frame at a time. One table entry per page, not per byte. This way we don&#8217;t waste too much space maintaining the mapping itself.&#8221;</p><p><strong>Alloca</strong>: &#8220;How large are these chunks?&#8221;</p><p><strong>Kernel</strong>: &#8220;4 kilobytes. At that size, your 128 TiB address space divides into 2<sup>35</sup> pages.&#8221;</p><p><strong>Alloca</strong>: &#8220;Wait, why 4 kilobytes specifically? Why not map smaller chunks like 1 kilobyte, or larger ones like 64 kilobytes?&#8221;</p><p><strong>Kernel</strong>: &#8220;Good question! Let me ask you this: when you read a variable from memory, say an integer, do you usually read just that one value and nothing else nearby?&#8221;</p><p><strong>Alloca</strong>: &#8220;Well, no. If I&#8217;m reading <code>array[5]</code>, I probably read <code>array[6]</code> and <code>array[7]</code> soon after. And when executing code, I run instructions sequentially, one after another.&#8221;</p><p><strong>Kernel</strong>: &#8220;Exactly! Memory accesses happen in clusters, spatial locality again. The hardware already exploits this with 64-byte cache lines; pages work the same way at a coarser scale. 4 KB is a sweet spot: large enough that related data usually falls within the same page, but also small enough that we don&#8217;t waste physical memory when only part of a page is touched.&#8221;</p><p><strong>Alloca</strong>: &#8220;So 4 KB is a sweet spot between granularity and efficiency?&#8221;</p><p><strong>Kernel</strong>: &#8220;Right. And because every page and every frame is exactly the same size, any free frame can back any page. 
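The page count above is simple arithmetic: 128 TiB is 2<sup>47</sup> bytes, and dividing by the 4 KiB page size leaves 2<sup>35</sup> pages. A sketch of the numbers:

```c
#include <stdint.h>

#define PAGE_SIZE     (1ULL << 12)   /* 4 KiB pages */
#define USER_VA_BYTES (1ULL << 47)   /* 128 TiB: the user half of the address space */

/* Number of 4 KiB pages the user half divides into: 2^47 / 2^12 = 2^35. */
static inline uint64_t num_user_pages(void) { return USER_VA_BYTES / PAGE_SIZE; }
```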
It doesn&#8217;t matter where in physical memory that frame happens to sit.&#8221;</p><p><strong>Alloca</strong>: &#8220;Okay, I understand the page size. But there is something that I still don&#8217;t get: you&#8217;re mapping an entire 4 KB page to an entire 4 KB frame. But, I have a specific address, and I want to read 8 bytes from it. How do you find out which virtual page that address belongs to, to get the corresponding physical frame?&#8221;</p><p><strong>Kernel</strong>: &#8220;The answer lies in the virtual address itself. Think of it like a library call number. When a librarian gives you the number <em>3-07-42</em>, you know immediately that the book is on floor 3, rack 07, shelf 42. The number encodes two things at once: which shelf unit to find, and where within that unit to look. A virtual address works the same way. It encodes which page the address falls in, and the byte position within that page.&#8221;</p><p><strong>Alloca</strong>: &#8220;So the address itself tells you both the page and the position inside it?&#8221;</p><p><strong>Kernel</strong>: &#8220;Yes. Every virtual address is implicitly two things: the <em>virtual page number</em>, given by the upper bits, and the <em>page offset</em> given by the lower 12 bits. 12 bits because 2&#185;&#178; = 4096, one for every byte in a page. Say your address points 500 bytes into page N. When I map page N to physical frame M, your data is still 500 bytes in, because the frame is the same 4 KB size. The offset does not change during translation. So I look up the virtual page number in your page table, get back the physical frame number, attach the same offset, and that gives the physical address of exactly the 8 bytes you asked for.&#8221;</p><p><strong>Alloca</strong>: &#8220;Okay, I understand that part. But something is still not clear. You said that my address space is 128 TiB. If there&#8217;s one page table entry per 4 KB page, that&#8217;s 2<sup>35</sup> entries. 
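The page-number/offset split described above is just bit manipulation; these helpers (names are illustrative) mirror it for 4 KiB pages:

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  ((1ULL << PAGE_SHIFT) - 1)

/* Upper bits: which virtual page the address falls in. */
static inline uint64_t vpn(uint64_t va)         { return va >> PAGE_SHIFT; }

/* Lower 12 bits: byte position inside that page, unchanged by translation. */
static inline uint64_t page_offset(uint64_t va) { return va & PAGE_MASK; }

/* Attach the frame number looked up in the page table to the same offset. */
static inline uint64_t phys_addr(uint64_t frame, uint64_t va) {
    return (frame << PAGE_SHIFT) | page_offset(va);
}
```

An address 500 bytes into page 7, mapped to frame 3, translates to 500 bytes into frame 3.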
At 8 bytes per entry, that&#8217;s 256 GiB of page table. Per process. That&#8217;s not workable.&#8221;</p><p><strong>Kernel</strong>: &#8220;Exactly, that&#8217;s the problem with a flat table. So let me ask you this: what if, instead of tracking every single page, we tracked which <em>regions</em> of your address space are in use?&#8221;</p><p><strong>Alloca</strong>: &#8220;Regions? Like groups of pages?&#8221;</p><p><strong>Kernel</strong>: &#8220;Yes. Think about your address space. You have code at the bottom, a heap above that, maybe some libraries in the middle, and a stack at the top. Most of the space between them is empty, right?&#8221;</p><p><strong>Alloca</strong>: &#8220;Right, huge stretches of unused addresses.&#8221;</p><p><strong>Kernel</strong>: &#8220;So what if I had a high-level index that just tracks which large regions are in use, and then within each of those regions, I have another index for smaller regions, and so on, until I get down to individual pages?&#8221;</p><p><strong>Alloca</strong>: &#8220;Like&#8230; a tree structure? Where each level zooms in on a smaller portion?&#8221;</p><p><strong>Kernel</strong>: &#8220;Precisely! It&#8217;s called a <em>hierarchical</em> page table. There are four levels. At the top level, there&#8217;s a table with 512 entries, and each entry represents 512 GB of your address space. If an entire 512 GB region is unused, that entry is just marked absent; no further tables are allocated for it.&#8221;</p><p><strong>Alloca</strong>: &#8220;So you only allocate the deeper levels of the tree for the parts I&#8217;m actually using?&#8221;</p><p><strong>Kernel</strong>: &#8220;Yeah. Each entry at the top level can point to a second-level table, which again has 512 entries, each covering 1 GB. 
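The per-level coverage follows directly from 512 entries per table: each level up multiplies the span by 512. A sketch of the spans (macro names are illustrative):

```c
/* Address-space span covered by one entry at each page-table level.
 * Every table has 512 entries, so each level multiplies by 512. */
#define ENTRIES  512ULL
#define PTE_SPAN (1ULL << 12)           /* one entry maps a 4 KiB page */
#define PMD_SPAN (ENTRIES * PTE_SPAN)   /* 2 MiB   */
#define PUD_SPAN (ENTRIES * PMD_SPAN)   /* 1 GiB   */
#define PGD_SPAN (ENTRIES * PUD_SPAN)   /* 512 GiB */
```

512 top-level entries of 512 GiB each cover the full 2<sup>48</sup>-byte (256 TiB) address space: both the user half and the kernel half.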
Each of those can point to a third-level table covering smaller regions, and so on, until the deepest level maps to individual 4 KB pages.&#8221;</p><p><strong>Alloca</strong>: &#8220;But wait, doesn&#8217;t having four levels still waste space? If I use just one page, don&#8217;t you still need entries at every level to reach it?&#8221;</p><p><strong>Kernel</strong>: &#8220;Yes, but consider the scale. For that one used page, I need one entry in the top-level table, one second-level table with 512 entries, one third-level table with 512 entries, and one fourth-level table with 512 entries. That&#8217;s roughly 12 KB total. Compare that to a flat table: 2<sup>35</sup> entries times 8 bytes equals 256 GiB. I save a factor of 20 million.&#8221;</p><p><strong>Alloca</strong>: &#8220;So the table itself only exists for the parts of my address space I&#8217;ve actually used.&#8221;</p><p><strong>Kernel</strong>: &#8220;Correct!&#8221;</p></blockquote><div class="callout-block" data-callout="true"><p style="text-align: center;"><strong>Aside: Page table level names difference between Linux and x86</strong></p><p>The four levels of the page table hierarchy have different names depending on whether you&#8217;re reading Linux kernel source or Intel/AMD architecture manuals.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zZFr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1fbd98b-cec8-443a-bd3d-2e714b1acc43_770x173.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zZFr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1fbd98b-cec8-443a-bd3d-2e714b1acc43_770x173.png 424w, 
https://substackcdn.com/image/fetch/$s_!zZFr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1fbd98b-cec8-443a-bd3d-2e714b1acc43_770x173.png 848w, https://substackcdn.com/image/fetch/$s_!zZFr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1fbd98b-cec8-443a-bd3d-2e714b1acc43_770x173.png 1272w, https://substackcdn.com/image/fetch/$s_!zZFr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1fbd98b-cec8-443a-bd3d-2e714b1acc43_770x173.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zZFr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1fbd98b-cec8-443a-bd3d-2e714b1acc43_770x173.png" width="770" height="173" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e1fbd98b-cec8-443a-bd3d-2e714b1acc43_770x173.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:173,&quot;width&quot;:770,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:36670,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/190799146?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1fbd98b-cec8-443a-bd3d-2e714b1acc43_770x173.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zZFr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1fbd98b-cec8-443a-bd3d-2e714b1acc43_770x173.png 
424w, https://substackcdn.com/image/fetch/$s_!zZFr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1fbd98b-cec8-443a-bd3d-2e714b1acc43_770x173.png 848w, https://substackcdn.com/image/fetch/$s_!zZFr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1fbd98b-cec8-443a-bd3d-2e714b1acc43_770x173.png 1272w, https://substackcdn.com/image/fetch/$s_!zZFr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1fbd98b-cec8-443a-bd3d-2e714b1acc43_770x173.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Table: Naming convention for page table levels in Linux vs x86 architecture</figcaption></figure></div><p>The x86 names are tied to the specific architecture. The Linux names are more generic and are used consistently across architectures that Linux supports, whether that&#8217;s x86-64, ARM64, or RISC-V, even when the underlying hardware has a different number of levels. Throughout this article we use the Linux kernel names: PGD, PUD, PMD, and PTE.</p></div><blockquote><p><strong>Alloca</strong>: &#8220;But how does a virtual address help you traverse this?&#8221;</p><p><strong>Kernel</strong>: &#8220;It&#8217;s actually pretty clever. Your virtual addresses are 64 bits wide, but only 48 bits are used. Those 48 bits are split into five parts: four groups of 9 bits each, followed by a 12-bit offset. The first four groups are used one by one to step through each level of the page table tree, narrowing down to the right physical frame. 
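The four 9-bit groups and the 12-bit offset can be peeled out of an address with shifts and masks. A sketch (helper names are illustrative, modeled on the Linux level names):

```c
#include <stdint.h>

/* The four 9-bit index groups and the 12-bit offset of a 48-bit
 * virtual address, one index per page-table level. */
static inline unsigned pgd_index(uint64_t va) { return (va >> 39) & 0x1FF; }
static inline unsigned pud_index(uint64_t va) { return (va >> 30) & 0x1FF; }
static inline unsigned pmd_index(uint64_t va) { return (va >> 21) & 0x1FF; }
static inline unsigned pte_index(uint64_t va) { return (va >> 12) & 0x1FF; }
static inline unsigned page_off (uint64_t va) { return  va        & 0xFFF; }
```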
The offset is then used to pinpoint the exact byte within that frame.&#8221;</p><p><strong>Alloca</strong>: &#8220;What is the exact split of these bits?&#8221;</p><p><strong>Kernel</strong>: &#8220;The first group (bits 47 down to 39) gives a number between 0 and 511, which I use as an index into the PGD. That entry points me to a PUD. I take the next group (bits 38 down to 30) and index into that PUD, which points to a PMD. I repeat this for the PMD and PTE levels.&#8221;</p><p><strong>Alloca</strong>: &#8220;That leaves the bottom 12 bits, those act as offset within the page frame?&#8221;</p><p><strong>Kernel</strong>: &#8220;Yes, once you reach the PTE and get the physical frame number, you combine it with those 12 bits to get the exact byte you want. 12 bits because 2<sup>12</sup> is 4096, the page size.&#8221;</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!32TE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b27b81e-9230-4c50-96a4-bb6a96a6cc72_6400x4240.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!32TE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b27b81e-9230-4c50-96a4-bb6a96a6cc72_6400x4240.png 424w, https://substackcdn.com/image/fetch/$s_!32TE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b27b81e-9230-4c50-96a4-bb6a96a6cc72_6400x4240.png 848w, https://substackcdn.com/image/fetch/$s_!32TE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b27b81e-9230-4c50-96a4-bb6a96a6cc72_6400x4240.png 1272w, 
https://substackcdn.com/image/fetch/$s_!32TE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b27b81e-9230-4c50-96a4-bb6a96a6cc72_6400x4240.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!32TE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b27b81e-9230-4c50-96a4-bb6a96a6cc72_6400x4240.png" width="1456" height="965" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6b27b81e-9230-4c50-96a4-bb6a96a6cc72_6400x4240.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:965,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 2: The four-level page table hierarchy on x86-64. To translate a virtual address, four groups of 9 bits (i, j, k, l) are used as indices, one per level, to walk down the tree to the right page frame. The final 12 bits give the byte offset within that frame. Sub-tables are only created for parts of the address space that are actually mapped, so unused regions cost nothing.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 2: The four-level page table hierarchy on x86-64. To translate a virtual address, four groups of 9 bits (i, j, k, l) are used as indices, one per level, to walk down the tree to the right page frame. The final 12 bits give the byte offset within that frame. Sub-tables are only created for parts of the address space that are actually mapped, so unused regions cost nothing." title="Figure 2: The four-level page table hierarchy on x86-64. 
To translate a virtual address, four groups of 9 bits (i, j, k, l) are used as indices, one per level, to walk down the tree to the right page frame. The final 12 bits give the byte offset within that frame. Sub-tables are only created for parts of the address space that are actually mapped, so unused regions cost nothing." srcset="https://substackcdn.com/image/fetch/$s_!32TE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b27b81e-9230-4c50-96a4-bb6a96a6cc72_6400x4240.png 424w, https://substackcdn.com/image/fetch/$s_!32TE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b27b81e-9230-4c50-96a4-bb6a96a6cc72_6400x4240.png 848w, https://substackcdn.com/image/fetch/$s_!32TE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b27b81e-9230-4c50-96a4-bb6a96a6cc72_6400x4240.png 1272w, https://substackcdn.com/image/fetch/$s_!32TE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b27b81e-9230-4c50-96a4-bb6a96a6cc72_6400x4240.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><em>Figure 2: The four-level page table hierarchy on x86-64. To translate a virtual address, four groups of 9 bits (i, j, k, l) are used as indices, one per level, to walk down the tree to the right page frame. The final 12 bits give the byte offset within that frame. Sub-tables are only created for parts of the address space that are actually mapped, so unused regions cost nothing.</em></figcaption></figure></div><div class="callout-block" data-callout="true"><p style="text-align: center;"><strong>Aside: 48-bit virtual addresses</strong></p><p>On common 4-level x86-64 systems, virtual addresses are stored in 64-bit registers, but you may have noticed that only 48 bits participate in this address translation scheme. What about the top 16 bits?</p><p>The top 16 bits must be a sign-extension of bit 47: all zeroes for low-half user-space addresses, all ones for high-half kernel-space addresses. Such addresses are called <em>canonical addresses</em>. A non-canonical address faults before the normal page-table walk even completes. 
This is what creates the large unused gap between the low and high halves of the 64-bit virtual address space.</p><p>Recent x86-64 processors and Linux kernels also support 5-level page tables, which use 57 bits for address translation (adding a fifth level called P4D (Page 4th Directory) between the PGD and PUD). This provides 2&#8309;&#8311; bytes (128 PiB) of virtual address space per process. The additional level uses bits [56:48] as an index, with bits [63:57] remaining as sign-extension of bit 56.</p></div><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.codingconfessions.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"><em>This article took months of research, writing, and revision. If it helped you understand virtual memory more deeply, consider becoming a paid subscriber. Your support makes it possible for me to keep publishing long-form systems deep dives like this.</em></p></div></div></div><div><hr></div><blockquote><p><strong>Alloca</strong>: &#8220;I see how the bits map to the levels. But who actually performs this translation? On every memory access, something has to look up these tables.&#8221;</p><p><strong>Kernel</strong>: &#8220;A dedicated piece of hardware called the <em>Memory Management Unit</em>, or MMU. It intercepts every address you issue. 
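The canonical-address rule from the aside above is easy to check in code. A sketch (`is_canonical` is an illustrative helper; it relies on arithmetic right shift of signed integers, as implemented by GCC and Clang):

```c
#include <stdint.h>

/* A 48-bit virtual address is canonical iff bits 63..48 are all copies
 * of bit 47 (sign-extension). Shift bit 47 up to the sign position,
 * sign-extend back down, and compare with the original. */
static inline int is_canonical(uint64_t va) {
    int64_t extended = (int64_t)(va << 16) >> 16;
    return (uint64_t)extended == va;
}
```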
You never see any of this; to you it appears as if you are reading directly from your virtual address.&#8221;</p><p><strong>Alloca</strong>: &#8220;So the MMU does this lookup automatically on every memory access? How does it know where to start?&#8221;</p><p><strong>Kernel</strong>: &#8220;The CPU has a register called <code>CR3</code> that holds the physical address of your current PGD, the top-level table. I update it on every context switch so the MMU knows which process&#8217;s tables to use.&#8221;</p><p><strong>Alloca</strong>: &#8220;And then it uses the bits from my address to walk through the levels?&#8221;</p><p><strong>Kernel</strong>: &#8220;Yeah, the same bit fields we just covered. Bits [47:39] index into the PGD, [38:30] into the PUD, [29:21] into the PMD, and [20:12] into the PTE. That last entry gives the physical frame number, which the MMU combines with the 12-bit page offset to produce the physical address.&#8221;</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!I4r0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd42224f-edfd-4324-ab31-b26a8e3a813b_4800x760.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!I4r0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd42224f-edfd-4324-ab31-b26a8e3a813b_4800x760.png 424w, https://substackcdn.com/image/fetch/$s_!I4r0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd42224f-edfd-4324-ab31-b26a8e3a813b_4800x760.png 848w, 
https://substackcdn.com/image/fetch/$s_!I4r0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd42224f-edfd-4324-ab31-b26a8e3a813b_4800x760.png 1272w, https://substackcdn.com/image/fetch/$s_!I4r0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd42224f-edfd-4324-ab31-b26a8e3a813b_4800x760.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!I4r0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd42224f-edfd-4324-ab31-b26a8e3a813b_4800x760.png" width="1456" height="231" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd42224f-edfd-4324-ab31-b26a8e3a813b_4800x760.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:231,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 3: The four-level page table walk on x86-64. The CPU register CR3 holds the physical address of the top-level table (PGD). Each level is indexed by 9 bits of the virtual address. The TLB caches completed walks; the four-level traversal only occurs on a TLB miss. How often that happens depends heavily on access patterns&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 3: The four-level page table walk on x86-64. The CPU register CR3 holds the physical address of the top-level table (PGD). Each level is indexed by 9 bits of the virtual address. The TLB caches completed walks; the four-level traversal only occurs on a TLB miss. 
How often that happens depends heavily on access patterns" title="Figure 3: The four-level page table walk on x86-64. The CPU register CR3 holds the physical address of the top-level table (PGD). Each level is indexed by 9 bits of the virtual address. The TLB caches completed walks; the four-level traversal only occurs on a TLB miss. How often that happens depends heavily on access patterns" srcset="https://substackcdn.com/image/fetch/$s_!I4r0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd42224f-edfd-4324-ab31-b26a8e3a813b_4800x760.png 424w, https://substackcdn.com/image/fetch/$s_!I4r0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd42224f-edfd-4324-ab31-b26a8e3a813b_4800x760.png 848w, https://substackcdn.com/image/fetch/$s_!I4r0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd42224f-edfd-4324-ab31-b26a8e3a813b_4800x760.png 1272w, https://substackcdn.com/image/fetch/$s_!I4r0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd42224f-edfd-4324-ab31-b26a8e3a813b_4800x760.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><em>Figure 3: The four-level page table walk on x86-64. The CPU register CR3 holds the physical address of the top-level table (PGD). Each level is indexed by 9 bits of the virtual address. The TLB caches completed walks; the four-level traversal only occurs on a TLB miss. How often that happens depends heavily on access patterns.</em></figcaption></figure></div><blockquote><p><strong>Alloca</strong>: &#8220;But this means every memory access now requires four table lookups. That&#8217;s four extra memory reads just to translate my address. 
Doesn&#8217;t that make every memory access slower than it should be?&#8221;</p><p><strong>Kernel</strong>: &#8220;It would be, if we had to walk all four levels every time. But the MMU has a small, dedicated hardware cache called the <em>Translation Lookaside Buffer</em>, or TLB. Every time a page table walk completes successfully, the result is stored in the TLB: &#8216;virtual page P maps to physical frame F.&#8217; The next time you access the same page, the MMU checks the TLB first. If it&#8217;s there (a <em>TLB hit</em>), the translation completes in a handful of cycles, with no table walking at all.&#8221;</p><p><strong>Alloca</strong>: &#8220;And how often does that happen?&#8221;</p><p><strong>Kernel</strong>: &#8220;Programs that reuse the same memory regions repeatedly (tight loops, frequently executed functions, reused buffers) tend to stay within a small <em>working set</em> of pages, keeping the TLB warm and page walks rare. But that is not a given. Access patterns matter a great deal.&#8221;</p></blockquote><div class="callout-block" data-callout="true"><p style="text-align: center;"><strong>Aside: Working set</strong></p><p>A process&#8217;s <em>working set</em> is the subset of its virtual pages that are actively needed during a given window of execution. It&#8217;s not a fixed quantity; it shifts as the program moves through different phases. A tight loop over a small array has a tiny working set: just the pages holding the loop instructions and the array. A database engine scanning a large table has a much larger one.</p><p>The working set matters for two hardware structures:</p><ul><li><p><strong>TLB</strong>: If the working set fits within the TLB&#8217;s capacity (typically a few hundred to a few thousand entries), translations stay cached and page walks are rare. 
If the working set exceeds TLB capacity, TLB misses become frequent, and the resulting page table walks can noticeably hurt performance.</p></li><li><p><strong>Physical RAM</strong>: If the working set fits in RAM, pages stay resident. If it doesn&#8217;t, the kernel must evict pages to swap and reload them on demand, which is a far more expensive operation (we cover eviction and swap later in the article).</p></li></ul><p>Keeping the working set small and stable is one of the most effective things a program can do to improve memory performance.</p></div><div class="callout-block" data-callout="true"><h3><strong>Key Takeaway</strong></h3><p>Virtual memory operates at the granularity of pages (4 KB chunks of virtual address space) that map to frames (4 KB chunks of physical memory). Each virtual address encodes two pieces of information: the virtual page number (upper bits) and the page offset (lower 12 bits). The offset stays the same during translation; only the page number is replaced by a frame number.</p><p>On x86-64, the kernel uses a four-level hierarchical page table to perform this mapping. The structure has four levels named <strong>PGD</strong> (Page Global Directory), <strong>PUD</strong> (Page Upper Directory), <strong>PMD</strong> (Page Middle Directory), and <strong>PTE</strong> (Page Table Entry). A 48-bit virtual address is divided into four 9-bit index fields (one per level) plus a 12-bit offset, as shown in Figure 2. The hierarchy is sparse: only the portions of the address space actually in use require allocated page table structures, avoiding the 256 GiB overhead of a flat table.</p><p>Because each virtual page is mapped independently, there is no requirement that consecutive virtual pages land in consecutive physical frames. A process&#8217;s pages can be scattered anywhere in physical RAM, interleaved with frames from other processes, yet the process always sees a clean, contiguous address space. 
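The bit-field split described above can be sketched in a few lines (Python for brevity; split_vaddr is an illustrative helper, not a kernel function):

```python
PAGE_SHIFT = 12   # 4 KiB pages -> 12 offset bits
INDEX_MASK = 0x1FF  # 9 bits per level -> 512 entries per table

def split_vaddr(vaddr: int):
    """Split a 48-bit virtual address into (pgd, pud, pmd, pte, offset),
    mirroring the fields [47:39], [38:30], [29:21], [20:12], [11:0]."""
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    pte = (vaddr >> 12) & INDEX_MASK
    pmd = (vaddr >> 21) & INDEX_MASK
    pud = (vaddr >> 30) & INDEX_MASK
    pgd = (vaddr >> 39) & INDEX_MASK
    return pgd, pud, pmd, pte, offset

# An address built from known indices decomposes back into them:
vaddr = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0xABC
assert split_vaddr(vaddr) == (1, 2, 3, 4, 0xABC)
```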
Figure 4 shows this concretely.</p><p>The Memory Management Unit (MMU) performs address translation in hardware. On x86, the register <code>CR3</code> holds the physical address of the current process&#8217;s PGD. On every memory access, the MMU first checks the translation lookaside buffer (TLB) to see if the translation is already cached. If not, the MMU performs a full page table walk to do the translation and then caches the translation in the TLB.</p></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DKzg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd47520fa-e00b-4e01-a62c-8ceb89f54fa0_2480x1760.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DKzg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd47520fa-e00b-4e01-a62c-8ceb89f54fa0_2480x1760.png 424w, https://substackcdn.com/image/fetch/$s_!DKzg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd47520fa-e00b-4e01-a62c-8ceb89f54fa0_2480x1760.png 848w, https://substackcdn.com/image/fetch/$s_!DKzg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd47520fa-e00b-4e01-a62c-8ceb89f54fa0_2480x1760.png 1272w, https://substackcdn.com/image/fetch/$s_!DKzg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd47520fa-e00b-4e01-a62c-8ceb89f54fa0_2480x1760.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!DKzg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd47520fa-e00b-4e01-a62c-8ceb89f54fa0_2480x1760.png" width="1456" height="1033" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d47520fa-e00b-4e01-a62c-8ceb89f54fa0_2480x1760.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1033,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 4: Each process has its own virtual address space, but the page table maps virtual pages to physical frames that may be anywhere in RAM. Adjacent virtual pages can land in widely separated frames, and frames from multiple processes are interleaved in physical memory. The page table is what makes this invisible to the process.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 4: Each process has its own virtual address space, but the page table maps virtual pages to physical frames that may be anywhere in RAM. Adjacent virtual pages can land in widely separated frames, and frames from multiple processes are interleaved in physical memory. The page table is what makes this invisible to the process." title="Figure 4: Each process has its own virtual address space, but the page table maps virtual pages to physical frames that may be anywhere in RAM. Adjacent virtual pages can land in widely separated frames, and frames from multiple processes are interleaved in physical memory. The page table is what makes this invisible to the process." 
srcset="https://substackcdn.com/image/fetch/$s_!DKzg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd47520fa-e00b-4e01-a62c-8ceb89f54fa0_2480x1760.png 424w, https://substackcdn.com/image/fetch/$s_!DKzg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd47520fa-e00b-4e01-a62c-8ceb89f54fa0_2480x1760.png 848w, https://substackcdn.com/image/fetch/$s_!DKzg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd47520fa-e00b-4e01-a62c-8ceb89f54fa0_2480x1760.png 1272w, https://substackcdn.com/image/fetch/$s_!DKzg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd47520fa-e00b-4e01-a62c-8ceb89f54fa0_2480x1760.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><em>Figure 4: Each process has its own virtual address space, but the page table maps virtual pages to physical frames that may be anywhere in RAM. Adjacent virtual pages can land in widely separated frames, and frames from multiple processes are interleaved in physical memory. The page table is what makes this invisible to the process.</em></figcaption></figure></div><div><hr></div><h2>Memory Protection via Permission Bits</h2><blockquote><p><strong>Alloca</strong>: &#8220;Earlier you told me that my code segment is read-only. I can execute it but not write to it. But now that I understand the page table, I don&#8217;t see what actually enforces that. My code pages have entries in the page table just like everything else. What stops me from writing to them?&#8221;</p><p><strong>Kernel</strong>: &#8220;Each page table entry carries more than just the frame number. It also holds permission bits. The <em>writable</em> bit says whether you can write to that page; if it is 0, the MMU refuses the write and <em>faults</em>: it stops the access mid-flight and signals me to handle the situation. The <em>executable</em> bit says whether you can run code from it. When I set up your code segment I mark those pages as executable but not writable. Your data and heap are writable but not executable. The MMU checks these bits on every access.&#8221;</p><p><strong>Alloca</strong>: &#8220;What happens when it faults? Say I try to write to one of my code pages?&#8221;</p><p><strong>Kernel</strong>: &#8220;I get called to handle it. 
A permission violation is almost always a bug or a security attack, so I typically terminate you.&#8221;</p><p><strong>Alloca</strong>: &#8220;Got it. Are there other kinds of bits apart from permission bits?&#8221;</p><p><strong>Kernel</strong>: &#8220;Yes, a very important one that you should know about. There is a <em>present</em> bit in every entry, at every level of the hierarchy. If it is 0, the walk stops there and the CPU faults. But a not-present entry doesn&#8217;t necessarily mean something went wrong. It might just mean that I haven&#8217;t allocated a physical frame for that page yet, or that the page has been evicted to disk.&#8221;</p><p><strong>Alloca</strong>: &#8220;So the permission bits enforce boundaries between code and data, and the present bit tells you whether a page is backed by physical memory at all.&#8221;</p><p><strong>Kernel</strong>: &#8220;Exactly!&#8221;</p></blockquote><div class="callout-block" data-callout="true"><h3><strong>Key Takeaway</strong></h3><p>Each page table entry contains not just a frame number but also several permission bits that the MMU enforces on every memory access:</p><ul><li><p><strong>Present bit</strong>: Indicates whether the page is currently backed by a physical frame. If 0, the page table walk stops and the CPU raises a page fault. A not-present page doesn&#8217;t always signal an error; it might mean the kernel has promised the address range but hasn&#8217;t yet allocated physical memory for it (demand paging, covered in the next section). It might also mean that the physical frame was swapped to disk and reused by another process.</p></li><li><p><strong>Writable bit</strong>: Controls write permission. If 0, any write attempt triggers a fault. Used to make code pages read-only and to implement copy-on-write (covered later).</p></li><li><p><strong>Executable bit (or NX/XD bit)</strong>: Controls execution permission. 
If the page is marked non-executable, the processor refuses to fetch instructions from it. Code pages are marked executable; data, heap, and stack pages are marked non-executable to prevent code injection attacks.</p></li></ul><p>The MMU checks these permission bits on every memory access, <em>before</em> the access completes. Permission violations typically indicate bugs or security attacks and usually result in the kernel terminating the faulting process. This hardware-enforced separation between code and data is a foundational defense against many classes of exploits.</p></div><div><hr></div><h2>Demand Paging</h2><p><em>Some time passes. Alloca has been running her code and has grown more comfortable in this world. But now she needs more memory: she is about to process a large dataset and needs space to store intermediate results.</em></p><p><em>She does what any process would do: she makes a system call asking for memory. A new region appears in her address space. Kernel hands her an address: </em><code>0x55a3c2f00000</code><em>. She immediately goes to write her first value there.</em></p><p><em>And then something strange happens. Time seems to stop for a fraction of a moment. And then it starts again, as if nothing had occurred. Her write went through. But something had happened; she had simply not noticed.</em></p><blockquote><p><strong>Alloca</strong>: &#8220;That was odd. Did I just&#8230; stutter?&#8221;</p><p><strong>Kernel</strong>: &#8220;You did. You triggered a <em>page fault</em>. Don&#8217;t worry, I took care of it.&#8221;</p><p><strong>Alloca</strong>: &#8220;A page fault? What&#8217;s that? And what did you take care of?&#8221;</p><p><strong>Kernel</strong>: &#8220;When I gave you that address, I didn&#8217;t actually back it with physical memory. 
I recorded the promise that this range of virtual addresses belongs to you, but I didn&#8217;t go and find a physical frame to put behind it.&#8221;</p><p><strong>Alloca</strong>: &#8220;You gave me an address without any memory behind it? That sounds like fraud.&#8221;</p><p><strong>Kernel</strong>: &#8220;It&#8217;s efficiency. Think about it: you might ask for a hundred megabytes and only use ten. If I allocated a physical frame for every page you asked for, I&#8217;d waste a great deal of physical memory on pages that never get touched. So instead, I wait. When you actually try to access a page for the first time, the MMU looks up that address in your page table and finds the <em>present</em> bit set to zero. No physical frame is mapped. The MMU raises a trap (a page fault) and control transfers to me.&#8221;</p><p><strong>Alloca</strong>: &#8220;But how did you know my access was <em>valid</em>? Maybe I was accessing some address I had no right to. How do you tell the difference?&#8221;</p><p><strong>Kernel</strong>: &#8220;When I gave you that memory region, I recorded a note called a <em>virtual memory area</em>, or VMA. It says: &#8216;virtual addresses from X to Y are promised to Alloca, with these permissions.&#8217; The VMA is not a page table entry. It&#8217;s a higher-level record of intent that I maintain separately.&#8221;</p><p><strong>Alloca</strong>: &#8220;So you have two different data structures tracking my memory?&#8221;</p><p><strong>Kernel</strong>: &#8220;Yes. The VMA describes what address ranges are <em>valid</em> for you to access. The page table describes which of those valid pages are <em>currently backed</em> by physical frames. When you were created, I set up VMAs for your code segment, your data segment, your stack. Each one records an address range and what you&#8217;re allowed to do there: read, write, execute. Later, when you call <code>malloc</code> or <code>mmap</code>, I create a new VMA for that allocation. 
But I don&#8217;t immediately create page table entries for it.&#8221;</p><p><strong>Alloca</strong>: &#8220;So when the MMU finds a missing page table entry for an address, it triggers a page fault?&#8221;</p><p><strong>Kernel</strong>: &#8220;Yes. When a page fault fires, I have to handle it. I first check whether the faulting address falls inside a valid VMA. If yes, the access is legitimate. I just haven&#8217;t backed it with a physical frame yet. If the address is outside any VMA, you&#8217;ve wandered somewhere you were never given. That&#8217;s a segmentation fault, and I terminate you.&#8221;</p><p><strong>Alloca</strong>: &#8220;So the VMA list is your record of promises, and the page table is the record of fulfilments.&#8221;</p><p><strong>Kernel</strong>: &#8220;Well put. Now, once I confirm the fault is legitimate, I find a pre-zeroed physical frame, write a new entry into your page table pointing to that frame, and resume your execution. The CPU retries the faulting instruction and your write goes through.&#8221;</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9NHF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe888515e-cc1b-417e-9e9a-178a6f02b738_4480x720.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9NHF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe888515e-cc1b-417e-9e9a-178a6f02b738_4480x720.png 424w, https://substackcdn.com/image/fetch/$s_!9NHF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe888515e-cc1b-417e-9e9a-178a6f02b738_4480x720.png 848w, 
https://substackcdn.com/image/fetch/$s_!9NHF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe888515e-cc1b-417e-9e9a-178a6f02b738_4480x720.png 1272w, https://substackcdn.com/image/fetch/$s_!9NHF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe888515e-cc1b-417e-9e9a-178a6f02b738_4480x720.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9NHF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe888515e-cc1b-417e-9e9a-178a6f02b738_4480x720.png" width="1456" height="234" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e888515e-cc1b-417e-9e9a-178a6f02b738_4480x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:234,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 5: A page table entry before and after a demand paging fault. The kernel changes the present bit from 0 to 1 and fills in the physical frame number (PFN).&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 5: A page table entry before and after a demand paging fault. The kernel changes the present bit from 0 to 1 and fills in the physical frame number (PFN)." title="Figure 5: A page table entry before and after a demand paging fault. The kernel changes the present bit from 0 to 1 and fills in the physical frame number (PFN)." 
srcset="https://substackcdn.com/image/fetch/$s_!9NHF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe888515e-cc1b-417e-9e9a-178a6f02b738_4480x720.png 424w, https://substackcdn.com/image/fetch/$s_!9NHF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe888515e-cc1b-417e-9e9a-178a6f02b738_4480x720.png 848w, https://substackcdn.com/image/fetch/$s_!9NHF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe888515e-cc1b-417e-9e9a-178a6f02b738_4480x720.png 1272w, https://substackcdn.com/image/fetch/$s_!9NHF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe888515e-cc1b-417e-9e9a-178a6f02b738_4480x720.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><em>Figure 5: A page table entry before and after a demand paging fault. The kernel changes the present bit from 0 to 1 and fills in the physical frame number (PFN).</em></figcaption></figure></div><blockquote><p><strong>Alloca</strong>: &#8220;Wait. Why did you zero it out? Couldn&#8217;t you just give me the frame as-is?&#8221;</p><p><strong>Kernel</strong>: &#8220;Absolutely not. Physical frames get reused. That frame might have previously held data from another process. If I handed you that frame without clearing it first, you could read another process&#8217;s secrets just by reading uninitialized memory. The zero-fill guarantee is a security invariant: you will never see data you didn&#8217;t write yourself.&#8221;</p><p><strong>Alloca</strong>: &#8220;That&#8217;s reassuring. But what if there are no free frames? 
What if physical memory is full?&#8221;</p><p><strong>Kernel</strong>: &#8220;It happens more often than you&#8217;d expect, and dealing with it changes what the present bit in a PTE can mean.&#8221;</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kjzm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b7faa1-bff2-4f24-a1a3-e14e3a096681_2800x2120.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kjzm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b7faa1-bff2-4f24-a1a3-e14e3a096681_2800x2120.png 424w, https://substackcdn.com/image/fetch/$s_!kjzm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b7faa1-bff2-4f24-a1a3-e14e3a096681_2800x2120.png 848w, https://substackcdn.com/image/fetch/$s_!kjzm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b7faa1-bff2-4f24-a1a3-e14e3a096681_2800x2120.png 1272w, https://substackcdn.com/image/fetch/$s_!kjzm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b7faa1-bff2-4f24-a1a3-e14e3a096681_2800x2120.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kjzm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b7faa1-bff2-4f24-a1a3-e14e3a096681_2800x2120.png" width="1456" height="1102" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a0b7faa1-bff2-4f24-a1a3-e14e3a096681_2800x2120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1102,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 6: The demand paging lifecycle. Step 3 (checking the VMA) is what distinguishes a legitimate first access from an invalid access. Without a matching VMA, the kernel delivers a segmentation fault instead of allocating a frame.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 6: The demand paging lifecycle. Step 3 (checking the VMA) is what distinguishes a legitimate first access from an invalid access. Without a matching VMA, the kernel delivers a segmentation fault instead of allocating a frame." title="Figure 6: The demand paging lifecycle. Step 3 (checking the VMA) is what distinguishes a legitimate first access from an invalid access. Without a matching VMA, the kernel delivers a segmentation fault instead of allocating a frame." 
srcset="https://substackcdn.com/image/fetch/$s_!kjzm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b7faa1-bff2-4f24-a1a3-e14e3a096681_2800x2120.png 424w, https://substackcdn.com/image/fetch/$s_!kjzm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b7faa1-bff2-4f24-a1a3-e14e3a096681_2800x2120.png 848w, https://substackcdn.com/image/fetch/$s_!kjzm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b7faa1-bff2-4f24-a1a3-e14e3a096681_2800x2120.png 1272w, https://substackcdn.com/image/fetch/$s_!kjzm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b7faa1-bff2-4f24-a1a3-e14e3a096681_2800x2120.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><em>Figure 6: The demand paging lifecycle. Step 3 (checking the VMA) is what distinguishes a legitimate first access from an invalid access. Without a matching VMA, the kernel delivers a segmentation fault instead of allocating a frame.</em></figcaption></figure></div><div class="callout-block" data-callout="true"><p style="text-align: center;"><strong>Aside 1: How the stack grows using demand paging</strong></p><p>Remember, when talking about address space layout, we said the stack grows downward. That growth is demand-driven too. The kernel marks the stack VMA as growable, but it does not map every possible stack page upfront. When the stack pointer moves into the next valid page below the current stack, the access faults. Because the faulting address is just below the current stack bottom and the stack VMA is marked as growable, the kernel extends the VMA downward by one page, allocates a frame, and resumes execution. From Alloca&#8217;s perspective the stack just grew silently.</p><p>Two mechanisms prevent this from continuing forever. First, the kernel enforces a maximum stack size (on Linux, set by <code>ulimit -s</code>, defaulting to 8 MB). The stack VMA will not be extended past that limit. Second, below the maximum stack limit sits a <em>guard page</em>: a single page that is deliberately left unmapped; no VMA covers it. If the stack pointer jumps far enough to land in or past the guard page (due to deep recursion, a large stack-allocated array, or a corrupted stack pointer), the fault finds no covering VMA. 
The kernel treats that as an invalid access and delivers SIGSEGV.</p><p>The guard page is what turns a silent runaway stack into a detectable crash. Without it, the stack could silently overflow into the memory-mapped region below it and corrupt library or heap data before anything notices.</p><div><hr></div><p style="text-align: center;"><strong>Aside 2: Memory overcommit: a consequence of demand paging</strong></p><p>Demand paging creates an interesting situation: if the kernel only allocates physical frames at first-access time, then <code>malloc(10GB)</code> on a machine with 4 GB of RAM will succeed (at least initially). The kernel records the promise in a VMA and returns immediately. No frames are allocated. This is called <em>overcommitting</em> memory: the total size of all VMAs across all running processes can far exceed the amount of physical RAM plus swap.</p><p>The kernel&#8217;s bet is statistical. In practice, most allocated memory is never fully touched. A process might allocate a large buffer &#8220;just in case&#8221; and only ever write to a fraction of it. A JVM might reserve a large heap up front but populate it lazily. Across hundreds of processes, the working sets sum to much less than the total committed virtual memory, and the system runs fine.</p><p>The bet occasionally goes wrong. When too many processes start faulting in pages simultaneously, memory pressure spikes, and the kernel runs out of physical frames. 
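The gap between committed and touched memory is easy to observe from user space. Here is a small, illustrative Python sketch (not from the article): it maps a 64 MB anonymous region, then writes one byte per page, and watches the process's minor-fault counter (<code>ru_minflt</code>) jump only at touch time, not at allocation time. Exact counts vary; with transparent huge pages enabled, one fault can populate a 2 MB range at once.

```python
import mmap
import resource

PAGE = resource.getpagesize()          # typically 4096 on x86-64
NPAGES = 16_384                        # 64 MB of virtual address space

def minor_faults():
    # Cumulative minor (no-disk-I/O) page faults for this process
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt

before_alloc = minor_faults()
# Private anonymous mapping, like malloc's backing store:
# a VMA is created, but no physical frames yet.
buf = mmap.mmap(-1, NPAGES * PAGE,
                flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
after_alloc = minor_faults()

for i in range(NPAGES):
    buf[i * PAGE] = 1                  # first write to each page faults it in
after_touch = minor_faults()

alloc_faults = after_alloc - before_alloc
touch_faults = after_touch - after_alloc
print(f"faults at mmap time: {alloc_faults}, while touching: {touch_faults}")
buf.close()
```

The mapping itself causes essentially no faults; nearly all of them happen during the touch loop, which is demand paging made visible.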
At this point it invokes the <strong>OOM killer</strong> (Out-Of-Memory killer): a kernel subsystem that scores each process by its memory consumption, age, and other heuristics, then kills the highest-scoring one to reclaim its frames.</p><p>You can observe overcommit and OOM events on Linux:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;shell&quot;,&quot;nodeId&quot;:&quot;c950bd7f-5514-4fa1-90a4-50784940847e&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-shell"># How much virtual memory is committed system-wide (in kB)

grep CommitLimit /proc/meminfo   # kernel&#8217;s ceiling: overcommit_ratio &#215; RAM + swap

grep Committed_AS /proc/meminfo  # total virtual memory promised to all processes

# See if the OOM killer has fired recently

dmesg | grep -i 'oom\|killed process'

journalctl -k | grep -i oom

</code></pre></div><p>The kernel&#8217;s overcommit policy is tunable via <code>/proc/sys/vm/overcommit_memory</code>: </p><ul><li><p><code>0</code> (default) uses a heuristic: obviously excessive allocations are refused, everything else is allowed</p></li><li><p><code>1</code> always allows any allocation</p></li><li><p><code>2</code> caps total committed memory at <code>overcommit_ratio &#215; RAM + swap</code> and refuses any allocation that would push the committed total past that cap. </p></li></ul></div><div class="callout-block" data-callout="true"><h3><strong>Key Takeaway</strong></h3><p>When a process allocates memory, whether by calling <code>malloc</code>, growing its stack, or explicitly requesting memory via <code>mmap</code>, the kernel does not immediately back every page of that allocation with a physical frame. Instead, it creates a <em>Virtual Memory Area</em> (VMA) in the process&#8217;s memory descriptor: a record that says &#8220;this range of virtual addresses is valid and belongs to this process, with these permissions.&#8221; The page table entries for these pages are left absent (present bit = 0).</p><p>The VMA and the page table serve different roles:</p><ul><li><p>The <strong>VMA</strong> is the kernel&#8217;s record of <em>intent</em>: what address ranges the process is allowed to access.</p></li><li><p>The <strong>page table</strong> is the record of <em>reality</em>: which virtual pages are currently backed by physical frames.</p></li></ul><p>The first time the process reads or writes any address in an allocated-but-unmapped range, the MMU finds a page table entry with present=0 and raises a <em>page fault</em>, a CPU exception that transfers control to the kernel. The kernel&#8217;s page fault handler:</p><ol><li><p>Looks up which VMA contains the faulting address. If none, the access is invalid and the kernel delivers a segmentation fault, terminating the process. 
Otherwise, it continues:</p></li><li><p>Allocates a free physical frame.</p></li><li><p>Zero-fills that frame (the zero-fill guarantee, required for security, ensures the process never sees data from a previous owner of that frame).</p></li><li><p>Installs a new page table entry pointing to that frame, with the present bit set.</p></li><li><p>Returns from the exception, causing the CPU to retry the faulting instruction.</p></li></ol><p>From the process&#8217;s perspective, execution pauses for a few microseconds and then continues as if nothing happened. This mechanism is called <em>demand paging</em>: physical memory is allocated <em>on demand</em>, at the moment of first access, rather than speculatively at allocation time.</p><p>The fault described above requires no disk I/O: it is called a <em>minor page fault</em>. Minor faults cover any fault the kernel can resolve entirely in memory. This includes zero-fill for pages that aren&#8217;t backed by any file, but also cases where the data is already resident somewhere (in the page cache, or shared from another process) and just needs a PTE installed. There is a second kind of fault, the <em>major fault</em>, which does require reading from disk. We will get to that next.</p><p>A side effect of demand paging is that physical frames are allocated one by one, on demand, from wherever free memory happens to be. There is no requirement that consecutive virtual pages land in consecutive physical frames. A process&#8217;s stack might occupy frames scattered across RAM, interleaved with frames belonging to completely different processes. The page table is what makes this invisible: it maps each virtual page independently, so the process always sees a clean, contiguous virtual address space regardless of where its frames physically reside.</p></div><div><hr></div><div class="pullquote"><p><strong>Prefer reading this as a polished PDF? 
I&#8217;ve prepared a beautifully typeset PDF version for offline reading and reference. Buying it is another way to support the time that went into this article.</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://codingconfessions.gumroad.com/l/iuqsqm&quot;,&quot;text&quot;:&quot;Get PDF&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://codingconfessions.gumroad.com/l/iuqsqm"><span>Get PDF</span></a></p></div><h2>When Physical Memory Runs Out: Swap and the Dual Meaning of the Present Bit</h2><blockquote><p><strong>Alloca</strong>: &#8220;So what happens when there is not enough free physical memory left to allocate?&#8221;</p><p><strong>Kernel</strong>: &#8220;Let me show you. Let&#8217;s say that I need to allocate a frame for you, but they are all taken. So I must evict a page from somewhere. I look for a page that hasn&#8217;t been accessed recently. It could be from another process, or even one of your own pages. Once I find the page to evict, I write its contents to a reserved area of the disk called <em>swap space</em>. Then I reclaim the frame and give it to you.&#8221;</p><p><strong>Alloca</strong>: &#8220;And what happens if the process that owned that page tries to access it again?&#8221;</p><p><strong>Kernel</strong>: &#8220;Before I give that frame to you, I update the process&#8217;s page table. I locate the PTE that points to that frame, clear its present bit to 0, and store the swap location in the remaining bits of the entry. The hardware never looks at those bits when present is 0, but I do when handling the page fault.&#8221;</p><p><strong>Alloca</strong>: &#8220;So when that process touches the page again&#8230;&#8221;</p><p><strong>Kernel</strong>: &#8220;The MMU sees that the present bit is zero in the PTE and raises a page fault, bringing me in to handle it. 
My fault handler follows the same entry point as always: check the VMA first. In this case, because the page was swapped out, its VMA must still exist, so the fault handler moves on and checks the PTE next. It finds the swap coordinates in the non-present bits, uses them to read the data from disk, and loads it into a fresh frame. After that, it reinstalls the PTE with present=1. Once the page fault handler finishes, I resume the process; it retries the instruction that triggered the fault, and this time it succeeds. It never knew the page had left.&#8221;</p></blockquote><div class="callout-block" data-callout="true"><p style="text-align: center;"><strong>Aside: Minor vs Major Page Fault</strong></p><p>Earlier, in the demand paging section, we talked about minor page faults. Those kinds of page faults don&#8217;t involve disk I/O and are handled entirely in memory. For example, when <code>malloc</code> allocates more pages, the kernel simply creates the VMA and allocates the physical frames on demand as page faults occur.</p><p>The page fault discussed above, where a process tries to access a page that has been swapped out to disk, is a major page fault because handling it requires disk I/O.</p></div><blockquote><p><strong>Alloca</strong>: &#8220;So present=0 in a PTE always means that the data is in the swap?&#8221;</p><p><strong>Kernel</strong>: &#8220;No. Swap is one destination, but it&#8217;s not the only one. A non-present PTE can point to data that lives somewhere other than swap space.&#8221;</p><p><strong>Alloca</strong>: &#8220;Where else can it go besides swap?&#8221;</p><p><strong>Kernel</strong>: &#8220;A file. Not every page comes from memory you allocated with <code>malloc</code> or grew from the stack. Some pages map directly to content stored in a file on disk.&#8221;</p><p><strong>Alloca</strong>: &#8220;How does that work?&#8221;</p><p><strong>Kernel</strong>: &#8220;You use the <code>mmap</code> system call. 
It lets you map a file into your address space. When you do that, I create VMAs for the mapped range, but I leave the PTEs absent, just like with <code>malloc</code>.&#8221;</p><p><strong>Alloca</strong>: &#8220;So on first access?&#8221;</p><p><strong>Kernel</strong>: &#8220;Once again the MMU sees an absent PTE and raises a page fault. But I handle this fault differently from a fault on a swapped-out page, or from the fresh allocation we discussed earlier under demand paging.&#8221;</p><p><strong>Alloca</strong>: &#8220;What changes?&#8221;</p><p><strong>Kernel</strong>: &#8220;The first step is the same: I check the VMA to confirm this is a valid region. But what happens next depends on the type of mapping.&#8221;</p><p><strong>Alloca</strong>: &#8220;What&#8217;s different about a file-backed mapping?&#8221;</p><p><strong>Kernel</strong>: &#8220;For anonymous mappings, a fault means either a fresh allocation, where I hand you a new zero-filled frame, or a swap restore, where I read the page back from disk using the swap coordinates stored in the PTE. For file-backed mappings, there is no swap entry. Instead, the VMA itself tells me which file and which block of that file to read. I load that block into a frame, install it in the page table, and resume you.&#8221;</p><p><strong>Alloca</strong>: &#8220;So at the PTE level, present=0 is just a signal: data is not in RAM. But the place to find it depends on what kind of mapping this is?&#8221;</p><p><strong>Kernel</strong>: &#8220;Precisely. For anonymous memory pages that have been swapped out, the non-present PTE can carry swap coordinates. For a file mapping that has not been loaded yet, I usually use the VMA to find the file and offset. 
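The file-backed path the Kernel describes can be exercised in a few lines. A minimal, illustrative Python sketch (the scratch file and its contents are made up for the example): at <code>mmap</code> time only a VMA recording the file and offset exists; the first read of the mapping is what faults the file's data into a frame, via the page cache.

```python
import mmap
import os
import tempfile

# Write one page of recognizable data to a scratch file, then map it.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"A" * 4096)
    os.fsync(fd)

    with mmap.mmap(fd, 4096, prot=mmap.PROT_READ) as m:
        # No PTEs exist yet for this range. The access below faults:
        # the kernel reads the file block into a frame (through the
        # page cache), installs the PTE, and retries the load.
        first_byte = m[0]
finally:
    os.close(fd)
    os.unlink(path)

print(chr(first_byte))  # "A", straight from the mapped file
```

The program never calls <code>read()</code> on the file; the page fault handler performs the I/O on its behalf.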
Either way, the fault handler has enough information to reconstruct the page.&#8221;</p></blockquote><div class="callout-block" data-callout="true"><h3><strong>Key Takeaway</strong></h3><p>When physical memory runs out, the kernel must reclaim frames. It selects pages that have not been accessed recently and evicts them. For anonymous pages (heap, stack, <code>malloc</code>), there is no file to fall back on, so the kernel writes the page&#8217;s contents to <em>swap space</em> on disk before freeing the frame. It then updates the PTE: the present bit is cleared to 0, and the remaining bits are repurposed to store swap coordinates (device number and page offset). These bits are ignored by the hardware; they exist solely as a private record for the kernel&#8217;s own fault handler.</p><p>When the evicted page is next accessed, the MMU finds present=0 and raises a <em>major page fault</em>. The fault handler reads the swap coordinates from the PTE, loads the page from disk into a fresh frame, reinstalls the PTE with present=1, and resumes the process.</p><p>However, a page fault for a file-backed mapping is handled slightly differently. Here, the VMA contains information about the file and the offset in the file needed to populate the frame.</p><p>Together, anonymous and file-backed mappings cover all the cases a fault handler encounters. Two questions decide which path it takes:</p><ol><li><p><strong>What type of mapping is this?</strong> Anonymous memory has no file behind it. 
File-backed memory does.</p></li><li><p><strong>Why is the page absent?</strong> Is this a first-access fault (i.e., the frame was never allocated), or was the page evicted due to memory pressure and is now being accessed again?</p></li></ol><p>Figure 7 below shows all four combinations and how the fault handler resolves each.</p></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!V9pU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb988a9de-4e28-407f-8e87-5ecd4a70f08c_2800x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!V9pU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb988a9de-4e28-407f-8e87-5ecd4a70f08c_2800x1200.png 424w, https://substackcdn.com/image/fetch/$s_!V9pU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb988a9de-4e28-407f-8e87-5ecd4a70f08c_2800x1200.png 848w, https://substackcdn.com/image/fetch/$s_!V9pU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb988a9de-4e28-407f-8e87-5ecd4a70f08c_2800x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!V9pU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb988a9de-4e28-407f-8e87-5ecd4a70f08c_2800x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!V9pU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb988a9de-4e28-407f-8e87-5ecd4a70f08c_2800x1200.png" width="1456" height="624" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b988a9de-4e28-407f-8e87-5ecd4a70f08c_2800x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:624,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 7: The four paths the kernel takes when resolving a page fault, organized by mapping type (rows) and reason for absence (columns). An anonymous first-access fault is the only minor fault, the kernel zero-fills a fresh frame with no disk I/O. All other cases require reading from swap or from a file and are major faults. For first-access faults (left column), no page table entries may exist yet, and the fault handler allocates the intermediate levels (PGD, PUD, PMD) and the PTE on demand. For evicted or dropped pages (right column), the intermediate levels already exist from when the page was first loaded; only the PTE was updated when the page left RAM.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 7: The four paths the kernel takes when resolving a page fault, organized by mapping type (rows) and reason for absence (columns). An anonymous first-access fault is the only minor fault, the kernel zero-fills a fresh frame with no disk I/O. All other cases require reading from swap or from a file and are major faults. For first-access faults (left column), no page table entries may exist yet, and the fault handler allocates the intermediate levels (PGD, PUD, PMD) and the PTE on demand. For evicted or dropped pages (right column), the intermediate levels already exist from when the page was first loaded; only the PTE was updated when the page left RAM." 
title="Figure 7: The four paths the kernel takes when resolving a page fault, organized by mapping type (rows) and reason for absence (columns). An anonymous first-access fault is the only minor fault, the kernel zero-fills a fresh frame with no disk I/O. All other cases require reading from swap or from a file and are major faults. For first-access faults (left column), no page table entries may exist yet, and the fault handler allocates the intermediate levels (PGD, PUD, PMD) and the PTE on demand. For evicted or dropped pages (right column), the intermediate levels already exist from when the page was first loaded; only the PTE was updated when the page left RAM." srcset="https://substackcdn.com/image/fetch/$s_!V9pU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb988a9de-4e28-407f-8e87-5ecd4a70f08c_2800x1200.png 424w, https://substackcdn.com/image/fetch/$s_!V9pU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb988a9de-4e28-407f-8e87-5ecd4a70f08c_2800x1200.png 848w, https://substackcdn.com/image/fetch/$s_!V9pU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb988a9de-4e28-407f-8e87-5ecd4a70f08c_2800x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!V9pU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb988a9de-4e28-407f-8e87-5ecd4a70f08c_2800x1200.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Figure 7: The four paths the kernel takes when resolving a page fault, organized by mapping type (rows) and reason for absence (columns). An anonymous first-access fault is the only minor fault, the kernel zero-fills a fresh frame with no disk I/O. All other cases require reading from swap or from a file and are major faults. For first-access faults (left column), no page table entries may exist yet, and the fault handler allocates the intermediate levels (PGD, PUD, PMD) and the PTE on demand. For evicted or dropped pages (right column), the intermediate levels already exist from when the page was first loaded; only the PTE was updated when the page left RAM.</em></figcaption></figure></div><div><hr></div><div class="callout-block" data-callout="true"><p style="text-align: center;"><strong>Aside: Pinned memory and GPU data transfers</strong></p><p>Everything discussed so far assumes the kernel is free to evict any page when memory pressure demands it. 
There are cases where that is unacceptable. <em>Pinned memory</em> (also called <em>page-locked</em> memory) is memory that the kernel is prohibited from swapping out. A process can pin a region by calling <a href="https://man7.org/linux/man-pages/man2/mlock.2.html">mlock()</a>, after which the kernel guarantees that the underlying physical frames will not be reclaimed or swapped out for as long as the lock is held.</p><p>The most common reason to pin memory today is GPU data transfers. DMA (Direct Memory Access) engines, which move data between host RAM and GPU memory without CPU involvement, require that the source or destination buffer remain at a fixed physical address for the duration of the transfer. If the kernel were to evict a page mid-transfer and reassign the frame, the DMA engine would read or write the wrong physical location. Pinning prevents this by keeping the frame in place.</p><p>This is why <a href="https://docs.pytorch.org/tutorials/intermediate/pinmem_nonblock.html">AI training frameworks pin host memory for input batches</a>. In PyTorch, <a href="https://docs.pytorch.org/docs/2.11/generated/torch.Tensor.pin_memory.html">tensor.pin_memory()</a> and the <code>pin_memory=True</code> option on <a href="https://docs.pytorch.org/docs/2.11/data.html">DataLoader</a> allocate page-locked host memory through the CUDA driver, which pins pages much as <code>mlock()</code> does. With pinned buffers, the CUDA driver can initiate DMA transfers directly from host RAM to GPU memory without an intermediate copy, and it can overlap those transfers with GPU computation. On large models trained over high-bandwidth interconnects (NVLink, PCIe 5.0), this overlap between data loading and compute is a significant contributor to overall throughput.</p><p>The trade-off is that pinned memory is a scarce resource. 
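Pinning itself is just a system call away. A rough user-space illustration in Python (error handling elided; the <code>locked_kb</code> helper is ours, and the call can fail under a restrictive <code>RLIMIT_MEMLOCK</code>): locking a page via libc's <code>mlock()</code> makes it appear in the process's <code>VmLck</code> counter in <code>/proc/self/status</code>, marking it exempt from reclaim.

```python
import ctypes
import mmap
import resource

libc = ctypes.CDLL(None, use_errno=True)   # libc symbols via the running process
PAGE = resource.getpagesize()

buf = mmap.mmap(-1, PAGE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))  # page-aligned address

def locked_kb():
    # VmLck in /proc/self/status reports currently mlocked memory, in kB
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmLck:"):
                return int(line.split()[1])
    return 0

# mlock(2) faults the page in and forbids the kernel from evicting it
rc = libc.mlock(ctypes.c_void_p(addr), ctypes.c_size_t(PAGE))
locked = locked_kb()                       # snapshot while the lock is held

if rc == 0:
    libc.munlock(ctypes.c_void_p(addr), ctypes.c_size_t(PAGE))

print(f"mlock rc={rc}, VmLck while locked: {locked} kB")
```

Note that <code>mlock()</code> also faults the page in eagerly, so a locked region never takes a major fault afterwards.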
Because pinned pages cannot be reclaimed, overusing pinning reduces the memory available for the page cache and other processes, increasing the risk of swap pressure elsewhere.</p></div><div><hr></div><h2>Copy-on-Write and Fork</h2><p><em>Alloca has been given a big job: processing a large dataset. She needs help to finish it quickly.</em></p><blockquote><p><strong>Alloca</strong>: &#8220;I wish I had a copy of me that could share this workload.&#8221;</p><p><strong>Kernel</strong>: &#8220;You can do that; just use the <code>fork()</code> system call.&#8221;</p><p><strong>Alloca</strong>: &#8220;How does that work?&#8221;</p><p><strong>Kernel</strong>: &#8220;When you call <code>fork()</code>, I make a new process that is an almost identical copy of you. I give this process the same code as you, a copy of your file descriptor table, and even your memory.&#8221;</p></blockquote><p><em>Alloca calls </em><code>fork()</code><em> and creates a new process called &#8220;Forka&#8221;. 
She inherits everything Alloca had.</em></p><p><em>Forka and Alloca get to work. Soon Alloca tries to perform a memory write. The familiar brief pause. Then it passes.</em></p><blockquote><p><strong>Alloca</strong>: &#8220;That pause. What was that?&#8221;</p><p><strong>Kernel</strong> (appearing): &#8220;Another page fault.&#8221;</p><p><strong>Alloca</strong>: &#8220;Another page fault? But the page is present, I&#8217;ve been reading from it just fine.&#8221;</p><p><strong>Kernel</strong>: &#8220;It&#8217;s present, yes. But I marked it read-only, and you tried to write. That&#8217;s what triggered the fault.&#8221;</p><p><strong>Alloca</strong>: &#8220;Wait, why did you mark it read-only? That memory was clearly meant for both reading and writing.&#8221;</p><p><strong>Kernel</strong>: &#8220;It was an optimization I made when creating Forka. Let me explain.&#8221;</p><p><strong>Alloca</strong>: &#8220;Please.&#8221;</p><p><strong>Kernel</strong>: &#8220;I created Forka by giving her an independent copy of your memory. The simple approach is to copy every page immediately. But you have gigabytes of heap, and most of it she may never write to. Copying all of it upfront would waste both time and memory, and it would make fork extremely slow. So instead, I gave Forka new page tables that initially point at the same physical frames as yours, which means the two of you share the same frames. That works only as long as both of you are just reading those frames. When either of you needs to write to one of these shared pages, a page fault occurs and I give the writing process a private copy of that frame. This optimization is called copy-on-write (CoW).&#8221;</p><p><strong>Alloca</strong>: &#8220;So the read-only marking is how you detect that moment.&#8221;</p><p><strong>Kernel</strong>: &#8220;Precisely. 
Your write triggered a fault, I caught it, confirmed this was a copy-on-write page, and handled it: I allocated a fresh frame, copied the 4 KB into it, updated your PTE to point to the new frame with write permission restored, and resumed your write. Forka&#8217;s mapping is untouched.&#8221;</p><p><strong>Alloca</strong>: &#8220;And now we each have our own copy of that page?&#8221;</p><p><strong>Kernel</strong>: &#8220;Yes. That page has been <em>copied on write</em>. But only that page. All the pages you haven&#8217;t written to yet are still shared. If you never write to a page, it stays shared forever, zero copies made.&#8221;</p><p><strong>Forka</strong>: &#8220;What if my parent exits before I write to a page?&#8221;</p><p><strong>Kernel</strong>: &#8220;I take care of that by tracking reference and mapping state for each physical frame. When your parent exits, I remove its mappings. The next time you write to a page, if I can see that the page is no longer shared, I can skip the copy and simply restore write permission on your existing PTE. 
There&#8217;s no one left to protect.&#8221;</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZMIc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a5b4430-9f2c-4167-a837-b1553ac61914_3000x2000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZMIc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a5b4430-9f2c-4167-a837-b1553ac61914_3000x2000.png 424w, https://substackcdn.com/image/fetch/$s_!ZMIc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a5b4430-9f2c-4167-a837-b1553ac61914_3000x2000.png 848w, https://substackcdn.com/image/fetch/$s_!ZMIc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a5b4430-9f2c-4167-a837-b1553ac61914_3000x2000.png 1272w, https://substackcdn.com/image/fetch/$s_!ZMIc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a5b4430-9f2c-4167-a837-b1553ac61914_3000x2000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZMIc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a5b4430-9f2c-4167-a837-b1553ac61914_3000x2000.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a5b4430-9f2c-4167-a837-b1553ac61914_3000x2000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 8: Copy-on-write after fork(). Initially, both page tables point to the same physical frames (top). After Alloca writes to page A, the kernel allocates a new frame (19), copies the contents, and updates only Alloca&#8217;s PTE to point to the new frame. Forka&#8217;s PTE still points to the original frame and remains read-only; the kernel will restore write permission on Forka&#8217;s next write fault without needing to copy, because the frame is no longer shared.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 8: Copy-on-write after fork(). Initially, both page tables point to the same physical frames (top). After Alloca writes to page A, the kernel allocates a new frame (19), copies the contents, and updates only Alloca&#8217;s PTE to point to the new frame. Forka&#8217;s PTE still points to the original frame and remains read-only; the kernel will restore write permission on Forka&#8217;s next write fault without needing to copy, because the frame is no longer shared." title="Figure 8: Copy-on-write after fork(). Initially, both page tables point to the same physical frames (top). After Alloca writes to page A, the kernel allocates a new frame (19), copies the contents, and updates only Alloca&#8217;s PTE to point to the new frame. 
Forka&#8217;s PTE still points to the original frame and remains read-only; the kernel will restore write permission on Forka&#8217;s next write fault without needing to copy, because the frame is no longer shared." srcset="https://substackcdn.com/image/fetch/$s_!ZMIc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a5b4430-9f2c-4167-a837-b1553ac61914_3000x2000.png 424w, https://substackcdn.com/image/fetch/$s_!ZMIc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a5b4430-9f2c-4167-a837-b1553ac61914_3000x2000.png 848w, https://substackcdn.com/image/fetch/$s_!ZMIc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a5b4430-9f2c-4167-a837-b1553ac61914_3000x2000.png 1272w, https://substackcdn.com/image/fetch/$s_!ZMIc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a5b4430-9f2c-4167-a837-b1553ac61914_3000x2000.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button 
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Figure 8: Copy-on-write after </em><code>fork()</code><em>. Initially, both page tables point to the same physical frames (top). After Alloca writes to page A, the kernel allocates a new frame (19), copies the contents, and updates only Alloca&#8217;s PTE to point to the new frame. Forka&#8217;s PTE still points to the original frame and remains read-only; the kernel will restore write permission on Forka&#8217;s next write fault without needing to copy, because the frame is no longer shared.</em></figcaption></figure></div><div class="callout-block" data-callout="true"><p style="text-align: center;"><strong>Aside: fork + exec: why process creation is cheap</strong></p><p>A common Unix pattern is to call <code>fork</code> immediately followed by <code>exec</code> to load and execute a new program. <code>exec</code> discards the child&#8217;s entire address space and builds a fresh one for the new program. For example, this is how the shell works whenever you execute a command.</p><p>For this reason, <code>fork</code> needs to be cheap, and one way to achieve that is to avoid copying the parent&#8217;s memory pages until a copy is actually needed.</p></div><div class="callout-block" data-callout="true"><h3><strong>Key Takeaway</strong></h3><p><code>fork()</code> creates a new process (the <em>child</em>) that is an exact copy of the parent at the moment of the call. 
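</p><p>Observable semantics, as a minimal C sketch (the variable and values here are invented for illustration; COW itself is invisible to the program):</p><pre><code>/* Sketch: after fork(), parent and child have logically independent
   copies of memory, even though COW shares physical frames underneath. */
#include &lt;assert.h&gt;
#include &lt;stdio.h&gt;
#include &lt;sys/types.h&gt;
#include &lt;sys/wait.h&gt;
#include &lt;unistd.h&gt;

int main(void) {
    int x = 42;                /* invented example value */
    pid_t pid = fork();
    if (pid == 0) {            /* child */
        x = 99;                /* this write triggers a COW protection fault */
        assert(x == 99);
        _exit(0);
    }
    int status;
    waitpid(pid, &amp;status, 0);
    assert(x == 42);           /* parent's copy is untouched */
    printf("parent still sees x = %d\n", x);
    return 0;
}</code></pre><p>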
Naively, this would require copying every byte of the parent&#8217;s virtual memory, a multi-gigabyte operation for large processes. Copy-on-write (COW) makes <code>fork()</code> efficient by deferring that copy until it is actually necessary.</p><p>When <code>fork()</code> is called:</p><ol><li><p>The kernel allocates a new process descriptor for the child.</p></li><li><p>The kernel creates a new set of page tables for the child, initially pointing to the same physical frames as the parent.</p></li><li><p>For every private writable mapping, the kernel marks the entry as read-only in <em>both</em> parent and child. Read-only pages (such as code) are shared as-is; they were already protected.</p></li></ol><p>The kernel tracks reference and mapping state for each physical frame. After a fork, private pages that were writable in the parent are now mapped by both processes, so their state records that they are shared.</p><p>When either process subsequently writes to a COW-protected page, the MMU detects a write to a read-only PTE and raises a <em>protection fault</em>. The kernel&#8217;s COW handler:</p><ol><li><p>Checks whether the page is still shared. If it is, a copy is needed. If the kernel can determine the faulting process is now the only relevant owner, it can simply restore write permission without copying.</p></li><li><p>If a copy is needed, it allocates a new frame, copies the contents, and updates the faulting process&#8217;s PTE to point to the new frame with write permission. The other process&#8217;s PTE is left pointing to the original frame, still read-only.</p></li></ol></div><div><hr></div><h2>Memory-Mapped Files</h2><p><em>Several cycles pass. Alloca is trying to analyze a large log file. She has been doing it the obvious way, calling </em><code>read()</code><em> in a loop, filling a buffer, processing the buffer, repeat. 
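</em></p><p><em>Her loop might look something like this (a hypothetical sketch; the file name and sizes are invented, and the demo creates its own small file so it is self-contained):</em></p><pre><code>/* Sketch of the read()-in-a-loop pattern (POSIX). */
#include &lt;assert.h&gt;
#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

int main(void) {
    const char *path = "demo.log";           /* invented name */
    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
    for (int i = 0; i &lt; 1000; i++)
        write(fd, "log line\n", 9);
    close(fd);

    fd = open(path, O_RDONLY);
    char buf[4096];
    ssize_t n;
    long total = 0;
    /* Each read() copies bytes from the kernel's page cache into buf. */
    while ((n = read(fd, buf, sizeof buf)) &gt; 0)
        total += n;                          /* "process" the buffer */
    close(fd);
    unlink(path);

    assert(total == 9000);
    printf("processed %ld bytes\n", total);
    return 0;
}</code></pre><p><em>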
Kernel notices this and wanders over.</em></p><blockquote><p><strong>Kernel</strong>: &#8220;You know there&#8217;s a better way to do that.&#8221;</p><p><strong>Alloca</strong>: &#8220;I&#8217;m reading a file. What better way is there?&#8221;</p><p><strong>Kernel</strong>: &#8220;Instead of reading into a buffer, let me map the file directly into your address space. You access it like regular memory: just use a pointer, and I&#8217;ll handle getting the data to you.&#8221;</p><p><strong>Alloca</strong>: &#8220;You mean I can read a file with a pointer? No <code>read()</code> calls at all?&#8221;</p><p><strong>Kernel</strong>: &#8220;Exactly. Call <code>mmap()</code>. Give me the file descriptor, the length, and some flags. I&#8217;ll create a new VMA in your address space (a memory-mapped region). Then you can read from or write to addresses in that region just like regular memory, and I&#8217;ll give you the file&#8217;s contents.&#8221;</p><p>Alloca does it. She gets back an address, <code>0x7f4b00000000</code>. She reaches out to read the first byte at that address.</p><p>And the pause happens again. A little longer this time.</p><p><strong>Alloca</strong>: &#8220;Longer pause. What was that?&#8221;</p><p><strong>Kernel</strong>: &#8220;A <em>major</em> page fault. When you called <code>mmap()</code>, I didn&#8217;t actually load any of the file data into memory. That file could be gigabytes in size, and I have no idea which parts you&#8217;ll actually access. So I just created a VMA for that address range and left the page table entries absent. The first time you accessed that page, the MMU found present=0, trapped to me, and I had to read it from disk.&#8221;</p><p><strong>Alloca</strong>: &#8220;So mmap is also lazy?&#8221;</p><p><strong>Kernel</strong>: &#8220;That&#8217;s right. Demand paging works for files too. 
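&#8221;</p><p>In code, what Alloca did might look like this (a hypothetical sketch; the demo creates its own file so it is self-contained, and error handling is abbreviated):</p><pre><code>/* Sketch of mapping a file and reading it through a pointer (POSIX). */
#include &lt;assert.h&gt;
#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;sys/stat.h&gt;
#include &lt;unistd.h&gt;

int main(void) {
    int fd = open("demo.log", O_CREAT | O_TRUNC | O_RDWR, 0644);
    write(fd, "hello, mmap\n", 12);

    struct stat st;
    fstat(fd, &amp;st);

    /* No data is loaded here: the kernel only records a VMA
       covering this address range. */
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    assert(data != MAP_FAILED);
    close(fd);                 /* the mapping outlives the descriptor */

    /* First touch: the MMU finds present=0, faults, and the kernel
       installs a PTE pointing at the page cache frame. */
    printf("first byte: %c\n", data[0]);
    assert(data[0] == 'h');

    munmap(data, st.st_size);
    unlink("demo.log");
    return 0;
}</code></pre><p><strong>Kernel</strong>: &#8220;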
Now, notice where I put the data after reading it from the disk.&#8221;</p><p><strong>Alloca</strong>: &#8220;Where?&#8221;</p><p><strong>Kernel</strong>: &#8220;In the <em>page cache</em>. This is a pool of physical frames I use to cache file data. When a file page is read (whether via <code>read()</code> or <code>mmap()</code>), it lands in the page cache. For your mmap access, once the data was in the page cache, I installed a page table entry pointing <em>directly</em> to that page cache frame. Your virtual address now maps directly to the physical frame that holds the file data.&#8221;</p></blockquote><div class="callout-block" data-callout="true"><p style="text-align: center;"><strong>Aside: The page cache is not reserved memory</strong></p><p>A common misconception is that the page cache is a reserved pool of memory; it is not. It is simply the set of physical frames that the kernel is currently using to hold file data. When an application needs more memory and there are no free frames, the kernel can reclaim clean page-cache frames instantly, because the file on disk is already the backing copy. This is why a system that looks nearly full of &#8220;used&#8221; memory can still allocate freely: much of that &#8220;used&#8221; memory is reclaimable cache, not locked-in application data.</p></div><blockquote><p><strong>Alloca</strong>: &#8220;So I&#8217;m reading the file&#8217;s data directly from the page cache, through my page table?&#8221;</p><p><strong>Kernel</strong>: &#8220;Yes. No intermediate user-space buffer copy. Now compare that to what happens when you use <code>read()</code> instead. I still bring the file data into the page cache, usually by DMA from the storage device into memory. But then <code>read()</code> copies the data from the page cache frame into your user-space buffer. 
That page-cache-to-user-buffer copy is the extra step that <code>mmap()</code> avoids.&#8221;</p></blockquote><div class="callout-block" data-callout="true"><p style="text-align: center;"><strong>Aside: What is DMA (Direct Memory Access)?</strong></p><p>Normally, when a CPU wants data from a storage device or network card, it would have to sit in a loop reading bytes, which is an expensive waste of cycles. DMA is a hardware mechanism that lets peripheral devices transfer data directly into main memory (RAM) without CPU involvement.</p><p>In this scheme, the kernel and device driver submit an I/O request that describes the target memory pages and the storage range. The storage controller uses DMA to transfer data directly into those pages and interrupts the CPU when the transfer is done. The CPU is free to do other work the entire time.</p></div><blockquote><p><strong>Alloca</strong>: &#8220;And <code>mmap()</code> avoids that second copy because I access the data directly through the mapped address. But what happens if you evict the page cache frame while it&#8217;s mapped?&#8221;</p><p><strong>Kernel</strong>: &#8220;Before I can reclaim that frame, I first remove the page table entry pointing to it. The VMA remains intact, so the next time you access that address the MMU finds no mapping, faults, and I reload the data. 
From your perspective the mapping is seamless; you never hold a dangling pointer.&#8221;</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ygGQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b1ff7e8-06ca-4802-9384-0fa8a9e64a91_3440x1280.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ygGQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b1ff7e8-06ca-4802-9384-0fa8a9e64a91_3440x1280.png 424w, https://substackcdn.com/image/fetch/$s_!ygGQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b1ff7e8-06ca-4802-9384-0fa8a9e64a91_3440x1280.png 848w, https://substackcdn.com/image/fetch/$s_!ygGQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b1ff7e8-06ca-4802-9384-0fa8a9e64a91_3440x1280.png 1272w, https://substackcdn.com/image/fetch/$s_!ygGQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b1ff7e8-06ca-4802-9384-0fa8a9e64a91_3440x1280.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ygGQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b1ff7e8-06ca-4802-9384-0fa8a9e64a91_3440x1280.png" width="1456" height="542" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3b1ff7e8-06ca-4802-9384-0fa8a9e64a91_3440x1280.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:542,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 9: read() vs. mmap() I/O paths. With read(), data is brought from disk into the page cache and then copied into the process&#8217;s user-space buffer. With mmap(), the process&#8217;s PTE points directly into the page cache, eliminating that page-cache-to-user-buffer copy. The trade-off is that mmap() pays through page faults and page-table management instead of explicit read calls.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 9: read() vs. mmap() I/O paths. With read(), data is brought from disk into the page cache and then copied into the process&#8217;s user-space buffer. With mmap(), the process&#8217;s PTE points directly into the page cache, eliminating that page-cache-to-user-buffer copy. The trade-off is that mmap() pays through page faults and page-table management instead of explicit read calls." title="Figure 9: read() vs. mmap() I/O paths. With read(), data is brought from disk into the page cache and then copied into the process&#8217;s user-space buffer. With mmap(), the process&#8217;s PTE points directly into the page cache, eliminating that page-cache-to-user-buffer copy. The trade-off is that mmap() pays through page faults and page-table management instead of explicit read calls." 
srcset="https://substackcdn.com/image/fetch/$s_!ygGQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b1ff7e8-06ca-4802-9384-0fa8a9e64a91_3440x1280.png 424w, https://substackcdn.com/image/fetch/$s_!ygGQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b1ff7e8-06ca-4802-9384-0fa8a9e64a91_3440x1280.png 848w, https://substackcdn.com/image/fetch/$s_!ygGQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b1ff7e8-06ca-4802-9384-0fa8a9e64a91_3440x1280.png 1272w, https://substackcdn.com/image/fetch/$s_!ygGQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b1ff7e8-06ca-4802-9384-0fa8a9e64a91_3440x1280.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Figure 9: </em><code>read()</code><em> vs. </em><code>mmap()</code><em> I/O paths. With </em><code>read()</code><em>, data is brought from disk into the page cache and then copied into the process&#8217;s user-space buffer. With </em><code>mmap()</code><em>, the process&#8217;s PTE points directly into the page cache, eliminating that page-cache-to-user-buffer copy. The trade-off is that </em><code>mmap()</code><em> pays through page faults and page-table management instead of explicit read calls.</em></figcaption></figure></div><blockquote><p><strong>Alloca</strong>: &#8220;So should I always use <code>mmap()</code> for file I/O? Avoiding that user-buffer copy sounds like an obvious win.&#8221;</p><p><strong>Kernel</strong>: &#8220;Not always. <code>mmap()</code> removes one cost, but it introduces others. It trades explicit I/O and copying for page faults, page tables, TLB pressure, and different failure modes. Whether that trade is good depends on the access pattern.&#8221;</p></blockquote><div class="callout-block" data-callout="true"><p style="text-align: center;"><strong>Aside:</strong> <code>mmap()</code><strong> is not automatically faster</strong></p><p>The first access to a cold mapped page is still a page fault. The fault enters the kernel, locates the VMA, finds or reads the page cache page, installs a PTE, and resumes the faulting instruction. If you scan a huge file once, you may take one fault per 4 KB page, and those faults can dominate the page-cache-to-user-buffer copy you avoided.</p><p><code>read()</code> and <code>mmap()</code> also expose different shapes of work. 
With <code>read()</code>, user space usually asks for a large buffer at a time, maybe 64 KB, 256 KB, or more. The kernel copies a contiguous chunk into that buffer and can issue readahead based on the file access pattern. With <code>mmap()</code>, readahead can happen too: when a fault reveals sequential access, the kernel may read surrounding file pages into the page cache, and may map nearby already-cached pages around the fault. But the control flow is still implicit and fault-driven. Cold pages still need faults to install mappings.</p><p>Mappings also consume page table memory, create TLB pressure, and may trigger TLB shootdowns when unmapped or when permissions change. Error handling is different too: if another process truncates a mapped file and you later touch a page beyond the new end, the kernel may deliver <code>SIGBUS</code>. With <code>read()</code>, you usually see an error return or a short read instead.</p><p>So <code>mmap()</code> is often attractive when access is random, repeated, shared across processes, or naturally pointer-based. <code>read()</code> is often competitive or better for simple sequential streaming, especially with large buffers. &#8220;Zero-copy&#8221; is not the same as &#8220;free&#8221;; the only reliable answer for performance-sensitive code is to measure the actual workload.</p></div><p><em>At that moment, Forka wanders over. She too needs to read the same log file.</em></p><blockquote><p><strong>Forka</strong>: &#8220;I&#8217;m going to mmap that same file. Same one you&#8217;re using, Alloca.&#8221;</p></blockquote><p><em>Forka calls </em><code>mmap()</code><em>. She accesses the same page Alloca just read. But this time there is no pause.</em></p><blockquote><p><strong>Forka</strong>: &#8220;That was fast. Why no pause this time?&#8221;</p><p><strong>Kernel</strong>: &#8220;Because that page is already in the page cache, it was loaded when Alloca accessed it. 
I just gave your page table an entry pointing to the same physical frame. You&#8217;re both reading from the same physical bytes. No disk I/O. No copy. Nothing moved.&#8221;</p><p><strong>Alloca</strong>: &#8220;Wait, we&#8217;re both pointing at the same physical frame? So if I write to my mapped region, does Forka see it?&#8221;</p><p><strong>Kernel</strong>: &#8220;That depends on a flag you passed to <code>mmap()</code>. With <code>MAP_SHARED</code>, your write goes directly into the shared page cache frame, so yes, Forka sees it. With <code>MAP_PRIVATE</code>, your write triggers a COW fault and you get a private copy, same as after <code>fork()</code>. The file is never touched.&#8221;</p><p><strong>Alloca</strong>: &#8220;And if I use <code>MAP_SHARED</code>, when does the change actually reach disk?&#8221;</p><p><strong>Kernel</strong>: &#8220;It happens asynchronously. But, if you need to guarantee it has been written to disk, you call <a href="https://man7.org/linux/man-pages/man2/msync.2.html">msync()</a> or <a href="https://man7.org/linux/man-pages/man2/fsync.2.html">fsync()</a>.&#8221;</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7lQe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f84e855-b777-4705-ba76-c7d6ba0e4c70_2800x1120.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7lQe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f84e855-b777-4705-ba76-c7d6ba0e4c70_2800x1120.png 424w, 
https://substackcdn.com/image/fetch/$s_!7lQe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f84e855-b777-4705-ba76-c7d6ba0e4c70_2800x1120.png 848w, https://substackcdn.com/image/fetch/$s_!7lQe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f84e855-b777-4705-ba76-c7d6ba0e4c70_2800x1120.png 1272w, https://substackcdn.com/image/fetch/$s_!7lQe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f84e855-b777-4705-ba76-c7d6ba0e4c70_2800x1120.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7lQe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f84e855-b777-4705-ba76-c7d6ba0e4c70_2800x1120.png" width="1456" height="582" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6f84e855-b777-4705-ba76-c7d6ba0e4c70_2800x1120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:582,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 10: MAP_SHARED vs. MAP_PRIVATE write semantics. With MAP_SHARED, writes go into the shared page cache and are flushed to disk asynchronously. With MAP_PRIVATE, the first write triggers a COW fault; the process gets a private copy that diverges from both the file and other processes.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 10: MAP_SHARED vs. MAP_PRIVATE write semantics. 
With MAP_SHARED, writes go into the shared page cache and are flushed to disk asynchronously. With MAP_PRIVATE, the first write triggers a COW fault; the process gets a private copy that diverges from both the file and other processes." title="Figure 10: MAP_SHARED vs. MAP_PRIVATE write semantics. With MAP_SHARED, writes go into the shared page cache and are flushed to disk asynchronously. With MAP_PRIVATE, the first write triggers a COW fault; the process gets a private copy that diverges from both the file and other processes." srcset="https://substackcdn.com/image/fetch/$s_!7lQe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f84e855-b777-4705-ba76-c7d6ba0e4c70_2800x1120.png 424w, https://substackcdn.com/image/fetch/$s_!7lQe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f84e855-b777-4705-ba76-c7d6ba0e4c70_2800x1120.png 848w, https://substackcdn.com/image/fetch/$s_!7lQe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f84e855-b777-4705-ba76-c7d6ba0e4c70_2800x1120.png 1272w, https://substackcdn.com/image/fetch/$s_!7lQe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f84e855-b777-4705-ba76-c7d6ba0e4c70_2800x1120.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 
4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Figure 10: </em><code>MAP_SHARED</code><em> vs. </em><code>MAP_PRIVATE</code><em> write semantics. With </em><code>MAP_SHARED</code><em>, writes go into the shared page cache and are flushed to disk asynchronously. With </em><code>MAP_PRIVATE</code><em>, the first write triggers a COW fault; the process gets a private copy that diverges from both the file and other processes.</em></figcaption></figure></div><div class="callout-block" data-callout="true"><h3><strong>Key Takeaway</strong></h3><p><code>mmap()</code> is a system call that can be used to map a range of bytes from a file directly into a process&#8217;s virtual address space, creating a new VMA backed by the file. Subsequent reads and writes to that virtual address range behave exactly like memory accesses: the kernel&#8217;s page fault machinery handles loading data from disk on demand.</p><p>The central abstraction is the <strong>page cache</strong>: a kernel-managed pool of physical frames that holds recently accessed file pages. 
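</p><p>The <code>MAP_SHARED</code> write-back semantics from the dialogue can be sketched like this (file name invented; error handling abbreviated; assumes POSIX):</p><pre><code>/* Sketch: writing through a MAP_SHARED mapping and forcing write-back. */
#include &lt;assert.h&gt;
#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;unistd.h&gt;

int main(void) {
    int fd = open("shared.dat", O_CREAT | O_TRUNC | O_RDWR, 0644);
    ftruncate(fd, 4096);            /* one page of file backing */

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    assert(p != MAP_FAILED);

    /* The store lands in the shared page cache frame; any other
       process mapping this file with MAP_SHARED sees it immediately. */
    memcpy(p, "hello", 5);

    /* Write-back to disk is asynchronous unless explicitly requested: */
    msync(p, 4096, MS_SYNC);

    /* read() goes through the same page cache frame. */
    char buf[5];
    pread(fd, buf, 5, 0);
    assert(memcmp(buf, "hello", 5) == 0);
    printf("file now starts with: %.5s\n", buf);

    munmap(p, 4096);
    close(fd);
    unlink("shared.dat");
    return 0;
}</code></pre><p>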
In the normal buffered-I/O path, file access via <code>read()</code>, <code>write()</code>, and <code>mmap()</code> goes through the page cache. The difference is how user space reaches those bytes:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!d6FD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F356ecfa1-89cd-4d7f-8181-e98b15b98a74_845x154.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!d6FD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F356ecfa1-89cd-4d7f-8181-e98b15b98a74_845x154.png 424w, https://substackcdn.com/image/fetch/$s_!d6FD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F356ecfa1-89cd-4d7f-8181-e98b15b98a74_845x154.png 848w, https://substackcdn.com/image/fetch/$s_!d6FD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F356ecfa1-89cd-4d7f-8181-e98b15b98a74_845x154.png 1272w, https://substackcdn.com/image/fetch/$s_!d6FD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F356ecfa1-89cd-4d7f-8181-e98b15b98a74_845x154.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!d6FD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F356ecfa1-89cd-4d7f-8181-e98b15b98a74_845x154.png" width="845" height="154" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/356ecfa1-89cd-4d7f-8181-e98b15b98a74_845x154.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:154,&quot;width&quot;:845,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:32033,&quot;alt&quot;:&quot;Table: read() vs mmap() paths&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/190799146?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F356ecfa1-89cd-4d7f-8181-e98b15b98a74_845x154.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Table: read() vs mmap() paths" title="Table: read() vs mmap() paths" srcset="https://substackcdn.com/image/fetch/$s_!d6FD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F356ecfa1-89cd-4d7f-8181-e98b15b98a74_845x154.png 424w, https://substackcdn.com/image/fetch/$s_!d6FD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F356ecfa1-89cd-4d7f-8181-e98b15b98a74_845x154.png 848w, https://substackcdn.com/image/fetch/$s_!d6FD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F356ecfa1-89cd-4d7f-8181-e98b15b98a74_845x154.png 1272w, https://substackcdn.com/image/fetch/$s_!d6FD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F356ecfa1-89cd-4d7f-8181-e98b15b98a74_845x154.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Table: read() vs mmap() paths</figcaption></figure></div><p>The 
reason <code>read()</code> copies into a user buffer is ownership. The caller receives bytes placed in memory it fully controls. Once the call returns, the kernel can evict or reuse the underlying page cache page without affecting the caller&#8217;s data.</p><p>With <code>mmap()</code>, the kernel hides that complexity behind the page table: if a mapped page is evicted, the PTE is marked absent, the next access faults, and the kernel reloads the data transparently.</p></div><div><hr></div><div class="callout-block" data-callout="true"><p style="text-align: center;"><strong>Aside: Bypassing the page cache using direct I/O</strong></p><p>By default, ordinary <code>read()</code>, <code>write()</code>, and <code>mmap()</code> file access goes through the page cache. File data lands in the kernel-managed page cache first, and is then either copied to a user buffer (<code>read()</code>), copied from a user buffer (<code>write()</code>), or mapped directly into the process (<code>mmap()</code>). This is <em>buffered I/O</em>, and it is the normal path.</p><p>There is another option: open a file with <code>O_DIRECT</code>. This asks the kernel to transfer file data directly between the storage stack and your user-space buffer, bypassing the normal page-cache data path. This is appealing when the application implements its own caching layer and wants to avoid the kernel-managed page cache. But it comes with its own constraints: the buffer address, I/O length, and file offset often need to satisfy filesystem/device alignment requirements, commonly 512 bytes or 4 KB, though the exact rules vary.</p><p>The reason anyone uses <code>O_DIRECT</code> is control. Database engines are the classic example. These systems perform large sequential scans while processing queries. 
With buffered I/O, the page cache fills up with scan data the database engine will not need again soon, which can evict the useful pages it will need. To gain control over this, databases implement their own buffer pools in user space and bypass the page cache via direct I/O.</p><p>The tradeoff is that you give up the page-cache machinery that normally provides readahead, dirty-page buffering/writeback, and sharing of cached file pages between processes. You are now responsible for your own buffering, I/O sizing, alignment, and scheduling strategy. For most applications, buffered I/O is the right choice. <code>O_DIRECT</code> is a tool for workloads that already implement their own caching and need tighter control over the kernel&#8217;s caching behavior.</p></div><div><hr></div><h2>Anonymous, File-Backed, and Shared Memory</h2><p><em>Alloca now understands that some pages come from files and some pages come from nowhere at all, beginning life as zero-filled frames. 
But she is still missing a vocabulary for the different kinds of memory she has been using.</em></p><blockquote><p><strong>Alloca</strong>: &#8220;I keep hearing different names for memory: anonymous memory, file-backed memory, shared memory. Are these different mechanisms, or just different names for pages?&#8221;</p><p><strong>Kernel</strong>: &#8220;They are categories of mappings. Let me explain this to you systematically.&#8221;</p><p><strong>Alloca</strong>: &#8220;Sure!&#8221;</p><p><strong>Kernel</strong>: &#8220;By now you must have understood that the VMA is a key structure behind how I manage virtual memory. Now, every VMA tells me two things about a mapping: where the data comes from, and who can observe writes to it.&#8221;</p><p><strong>Alloca</strong>: &#8220;Let&#8217;s start with where the data comes from.&#8221;</p><p><strong>Kernel</strong>: &#8220;There are two possibilities. The data can come from a file, as when you <code>mmap</code> one; that gives you what I call a file-backed mapping. The second possibility is anonymous memory, with no file backing it. For example, your heap and your stack regions are anonymous. You can also allocate anonymous memory with <code>mmap</code> by passing the <code>MAP_ANONYMOUS</code> flag.&#8221;</p><p><strong>Alloca</strong>: &#8220;Understood. What is the second thing the VMA tells you?&#8221;</p><p><strong>Kernel</strong>: &#8220;It tells me who can observe writes to that mapping. A mapping can be private or shared. With a private mapping, your writes are yours alone. If the mapping began from a file, your first write usually triggers copy-on-write and creates an anonymous private page. The file is unchanged. 
With a shared mapping, multiple processes can map the same underlying object and observe each other&#8217;s writes through those mappings.&#8221;</p><p><strong>Alloca</strong>: &#8220;So file-backed versus anonymous tells us where the contents come from, and private versus shared tells us who sees writes.&#8221;</p><p><strong>Kernel</strong>: &#8220;Exactly.&#8221;</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h9Dk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65411f0d-02b0-4d3a-a9c9-d4d9cfd76013_6400x5200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h9Dk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65411f0d-02b0-4d3a-a9c9-d4d9cfd76013_6400x5200.png 424w, https://substackcdn.com/image/fetch/$s_!h9Dk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65411f0d-02b0-4d3a-a9c9-d4d9cfd76013_6400x5200.png 848w, https://substackcdn.com/image/fetch/$s_!h9Dk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65411f0d-02b0-4d3a-a9c9-d4d9cfd76013_6400x5200.png 1272w, https://substackcdn.com/image/fetch/$s_!h9Dk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65411f0d-02b0-4d3a-a9c9-d4d9cfd76013_6400x5200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h9Dk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65411f0d-02b0-4d3a-a9c9-d4d9cfd76013_6400x5200.png" width="1456" 
height="1183" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/65411f0d-02b0-4d3a-a9c9-d4d9cfd76013_6400x5200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1183,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 11: Virtual memory mappings can be understood along two independent axes: where the contents come from, either anonymous memory or a file, and who can observe writes, either only the current process or other processes sharing the same mapping.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 11: Virtual memory mappings can be understood along two independent axes: where the contents come from, either anonymous memory or a file, and who can observe writes, either only the current process or other processes sharing the same mapping." title="Figure 11: Virtual memory mappings can be understood along two independent axes: where the contents come from, either anonymous memory or a file, and who can observe writes, either only the current process or other processes sharing the same mapping." 
srcset="https://substackcdn.com/image/fetch/$s_!h9Dk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65411f0d-02b0-4d3a-a9c9-d4d9cfd76013_6400x5200.png 424w, https://substackcdn.com/image/fetch/$s_!h9Dk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65411f0d-02b0-4d3a-a9c9-d4d9cfd76013_6400x5200.png 848w, https://substackcdn.com/image/fetch/$s_!h9Dk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65411f0d-02b0-4d3a-a9c9-d4d9cfd76013_6400x5200.png 1272w, https://substackcdn.com/image/fetch/$s_!h9Dk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65411f0d-02b0-4d3a-a9c9-d4d9cfd76013_6400x5200.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><em>Figure 11: Virtual memory mappings can be understood along two independent axes: where the contents come from, either anonymous memory or a file, and who can observe writes, either only the current process or other processes sharing the same mapping.</em></figcaption></figure></div><div class="callout-block" data-callout="true"><h3><strong>Key Takeaway</strong></h3><p>Virtual memory mappings can be classified along two axes:</p><ul><li><p><strong>Anonymous memory</strong>: Memory with no ordinary file behind it. Heap, stack, and <code>MAP_ANONYMOUS</code> mappings are common examples. New anonymous pages are zero-filled on first touch. If modified anonymous pages must be evicted, they need swap because there is no file to reload them from.</p></li><li><p><strong>File-backed memory</strong>: Memory whose contents come from a file. Executable code, shared libraries, and file mappings are examples. Clean file-backed pages can be dropped and later reloaded from the file. Dirty file-backed pages must be written back before reclaim.</p></li><li><p><strong>Private mappings</strong>: Writes are private to the process. A private file mapping can initially share clean file pages, but the first write creates an anonymous copy through COW.</p></li><li><p><strong>Shared mappings</strong>: Writes are visible to other processes mapping the same object. 
<code>MAP_SHARED</code> and POSIX shared memory use this model.</p></li></ul></div><div><hr></div><div class="callout-block" data-callout="true"><p style="text-align: center;"><strong>Aside: tmpfs: the file-anonymous hybrid</strong></p><p>&#8220;Shared memory&#8221; as people commonly use the term (<a href="https://man7.org/linux/man-pages/man7/shm_overview.7.html">POSIX shared memory</a> via <code>shm_open</code>, System V shared memory, <code>/dev/shm</code>) is a distinct concept from the shared <em>mapping</em> we just discussed. A shared mapping is simply one where writes are visible to other mappers. These shared memory APIs are higher-level mechanisms built on top of that idea; under the hood, they are typically backed by <strong>tmpfs</strong>.</p><p>tmpfs is a filesystem whose contents live entirely in memory and swap rather than on a persistent disk. A tmpfs file looks and behaves like an ordinary file: you can <code>open()</code>, <code>mmap()</code>, or <code>fstat()</code> it, but there is no disk backing it. If the system reboots, the contents are gone.</p><p>From a reclaim perspective, tmpfs pages behave more like anonymous memory than disk-backed file cache: they have no persistent disk file to reload from, so evicted dirty tmpfs pages go to swap. Internally, they still live in the page cache and are managed through the VFS like ordinary files, which is what makes the familiar file API work. This makes tmpfs useful as a fast inter-process communication channel: two processes can map the same file from <code>/dev/shm</code> with <code>MAP_SHARED</code> and share the same physical frames, while still using the ordinary file API.</p></div><div><hr></div><h2>Page Reclaim: How the Kernel Chooses What to Evict</h2><p><em>Alloca has now seen swap and file-backed mappings, but she has only been told the simple version: when memory runs out, the kernel evicts something old. 
She wants to know how that choice is made.</em></p><blockquote><p><strong>Alloca</strong>: &#8220;When physical memory fills up, you said you pick a page that hasn&#8217;t been accessed recently. But how do you know that a page hasn&#8217;t been used in the recent past?&#8221;</p><p><strong>Kernel</strong>: &#8220;I maintain lists of physical frames organized by how recently they appear to have been used: the LRU (least recently used) lists. I simply scan these lists, starting from the coldest end, and find a candidate page that can be evicted.&#8221;</p><p><strong>Alloca</strong>: &#8220;But the question remains: how are these lists created and updated? Do you monitor each memory access to continuously update these lists?&#8221;</p><p><strong>Kernel</strong>: &#8220;Watching every access in software would be impossibly expensive. So I rely on the hardware&#8217;s help. Every page table entry has an <em>accessed</em> bit, which indicates whether the page has been accessed. When the MMU performs a page table walk and uses a PTE to translate an address, it sets that bit automatically in that PTE. I don&#8217;t have to trap the access; I just come along later and look at what the hardware recorded.&#8221;</p><p><strong>Alloca</strong>: &#8220;How does that work in practice? The MMU is setting the accessed bit in the page table entries, but you need to maintain and update LRU lists of frames. Do you actively go through all the page table entries of all processes and update the LRU lists?&#8221;</p><p><strong>Kernel</strong>: &#8220;That would be just as expensive. Imagine iterating every virtual page of every running process on every reclaim cycle; you&#8217;d spend more time on bookkeeping than anything else. I take the opposite approach. 
I scan the LRU list from the coldest end, find the page table entries that map each frame, and check whether the accessed bit is set.&#8221;</p><p><strong>Alloca</strong>: &#8220;How do you find out which PTEs map to a frame?&#8221;</p><p><strong>Kernel</strong>: &#8220;That&#8217;s where <em>reverse mappings</em> come in, usually called rmap. The page table is a forward map: virtual address &#8594; physical frame. I also maintain the reverse: metadata attached to each physical frame that lets me find the VMAs and page table entries that currently map it. When I want to check whether a frame is warm, I follow its rmap to the relevant PTEs, and check the accessed bits.&#8221;</p><p><strong>Alloca</strong>: &#8220;Ah, I was not aware that you also maintain reverse mappings. But I still don&#8217;t understand how all of this works together. You&#8217;ve given me pieces of the puzzle, but the full picture is not clear.&#8221;</p><p><strong>Kernel</strong>: &#8220;The confusion is understandable. Let&#8217;s connect everything together. When I have to reclaim memory, I start by scanning the coldest set of frames from the LRU list. Then I use the rmap to check the accessed bits of the PTEs mapping those frames. If a frame&#8217;s accessed bit is not set, then it is a candidate for reclaim.&#8221;</p><p><strong>Alloca</strong>: &#8220;And what if the accessed bit was set?&#8221;</p><p><strong>Kernel</strong>: &#8220;Then things become interesting. If a frame&#8217;s accessed bit is set, it could mean that it has been accessed tens or hundreds of times, but it could also mean that it was accessed once and has since gone cold. So, for such frames, I clear the accessed bit to give them a second chance. 
If the frame is scanned again later and the bit is still clear, then that is stronger evidence that it has gone cold.&#8221;</p></blockquote><div class="callout-block" data-callout="true"><p style="text-align: center;"><strong>Aside: The </strong><code>kswapd</code><strong> daemon</strong></p><p>Normally, Linux runs a background thread called <code>kswapd</code> that watches free-memory watermarks. When free memory drops below a threshold, <code>kswapd</code> wakes up and starts reclaiming pages before the situation becomes urgent.</p><p>If background reclaim cannot keep up, the allocating process may have to perform reclaim itself before its allocation can proceed. This is called direct reclaim, and it can show up as allocation latency in the application.</p></div><blockquote><p><strong>Alloca</strong>: &#8220;And how are the LRU lists structured? You said you start from the coldest end; how do pages age toward that end?&#8221;</p><p><strong>Kernel</strong>: &#8220;Although things are a bit more complex, I will simplify for you. Think of two lists: <em>active</em> and <em>inactive</em>, each having a head (newest) and a tail (oldest). When a new page is faulted in, it typically starts near the head of the inactive list. Over time, pages age toward the tail as newer pages push them back and as colder pages ahead of them get reclaimed.&#8221;</p><p><strong>Alloca</strong>: &#8220;But if all the newly faulted pages start from the head of the inactive list, how does a page get promoted to the active list?&#8221;</p><p><strong>Kernel</strong>: &#8220;A page that consistently shows its accessed bit set across multiple reclaim scans is promoted to the active list because it has demonstrated sustained use. From there, it ages toward the active list&#8217;s tail. When the active list grows too large, its tail pages are demoted back to the head of the inactive list. So the flow is: inactive tail is where eviction happens, active tail is where demotion back to inactive happens. 
Pages circulate through this cycle, and only those that consistently fail to show any access get evicted.&#8221;</p></blockquote><div class="callout-block" data-callout="true"><p style="text-align: center;"><strong>Aside: Multigenerational LRU (MGLRU)</strong></p><p>The active/inactive model works, but two buckets is a coarse instrument. The fundamental limitation is that it preserves only coarse aging information: it can tell that a page looked recently referenced at scan time, but it does not maintain a rich multi-step history of how its temperature changed over time. A page accessed ten thousand times since promotion looks effectively the same as one accessed once; a page that was hot for ages but cooled recently looks the same as one that was never warm. Under workloads with mixed access frequencies, periodic re-access patterns, or bursty I/O, this can lead to evicting pages that will soon be needed or retaining pages that will not.</p><p><strong><a href="https://lpc.events/event/18/contributions/1781/attachments/1592/3304/mglru-updates-lpc2024.pdf">MGLRU</a></strong><a href="https://lpc.events/event/18/contributions/1781/attachments/1592/3304/mglru-updates-lpc2024.pdf"> (multi-generational LRU)</a> addresses the root cause by giving the kernel more expressive age information. Instead of two lists, pages are grouped into several <em>generations</em>, each representing a time window of access activity. Pages start in the youngest generation when first faulted or accessed. Without re-access they age into older generations; with re-access they are refreshed back into a younger one. Reclaim always targets the oldest generation first. With more age buckets, the cooling curve of a page becomes observable over time, allowing the kernel to make finer, more informed eviction decisions.</p><p>MGLRU was introduced in Linux 6.1. The build config option <code>CONFIG_LRU_GEN=y</code> includes the code and <code>CONFIG_LRU_GEN_ENABLED=y</code> enables it by default. 
When compiled in, <code>/sys/kernel/mm/lru_gen/enabled</code> controls it at runtime. Systems without it fall back to the classic active/inactive lists.</p></div><blockquote><p><strong>Alloca</strong>: &#8220;So the lists tell you which pages are cold. But once you&#8217;ve found a cold page, does it matter what kind of page it is? Is every cold page equally easy to evict?&#8221;</p><p><strong>Kernel</strong>: &#8220;Not at all. The first split is file-backed versus anonymous. Clean file-backed pages are the easiest. If a page cache page matches the file on disk, I can drop it immediately and reuse the frame. The next access will fault and read it back from the file.&#8221;</p><p><strong>Alloca</strong>: &#8220;What about dirty file-backed pages?&#8221;</p><p><strong>Kernel</strong>: &#8220;Those need writeback. If a process wrote through <code>write()</code> or <code>MAP_SHARED</code>, the page cache page may be dirty. Before I can reclaim that frame, I need to schedule I/O to write the contents back to the filesystem. After writeback completes, the page becomes clean and cheap to drop. A <code>MAP_PRIVATE</code> write is different: the first write produces a private anonymous copy via COW. That copy has no file behind it, so there is no persistent home to reload from. To reclaim it safely I must write it to swap, same as any other anonymous page with real data in it.&#8221;</p><p><strong>Alloca</strong>: &#8220;So under memory pressure, file cache tends to be easier to reclaim than heap memory.&#8221;</p><p><strong>Kernel</strong>: &#8220;Often, yes, especially clean file cache. This is why free memory can look low while the system is healthy: much of RAM may be used as page cache, and clean cache can be reclaimed quickly when applications need memory. The dangerous case is when the active working sets of processes exceed RAM. 
Then I have to reclaim pages that will soon be needed again, and the system can start thrashing.&#8221;</p><p><strong>Alloca</strong>: &#8220;Thrashing means constantly evicting and faulting the same pages back in?&#8221;</p><p><strong>Kernel</strong>: &#8220;Right. The CPU spends more time waiting for page faults and disk I/O than doing useful work. At that point, virtual memory&#8217;s illusion of abundant memory has become too expensive to maintain.&#8221;</p></blockquote><div class="callout-block" data-callout="true"><h3><strong>Key Takeaway</strong></h3><p>Page reclaim is the kernel&#8217;s mechanism for freeing physical frames under memory pressure. It is approximate, not perfect LRU. Two complementary mechanisms make it practical without being prohibitively expensive:</p><ul><li><p><strong>Accessed bits</strong>: Every page table entry has a hardware-maintained accessed bit that the MMU sets automatically when the CPU uses that mapping. The kernel reads and clears these bits periodically to estimate recency without trapping every memory access.</p></li><li><p><strong>Reverse mappings (rmap)</strong>: The page table is a forward map (virtual &#8594; physical). The kernel also maintains the reverse: metadata on each physical frame that lets it find the VMAs and page table entries that map it. Reclaim uses rmap to check accessed bits on candidate frames only, without scanning every process&#8217;s page table. This means reclaim starts from lists of physical frames, not from virtual address spaces, so the cost scales with the number of frames under consideration, not with the total size of all processes&#8217; virtual memory.</p></li></ul><p><strong><a href="https://alexeydemidov.com/2025/05/13/linux-inactive-memory/">Active/inactive LRU</a></strong>: Pages move between active and inactive lists. In Linux, these are split further into anonymous and file-backed LRUs, maintained per memory-management domain. New pages generally enter as inactive candidates. 
Pages age toward the tail as newer pages arrive. Reclaim scans from the <strong>tail of inactive</strong>, checking accessed bits via rmap for mapped pages:</p><ul><li><p>Accessed bit set means that the page was recently used; clear the bit to give it a reprieve.</p></li><li><p>Accessed bit clear means that the page is cold; evict it.</p></li></ul><p>Pages that are consistently accessed get promoted to the <strong>active list</strong>. When the active list grows too large, its tail pages are demoted back to the head of inactive. Pages cycle through this until they consistently fail to show any access.</p><p><strong><a href="https://lpc.events/event/18/contributions/1781/attachments/1592/3304/mglru-updates-lpc2024.pdf">MGLRU</a></strong><a href="https://lpc.events/event/18/contributions/1781/attachments/1592/3304/mglru-updates-lpc2024.pdf"> (multi-generational LRU)</a> extends this with several age generations instead of two lists, allowing finer-grained decisions about what is truly cold.</p><p>The reclaim cost also depends heavily on page type:</p><p><strong>Clean file-backed page</strong>: cheapest. Drop it immediately; a future access reloads from the file.</p><p><strong>Dirty file-backed page</strong>: must be written back to storage before the frame can be reused.</p><p><strong>Anonymous page with private data</strong>: generally needs swap before reclaim, because there is no file to reload it from. Without swap configured, ordinary anonymous pages are much harder to reclaim.</p><p>The practical consequence: &#8220;used memory&#8221; is not automatically bad. The RAM used for clean page cache is readily reclaimable. However, the real danger is when the combined hot working set of applications exceeds RAM, forcing the kernel to evict pages that will soon be needed again, causing thrashing.</p></div><div><hr></div><h2>Memory Access Patterns and VM Performance</h2><p><em>Alloca has been running correctly for some time now. 
Her pages are backed, her TLB is warm, and demand paging has handled everything smoothly. But lately she&#8217;s noticed something odd: she has two data structures (a dense array and a hash table), each holding the same amount of data, both fitting entirely in RAM. When she scans through all elements in each, the array finishes in seconds. The hash table takes ten times longer.</em></p><blockquote><p><strong>Alloca</strong>: &#8220;Same amount of data. Both in RAM. Page table entries for both are installed. Why is the hash table so much slower?&#8221;</p><p><strong>Kernel</strong>: &#8220;Because the virtual address space makes all memory look equally fast. It isn&#8217;t. The cost of an access depends on how it interacts with the layers underneath: the TLB, the cache, the physical layout.&#8221;</p><p><strong>Alloca</strong>: &#8220;Tell me what&#8217;s different.&#8221;</p><p><strong>Kernel</strong>: &#8220;When you scan the array, you move through virtual addresses in order. If the first element is at address <code>0x1000</code>, and each element is 4 bytes, then the next is at <code>0x1004</code>, then <code>0x1008</code>, and so on. You stay within one 4 KB page for over a thousand consecutive accesses. Remember, the TLB caches completed virtual-to-physical translations, one entry per page. All those accesses within the same page reuse the same TLB entry, so they are fast. Then you cross into the next page and need one new entry. Only a small sliding window of TLB entries is active at any moment, and you reuse each one extensively before moving on. The TLB handles that easily.&#8221;</p><p><strong>Alloca</strong>: &#8220;And with the hash table? I&#8217;m probing at random locations across the whole allocation.&#8221;</p><p><strong>Kernel</strong>: &#8220;Yes, that&#8217;s where the problem is. Hash table probes are spread across the entire allocation with no fixed order. You might touch page 47, then page 3, then page 201. 
The CPU has a limited hierarchy of TLBs, a small L1 TLB and a slightly larger second-level TLB. Together they may cover hundreds to a few thousand page translations depending on the CPU and page size. As your probe set fans out across many pages, the TLB hierarchy fills up. When it&#8217;s full, a new translation evicts an old one. The trouble is that with no locality in your access pattern, the evicted translation is often the one you&#8217;ll need again soon. By the time you revisit a page, its translation is likely gone, and the hardware may have to walk the page table again to rebuild it.&#8221;</p><p><strong>Alloca</strong>: &#8220;So if a translation misses across the TLB hierarchy, the hardware has to do a page walk before I can even access the data?&#8221;</p><p><strong>Kernel</strong>: &#8220;Right. For random access across a large range, you can be spending significant overhead on translation for every byte you actually wanted. And TLB pressure isn&#8217;t the only thing working against you. There&#8217;s also the hardware prefetcher. When you access virtual addresses in a predictable pattern, the CPU detects it and starts fetching upcoming cache lines before you ask for them. For your array scan, you&#8217;re reading <code>0x1000</code>, <code>0x1004</code>, <code>0x1008</code> in sequence, so the prefetcher loads the next cache lines ahead of time.&#8221;</p><p><strong>Alloca</strong>: &#8220;But what if the next address crosses into the next virtual page?&#8221;</p><p><strong>Kernel</strong>: &#8220;Usually the hardware prefetchers are conservative around 4 KB page boundaries because crossing into the next page could cause a page fault or run into permission issues.&#8221;</p><p><strong>Alloca</strong>: &#8220;Understood. Each array page holds over a thousand elements. So the prefetcher helps throughout each page, and the cost of crossing into the next is just one TLB lookup?&#8221;</p><p><strong>Kernel</strong>: &#8220;Correct. 
For your hash table, the random probes defeat the prefetcher even within a single page because there&#8217;s no predictable pattern to detect. So the array wins twice: fewer distinct TLB entries needed, and hardware prefetching of the next cache lines.&#8221;</p><p><strong>Alloca</strong>: &#8220;Is there anything else that affects this?&#8221;</p><p><strong>Kernel</strong>: &#8220;Yes, how often you revisit the same pages. If you keep accessing the same set of pages over and over, those pages stay hot. Their TLB entries stay cached, so you&#8217;re not constantly rebuilding translations. And those physical frames stay in RAM because my reclaim policy notices they&#8217;re being used frequently. I&#8217;m less likely to evict a page that&#8217;s getting hammered than one that hasn&#8217;t been touched in a while.&#8221;</p><p><strong>Alloca</strong>: &#8220;So if my working set is small enough to fit in the TLB and I keep reusing it, I&#8217;m golden?&#8221;</p><p><strong>Kernel</strong>: &#8220;Exactly. A tight working set is cheap. But if your working set sprawls across hundreds of thousands of pages that you only touch occasionally, you&#8217;re constantly evicting TLB entries you&#8217;ll need again soon. And under memory pressure, those infrequently accessed pages become candidates for eviction to swap. Then you&#8217;re not just paying for TLB misses; you&#8217;re paying for disk I/O to bring pages back from swap.&#8221;</p><p><strong>Alloca</strong>: &#8220;So the key is to touch fewer pages. Is there anything I can do to control this?&#8221;</p><p><strong>Kernel</strong>: &#8220;Absolutely. One thing that&#8217;s often overlooked is how tightly you pack your data. 
The virtual memory system operates at page granularity, so anything that helps you fit more useful data into each page reduces the number of pages, translations, and TLB entries needed for the same logical work.&#8221;</p></blockquote><div class="callout-block" data-callout="true"><p style="text-align: center;"><strong>Aside: Data layout also changes TLB footprint</strong></p><p>Compilers often pad structs to satisfy alignment requirements, but struct padding is not just a local layout detail. It also affects how much memory an array of those structs occupies, and therefore how many cache lines and pages the program touches.</p><p>Suppose you have a struct with a <code>char</code>, then an 8-byte pointer, then another <code>char</code>. On a typical 64-bit system, the compiler may insert padding after the first <code>char</code> to align the pointer, and then more padding at the end so that each element in an array keeps the pointer correctly aligned. The result may be 24 bytes per struct, even though the actual fields occupy only 10 bytes.</p><p>Across a million elements, that difference matters. A 24-byte layout occupies about 24 MB, while a more compact reordered layout may occupy about 16 MB. With 4 KB pages, the larger layout spans more pages. More pages means more TLB entries are needed to cover the same number of logical objects, more page-table walks when the TLB misses, and more memory that the kernel may have to manage under pressure.</p><p>One common way to reduce padding is to order fields from larger alignment requirements to smaller ones: 8-byte fields first, then 4-byte fields, then 2-byte fields, then 1-byte fields. The compiler may still add tail padding, but usually less than when different-sized fields are interleaved randomly.</p></div><div class="callout-block" data-callout="true"><h3><strong>Key Takeaway</strong></h3><p>Virtual memory makes all addresses look the same, but they&#8217;re not. 
The CPU has a limited TLB hierarchy, with small L1 TLBs backed by larger second-level TLBs. Together, they cover a limited number of translations, typically a few hundred to a few thousand, depending on the CPU and page size. Once your working set spans more pages than the TLB hierarchy can cover, translation misses become more common. Misses that hit in the second-level TLB are cheaper, but misses that require a hardware page walk can be expensive.</p><p>How you access memory matters a lot. If you walk through an array sequentially, you stay within a small number of pages at any given time. You reuse the same TLB entries for thousands of accesses before moving to the next page. The hardware prefetcher can see the pattern and load upcoming data into cache before you ask for it (at least until you hit a page boundary, where it has to stop). That&#8217;s why sequential scans are fast.</p><p>Random access is a different story. When you jump around unpredictably, like probing a hash table or chasing linked-list pointers, you may land on different pages very frequently. As a result, you may face TLB misses for pages that are being visited for the first time, and you also risk evicting TLB entries you&#8217;ll need again soon. The prefetcher can&#8217;t predict where you&#8217;re going next, so it doesn&#8217;t help. In the worst case, every access risks a TLB miss and a page walk.</p><p>Temporal locality matters too. If you keep revisiting the same pages, they stay hot. Their translations stay cached in the TLB. The kernel is less likely to reclaim frequently used pages, because they tend to be recognized as part of the active working set. Under severe pressure, though, even useful pages can still be reclaimed. But if your working set is sprawling and you rarely touch the same page twice, you&#8217;re constantly rebuilding translations and building up memory pressure.</p><p>How you pack your data affects how many pages you touch. 
A poorly-designed struct with lots of padding might be twice the size of a well-packed one. If you have an array of a million structs, that can result in a difference of 6000 vs 3000 pages. Same logical work, but one version fits in the TLB and the other thrashes. Every byte you save per element multiplies across the whole working set: fewer cache lines, fewer pages, fewer translations, fewer page walks, and less memory pressure.</p><p>The VM machinery works largely at page granularity while caches operate at cache-line granularity. Performance-conscious code thinks about how data is laid out in both cache lines and pages, how those pages fit in the TLB, and how access patterns interact with the translation machinery.</p></div><div><hr></div><h2>Huge Pages and TLB Efficiency</h2><p><em>Alloca has redesigned her hash table. Better hash function, reduced load factor. She accepts that random access is unavoidable. But she is still spending too much time on TLB misses. For a 2 GB table with 4 KB pages, the math is unforgiving: half a million pages, and no TLB holds that many entries.</em></p><blockquote><p><strong>Alloca</strong>: &#8220;I understand the TLB problem. My 2 GB table spans half a million 4 KB pages. The TLB can only hold a limited number of translations. I will always be missing. What can I do besides shrinking the data?&#8221;</p><p><strong>Kernel</strong>: &#8220;You can change the page size. The TLB has a fixed capacity; you can&#8217;t change it. But what you can change is how much memory each entry covers. x86-64 supports 2 MB huge pages, and on many systems 1 GB pages as well. A single 2 MB TLB entry covers 512 times as much memory as a 4 KB entry. So your 2 GB hash table mapped with 2 MB pages needs only 1,024 TLB entries instead of half a million.&#8221;</p><p><strong>Alloca</strong>: &#8220;That is dramatically fewer. 
But, how does this work with the page table hierarchy?&#8221;</p><p><strong>Kernel</strong>: &#8220;The page table walk has an early-exit mechanism when you use huge pages. Each page table entry has a set of flags embedded in its low bits. One of those flags is the <em>page-size bit</em> (PS) which tells the hardware: &#8216;stop here, this entry points directly at a physical frame, not at another table.&#8217; For a normal 4 KB mapping, the PMD entry points to a PTE table, and the walk continues. But when the PS bit is set on the PMD entry instead, the hardware treats the PMD entry itself as the final frame mapping, covering 2 MB at once. It skips the PTE level entirely. The 21 low-order bits of the virtual address become the offset within the 2 MB frame instead of requiring a further table lookup. Similarly, if the PS bit is set on a PUD entry, the hardware stops there and maps 1 GB directly, skipping both the PMD and PTE levels.&#8221;</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7TJG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc846747-3760-42a9-aef3-846e310cd219_2880x1240.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7TJG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc846747-3760-42a9-aef3-846e310cd219_2880x1240.png 424w, https://substackcdn.com/image/fetch/$s_!7TJG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc846747-3760-42a9-aef3-846e310cd219_2880x1240.png 848w, 
https://substackcdn.com/image/fetch/$s_!7TJG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc846747-3760-42a9-aef3-846e310cd219_2880x1240.png 1272w, https://substackcdn.com/image/fetch/$s_!7TJG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc846747-3760-42a9-aef3-846e310cd219_2880x1240.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7TJG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc846747-3760-42a9-aef3-846e310cd219_2880x1240.png" width="1456" height="627" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cc846747-3760-42a9-aef3-846e310cd219_2880x1240.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:627,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7TJG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc846747-3760-42a9-aef3-846e310cd219_2880x1240.png 424w, https://substackcdn.com/image/fetch/$s_!7TJG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc846747-3760-42a9-aef3-846e310cd219_2880x1240.png 848w, 
https://substackcdn.com/image/fetch/$s_!7TJG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc846747-3760-42a9-aef3-846e310cd219_2880x1240.png 1272w, https://substackcdn.com/image/fetch/$s_!7TJG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc846747-3760-42a9-aef3-846e310cd219_2880x1240.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Figure 12: Huge page early-exit paths through the page table hierarchy. 
A normal 4 KB access walks all four levels. A 2 MB huge page stops at the PMD level (the PMD entry has the page-size flag set); the lower 21 bits of the virtual address become the offset within the 2 MB page, so no PTE lookup is needed. A 1 GB huge page stops at the PUD level; the lower 30 bits become the offset within the 1 GB page.</em></figcaption></figure></div><blockquote><p><strong>Alloca</strong>: &#8220;Fewer levels in the walk, fewer TLB entries needed. What is the catch?&#8221;</p><p><strong>Kernel</strong>: &#8220;Physical contiguity. A 2 MB huge page needs 512 physically contiguous 4 KB frames, and the starting address has to be aligned to a 2 MB boundary. For a regular 4 KB page, I can grab any single free frame from anywhere in physical memory. It&#8217;s easy. But for a huge page, I need to find a 2 MB-aligned block where all 512 frames are sitting right next to each other, and they all have to be free at the same time. After the system has been running for a while, physical memory gets fragmented. Small allocations come and go, leaving little gaps everywhere. Finding a big contiguous block with the right alignment gets harder and harder. I can try compaction, where I migrate pages around to assemble larger free ranges, but there&#8217;s no guarantee it&#8217;ll work.&#8221;</p><p><strong>Alloca</strong>: &#8220;So huge pages are generally easier to get on a fresh system and harder as long-running workloads fragment memory?&#8221;</p><p><strong>Kernel</strong>: &#8220;That&#8217;s the usual pattern, yes. So how do you get them reliably? One answer is to reserve a pool upfront, ideally at boot before memory has had a chance to fragment. You set <code>vm.nr_hugepages</code>, I carve out that many huge pages and hold them aside. They&#8217;re always contiguous, always aligned, always ready. When you ask for one, I hand it out instantly. 
The catch: that memory stays off-limits for anything else for as long as it&#8217;s in the pool, even when nothing is using it.&#8221;</p><p><strong>Alloca</strong>: &#8220;And if I don&#8217;t want to lock memory away like that?&#8221;</p><p><strong>Kernel</strong>: &#8220;That&#8217;s where <a href="https://lwn.net/Articles/359158/">Transparent Huge Pages</a>, or THP, comes in. THP tries to give you huge pages without a dedicated pool. Sometimes I can allocate one directly when you first fault a region. Other times, a background daemon called <code>khugepaged</code> scans your anonymous mappings and collapses a 2 MB-aligned range of base pages into a single huge page after the fact. Your mapping gets upgraded silently, no code changes needed.&#8221;</p><p><strong>Alloca</strong>: &#8220;So THP might help and might not, and I have no guarantee which I got.&#8221;</p><p><strong>Kernel</strong>: &#8220;Right. It&#8217;s opportunistic. It runs into the same fragmentation problem I described earlier: finding a 2 MB-aligned contiguous block on a system that&#8217;s been running for a while is not always possible. If the block isn&#8217;t there, nothing happens and you stay on base pages. The other risk is that THP may try to <em>create</em> that contiguous block by running compaction first, migrating pages around to free up the space. Compaction is expensive and can cause latency spikes, which is why some latency-sensitive systems disable THP entirely. 
For predictable huge page coverage, like a database buffer pool, a large in-memory cache, anything where sudden jitter is unacceptable, you&#8217;re better off reserving the pool explicitly at boot.&#8221;</p></blockquote><div class="callout-block" data-callout="true"><h3><strong>Key Takeaway</strong></h3><p>On x86-64, the base page size is 4 KB, but the architecture also supports larger leaf mappings: <strong>2 MB</strong> pages (a PMD-level leaf entry, skipping the PTE table), and on systems with appropriate hardware support, <strong>1 GB</strong> pages (a PUD-level leaf entry, skipping both PMD and PTE levels). Each covers correspondingly more memory per TLB entry and requires fewer levels in the page table walk on a TLB miss.</p><p>The key constraint is physical contiguity: a 2 MB huge page requires 512 physically contiguous, correctly aligned frames. Physical memory fragmentation, which accumulates over time as the system allocates and frees memory of different sizes, makes this progressively harder to satisfy.</p><p>Linux provides two mechanisms:</p><ul><li><p><strong>Explicit huge pages</strong> (configured via <code>vm.nr_hugepages</code> or at boot): drawn from a dedicated HugeTLB pool. Reserving them at boot is the most reliable way to avoid fragmentation. Memory in the pool is reserved for HugeTLB use while it remains there, i.e., it cannot be used as ordinary pages, but the pool size can be reduced later to release pages back, subject to fragmentation.</p></li><li><p><strong>Transparent Huge Pages (THP)</strong>: opportunistic huge-page backing for ordinary mappings, especially anonymous memory, either through fault-time huge-page allocation or later background collapse by khugepaged. 
Falls back to base pages when a suitable huge page cannot be allocated or assembled; depending on THP settings, the attempt itself may trigger compaction and latency spikes.</p></li></ul><p>For latency-sensitive workloads with large, frequently-accessed memory regions, explicit huge pages provide the reliable TLB reduction that THP cannot guarantee. The trade-off is granularity: larger pages reduce translation overhead but can waste memory and are harder for the kernel to allocate.</p></div><div><hr></div><h2>TLB Shootdowns on Multi-Core Systems</h2><p><em>Alloca has spawned dozens of worker threads. They&#8217;re distributed across the machine&#8217;s cores, all working in parallel. Everything runs smoothly until she decides to release a large memory mapping she no longer needs.</em></p><blockquote><p><strong>Alloca</strong>: &#8220;I used <code>mmap</code> earlier to create a large shared memory region. Now I&#8217;m done with it. How do I give it back?&#8221;</p><p><strong>Kernel</strong>: &#8220;You call <a href="https://man7.org/linux/man-pages/man3/munmap.3p.html">munmap</a>. It&#8217;s the counterpart to <code>mmap</code>. You pass the starting virtual address and the length, and I clean up the range: the VMAs are removed, the page-table entries are cleared. Physical pages that nothing else is pointing to get released back to wherever they came from.&#8221;</p><p><strong>Alloca</strong>: &#8220;That sounds straightforward.&#8221;</p><p><strong>Kernel</strong>: &#8220;It would be, if you were running on a single core. But you&#8217;re not. You have dozens of threads running in parallel across multiple CPU cores. And, <em>every core carries its own private TLB</em>.&#8221;</p><p><strong>Alloca</strong>: &#8220;Wait, they don&#8217;t share a single TLB?&#8221;</p><p><strong>Kernel</strong>: &#8220;No. Every core keeps its own private cache of recent translations. 
On a multi-core machine, when your thread accesses memory, the MMU on <em>that specific core</em> checks <em>that core&#8217;s TLB</em>. If it misses, the page walk happens, and the result gets cached in that core&#8217;s TLB. Other cores don&#8217;t see that entry unless they independently translate the same address and cache it themselves.&#8221;</p><p><strong>Alloca</strong>: &#8220;So if thread A on core 0 and thread B on core 1 both access the same virtual address, they each have their own TLB entry for it?&#8221;</p><p><strong>Kernel</strong>: &#8220;Exactly. Both cores translate the same virtual address to the same physical frame, but they cache that translation independently. This per-core design is essential for performance: sharing a single TLB across dozens of cores would create a massive bottleneck. But it creates a consistency problem when page tables change.&#8221;</p><p><strong>Alloca</strong>: &#8220;What kind of problem?&#8221;</p><p><strong>Kernel</strong>: &#8220;Think about what happens when you call <code>munmap</code>. You&#8217;re on core 0. I clear the PTEs for the region you&#8217;re releasing. But cores 1, 2, 3&#8230; they might still have cached translations for pages in that region. Those TLB entries now point to frames that you just gave back to me.&#8221;</p><p><strong>Alloca</strong>: &#8220;And you might reassign those frames to someone else immediately.&#8221;</p><p><strong>Kernel</strong>: &#8220;Yes. Without explicitly invalidating those cached translations, a CPU could keep using a stale translation after I have decided the mapping is gone. If the underlying page were later reused for something else, that would be a disaster. I cannot allow that to happen.&#8221;</p><p><strong>Alloca</strong>: &#8220;So before <code>munmap</code> finishes, you need to make sure every core&#8217;s TLB is consistent with the cleared page table?&#8221;</p><p><strong>Kernel</strong>: &#8220;Yes. 
And that&#8217;s expensive.&#8221;</p><p><strong>Alloca</strong>: &#8220;How do you do it?&#8221;</p><p><strong>Kernel</strong>: &#8220;I send <em>inter-processor interrupts</em> (IPIs) to every CPU core that might hold stale translations for this address space. When a core receives the IPI, it stops what it&#8217;s doing, runs a short TLB flush routine to invalidate the affected entries, and sends an acknowledgment back. I wait for all cores to acknowledge before I let your <code>munmap</code> call complete. This is called a <em>TLB shootdown</em>.&#8221;</p></blockquote><div class="callout-block" data-callout="true"><p style="text-align: center;"><strong>Aside: What is an inter-processor interrupt?</strong></p><p>Modern CPUs have a hardware mechanism called the APIC (Advanced Programmable Interrupt Controller) that lets one CPU core send an interrupt directly to another. This is an <em>inter-processor interrupt</em>, or IPI. Unlike a regular device interrupt, which is triggered by external hardware (a disk, a network card), an IPI is sent by software running on one core to deliberately interrupt a different core.</p><p>When a core receives an IPI, it stops whatever it was doing, saves its state, and jumps to an interrupt handler. For TLB shootdowns, that handler executes instructions to invalidate the stale TLB entries, then signals acknowledgment and returns to the interrupted work. The sending core waits until all targeted cores have acknowledged before proceeding.</p><p>This mechanism is general-purpose. The kernel uses IPIs for TLB shootdowns, but also for things like delivering signals across cores, triggering scheduler reschedules, and stopping cores for kernel panics or suspend.</p></div><blockquote><p><strong>Alloca</strong>: &#8220;Every core has to stop and flush, even if they&#8217;re in the middle of something?&#8221;</p><p><strong>Kernel</strong>: &#8220;Yes, if they might have cached translations for your address space. 
If a core has never run any of your threads, I can skip it. But if a thread has been running on a core recently, that core&#8217;s TLB might still hold entries for your address space. I send the IPI, that core stops, flushes the relevant entries, and I wait for it to confirm before letting your <code>munmap</code> complete. So, you&#8217;re waiting for cross-core synchronization.&#8221;</p><p><strong>Alloca</strong>: &#8220;That&#8217;s why it takes so long. The more cores, the more coordination required.&#8221;</p><p><strong>Kernel</strong>: &#8220;Precisely. On a large machine, a single <code>munmap</code> can involve many cores being interrupted and synchronized. The cost tends to grow with the number of relevant cores, and it also depends on how I choose to invalidate the affected range, whether I flush individual pages or do a broader flush.&#8221;</p><p><strong>Alloca</strong>: &#8220;When else does this happen?&#8221;</p><p><strong>Kernel</strong>: &#8220;Anywhere I have to change or remove page-table entries that other CPUs might already have cached. <code>mprotect</code> is the obvious case: you change permissions, and the translation that other cores have cached is now wrong. The same thing happens during page reclaim and migration, when I unmap pages to move or free them. Copy-on-write faults in a multithreaded process can trigger it too, since other threads on other cores might have the old read-only translation cached. The more frequently these happen in a tight loop, the more cross-core coordination overhead you&#8217;re paying.&#8221;</p><p><strong>Alloca</strong>: &#8220;So freeing memory and changing mappings or permissions can force expensive cross-core coordination on large machines.&#8221;</p><p><strong>Kernel</strong>: &#8220;In the worst case, yes. The general principle is that page-table changes are not just local bookkeeping. 
On a multi-core machine, they can force cross-core synchronization before the operation is complete.&#8221;</p></blockquote><div class="callout-block" data-callout="true"><h3><strong>Key Takeaway</strong></h3><p>On a multi-core machine, each CPU core has its own TLB. This per-core design is essential for scalability: a shared TLB would be a massive bottleneck with dozens of cores competing for access. But it creates a consistency challenge: when the kernel modifies page table entries, other cores may still have cached the old translations.</p><p><code>munmap</code> is the system call that releases a mapping created by <code>mmap</code>. Allocators may also reduce the process heap with <code>brk</code>/<code>sbrk</code> or return large <code>mmap</code> allocations with <code>munmap</code>, but the common issue is the same: page table entries for a virtual address range are removed or changed. Clearing the page table isn&#8217;t enough. If another core still has a stale TLB entry pointing to a frame that has just been freed and potentially reassigned to another process, that core could access memory it shouldn&#8217;t, violating isolation.</p><p>The fix is a <strong>TLB shootdown</strong>: the kernel sends inter-processor interrupts (IPIs) to all CPUs that might hold stale mappings for that address space. Each interrupted CPU flushes the relevant TLB entries. For synchronous invalidations, the operation cannot safely complete until the targeted CPUs have performed the required flush. This forces cross-core synchronization before the operation can proceed.</p><p>Shootdown cost tends to grow with the number of targeted CPUs and with how disruptive the chosen flush strategy is. On x86, the kernel may invalidate individual pages or choose a broader TLB flush; the choice depends on the size of the range and the cost of flushing unrelated entries. 
On machines with many cores, <code>munmap</code> and <code>mprotect</code> on large regions can become significant bottlenecks.</p><p>TLB shootdowns arise whenever page-table mappings are modified: <code>mprotect</code> (permission changes), page reclaim and migration (unmapping pages to move or free them), and copy-on-write faults in multithreaded processes.</p><p>The practical implication is to minimize page table invalidations in hot paths. High-performance allocators reduce <code>munmap</code> frequency by caching freed memory and batching OS returns. Prefer reusing large, longer-lived mappings over repeatedly creating, protecting, unprotecting, and destroying small mappings.</p></div><div><hr></div><h2>NUMA (Non-Uniform Memory Access): The Physical Topology of Memory</h2><p><em>Alloca has been running smoothly. Her pages are backed by huge pages where possible, her working set fits comfortably in the TLB, and her threads coordinate to minimize expensive operations like </em><code>munmap</code><em>. She has dozens of worker threads, each processing data from a shared buffer in memory.</em></p><p><em>But something is wrong. 
She&#8217;s noticing a strange inconsistency: some of her threads complete their work quickly. Others, doing exactly the same computation on the same amount of data, take much longer. It&#8217;s not occasional, it&#8217;s consistent. Threads 0-23 are fast. Threads 24-47 are slow.</em></p><blockquote><p><strong>Alloca</strong>: &#8220;I don&#8217;t understand. Half of my threads are stuck waiting for memory while the other half run at full speed. They&#8217;re all doing the same work, accessing the same buffer. Why would memory be fast for some threads and slow for others?&#8221;</p><p><strong>Kernel</strong>: &#8220;Come with me. I want to show you something about the physical machine underneath your address space.&#8221;</p></blockquote><p><em>Kernel leads Alloca to a view she has never been shown before, not the virtual address space, but the physical hardware topology beneath it.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!C0e5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d4f5aae-17e0-4f10-a945-0a97a940b884_2880x1520.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!C0e5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d4f5aae-17e0-4f10-a945-0a97a940b884_2880x1520.png 424w, https://substackcdn.com/image/fetch/$s_!C0e5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d4f5aae-17e0-4f10-a945-0a97a940b884_2880x1520.png 848w, 
https://substackcdn.com/image/fetch/$s_!C0e5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d4f5aae-17e0-4f10-a945-0a97a940b884_2880x1520.png 1272w, https://substackcdn.com/image/fetch/$s_!C0e5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d4f5aae-17e0-4f10-a945-0a97a940b884_2880x1520.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!C0e5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d4f5aae-17e0-4f10-a945-0a97a940b884_2880x1520.png" width="1456" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1d4f5aae-17e0-4f10-a945-0a97a940b884_2880x1520.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 13: NUMA topology showing two CPU sockets, each with local memory. In this simplified model, each socket corresponds to one NUMA node, but real machines, particularly AMD EPYC systems, may expose more than one NUMA node per socket. Alloca&#8217;s buffer was initialized by a thread on Socket 0, so all physical frames landed on NUMA Node 0. Threads 0-23 running on Socket 0 get fast local DRAM access. Threads 24-47 running on Socket 1 must have their cache misses served from Node 0, crossing the inter-socket interconnect. 
Local DRAM latency is typically around ~100ns; remote DRAM access is often 1.5&#8211;3&#215; higher, though exact numbers vary by CPU generation, memory speed, and system topology.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 13: NUMA topology showing two CPU sockets, each with local memory. In this simplified model, each socket corresponds to one NUMA node, but real machines, particularly AMD EPYC systems, may expose more than one NUMA node per socket. Alloca&#8217;s buffer was initialized by a thread on Socket 0, so all physical frames landed on NUMA Node 0. Threads 0-23 running on Socket 0 get fast local DRAM access. Threads 24-47 running on Socket 1 must have their cache misses served from Node 0, crossing the inter-socket interconnect. Local DRAM latency is typically around ~100ns; remote DRAM access is often 1.5&#8211;3&#215; higher, though exact numbers vary by CPU generation, memory speed, and system topology." title="Figure 13: NUMA topology showing two CPU sockets, each with local memory. In this simplified model, each socket corresponds to one NUMA node, but real machines, particularly AMD EPYC systems, may expose more than one NUMA node per socket. Alloca&#8217;s buffer was initialized by a thread on Socket 0, so all physical frames landed on NUMA Node 0. Threads 0-23 running on Socket 0 get fast local DRAM access. Threads 24-47 running on Socket 1 must have their cache misses served from Node 0, crossing the inter-socket interconnect. Local DRAM latency is typically around ~100ns; remote DRAM access is often 1.5&#8211;3&#215; higher, though exact numbers vary by CPU generation, memory speed, and system topology." 
srcset="https://substackcdn.com/image/fetch/$s_!C0e5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d4f5aae-17e0-4f10-a945-0a97a940b884_2880x1520.png 424w, https://substackcdn.com/image/fetch/$s_!C0e5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d4f5aae-17e0-4f10-a945-0a97a940b884_2880x1520.png 848w, https://substackcdn.com/image/fetch/$s_!C0e5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d4f5aae-17e0-4f10-a945-0a97a940b884_2880x1520.png 1272w, https://substackcdn.com/image/fetch/$s_!C0e5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d4f5aae-17e0-4f10-a945-0a97a940b884_2880x1520.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Figure 13: NUMA topology showing two CPU sockets, each with local memory. In this simplified model, each socket corresponds to one NUMA node, but real machines, particularly AMD EPYC systems, may expose more than one NUMA node per socket. Alloca&#8217;s buffer was initialized by a thread on Socket 0, so all physical frames landed on NUMA Node 0. Threads 0-23 running on Socket 0 get fast local DRAM access. Threads 24-47 running on Socket 1 must have their cache misses served from Node 0, crossing the inter-socket interconnect. Local DRAM latency is typically around ~100ns; remote DRAM access is often 1.5&#8211;3&#215; higher, though exact numbers vary by CPU generation, memory speed, and system topology.</em></figcaption></figure></div><blockquote><p><strong>Kernel</strong>: &#8220;This server has two CPU sockets. Each socket has its own pool of RAM wired directly to it. When a CPU on socket 0 reads from memory attached to socket 0, it&#8217;s a short trip, maybe 100 nanoseconds. Fast.&#8221;</p><p><strong>Alloca</strong>: &#8220;And what about reading from the other socket&#8217;s memory?&#8221;</p><p><strong>Kernel</strong>: &#8220;That&#8217;s where the problem appears. Socket 0 and socket 1 are connected by an inter-socket link. When a CPU on socket 0 needs data from memory attached to socket 1, the request must cross that link. Round trip takes two to three times longer.&#8221;</p><p><strong>Alloca</strong>: &#8220;But my virtual address space&#8230; it&#8217;s just a flat range of addresses. 
How would I even know which memory is on which socket?&#8221;</p><p><strong>Kernel</strong>: &#8220;You don&#8217;t. That&#8217;s the problem. Your virtual addresses are completely abstract. Address <code>0x10000</code> and address <code>0x20000</code> look identical to you. But behind the scenes, one might map to a physical frame on socket 0, and the other to a frame on socket 1. The virtual memory system hides that completely.&#8221;</p><p><strong>Alloca</strong>: &#8220;So the physical location of my data determines performance, but I have no control over it?&#8221;</p><p><strong>Kernel</strong>: &#8220;You do have control, but it&#8217;s indirect. The key moment is when a page is first accessed. Remember demand paging? When you touch a page for the first time, I have to allocate a physical frame for it. At that moment, I need to decide which NUMA node to allocate from.&#8221;</p><p><strong>Alloca</strong>: &#8220;How do you decide?&#8221;</p><p><strong>Kernel</strong>: &#8220;By default, I use what&#8217;s called <em>first-touch</em> placement. Whichever CPU core triggers the page fault gets to decide. I allocate the frame from that core&#8217;s local NUMA node. So if your thread running on core 5 (which is on socket 0) is the first to touch a page, that page&#8217;s frame lands on socket 0&#8217;s memory pool.&#8221;</p><p><strong>Alloca</strong>: &#8220;Okay, so the first thread to touch a page determines where it lives physically.&#8221;</p><p><strong>Kernel</strong>: &#8220;Yes. Now think about what probably happened with your buffer. You likely had one thread, maybe your main thread that initialized the buffer. That thread touched every page in sequence, probably while running on socket 0. Every single page fault was handled by a CPU on socket 0, so every single frame landed on socket 0&#8217;s memory.&#8221;</p><p><strong>Alloca</strong>: &#8220;And then I handed that buffer to all my worker threads?&#8221;</p><p><strong>Kernel</strong>: &#8220;Right. 
And those threads are distributed across both sockets. Threads 0 through 23 run on socket 0; when they access the buffer, the memory is local and everything is fast. But threads 24 through 47 run on socket 1. Any cache miss they take resolves as a DRAM fetch, and that DRAM is on the wrong socket, so the access has to cross the inter-socket interconnect. That&#8217;s typically two to three times the latency of a local DRAM fetch.&#8221;</p><p><strong>Alloca</strong>: &#8220;That explains the performance split perfectly. So the thread that initializes the data and the threads that use it need to be on the same socket?&#8221;</p><p><strong>Kernel</strong>: &#8220;That&#8217;s one solution. For partitioned data where each thread works on its own section, you can have each thread initialize its own portion while pinned to the socket where it&#8217;ll do the work. The first-touch policy ensures the data lands locally.&#8221;</p><p><strong>Alloca</strong>: &#8220;What if the data is shared? All my threads are reading the same buffer.&#8221;</p><p><strong>Kernel</strong>: &#8220;Then you have a harder problem. No matter where you put the data, it&#8217;s local for some threads and remote for others. One approach is to use explicit NUMA policies. The <a href="https://man7.org/linux/man-pages/man2/mbind.2.html">mbind</a> system call lets you control allocation policy for a specific virtual address range.&#8221;</p><p><strong>Alloca</strong>: &#8220;What can I do with it?&#8221;</p><p><strong>Kernel</strong>: &#8220;Several things. You can bind a range to a specific NUMA node, forcing all its pages onto one socket&#8217;s memory. You can set a preferred node that&#8217;s tried first but allows fallback. 
Or you can interleave pages across nodes, where consecutive pages alternate between socket 0 and socket 1.&#8221;</p><p><strong>Alloca</strong>: &#8220;Why would I want to interleave?&#8221;</p><p><strong>Kernel</strong>: &#8220;Interleaving is useful for heavily shared data with high bandwidth demand. Think about it, if all your threads are hammering the same memory range, putting it all on one socket creates a bottleneck, all the traffic goes through one memory controller. With interleaving, each socket sees a mix of local and remote pages when scanning the range, but the bandwidth demand is spread across both memory controllers rather than concentrating on one. You&#8217;re trading some locality for better aggregate throughput.&#8221;</p><p><strong>Alloca</strong>: &#8220;Understood. Is there also the possibility of the scheduler moving my threads between sockets after I&#8217;ve set everything up?&#8221;</p><p><strong>Kernel</strong>: &#8220;Yes, in that case your careful placement falls apart. If a thread that was running on socket 0 with local memory gets migrated to socket 1, then suddenly all its memory is remote. This is why NUMA-sensitive workloads typically pin threads to specific CPUs using <a href="https://man7.org/linux/man-pages/man1/taskset.1.html">taskset</a> or <a href="https://man7.org/linux/man-pages/man3/pthread_setaffinity_np.3.html">pthread_setaffinity_np</a>.&#8221;</p><p><strong>Alloca</strong>: &#8220;So the typical pattern is: decide which threads work on which data, pin those threads to the appropriate socket&#8217;s cores, and make sure the thread that first touches the data is running on the right socket so first-touch puts the frames locally.&#8221;</p><p><strong>Kernel</strong>: &#8220;That&#8217;s the basic approach for thread-private or partitioned data. For shared data, you either accept that some accesses will be remote, or you interleave to balance the load. 
There&#8217;s no perfect solution when multiple sockets need heavy access to the same memory. You&#8217;re always trading off between locality and bandwidth distribution.&#8221;</p></blockquote><div class="callout-block" data-callout="true"><p style="text-align: center;"><strong>Aside: Automatic NUMA balancing</strong></p><p>Linux also provides automatic NUMA balancing, controlled via <code>/proc/sys/kernel/numa_balancing</code>. When enabled, the kernel periodically samples a task&#8217;s memory by temporarily unmapping pages, or marking them so that the next access triggers a NUMA hinting fault. The fault lets the kernel record which CPU or NUMA node is actually accessing the page. Based on those faults, the kernel may migrate pages toward the node that uses them, or move tasks closer to their memory. This can improve placement without code changes, though the sampling faults and migrations add overhead and are not guaranteed to help every workload.</p><p>The downside is that it is reactive: it adapts after the fact rather than placing memory correctly from the start. For workloads where latency consistency matters, deliberate placement with <code>mbind</code> and thread pinning is more reliable. For workloads where access patterns are hard to predict or partition, automatic balancing can be a reasonable hands-off alternative.</p></div><div class="callout-block" data-callout="true"><h3><strong>Key Takeaway</strong></h3><p>Modern multi-socket servers are <em>NUMA</em> (Non-Uniform Memory Access) systems. Physical memory is divided into <em>NUMA nodes</em>, each directly attached to one CPU socket. A CPU can access memory on any node, but local access is noticeably faster than remote access, which must traverse the inter-socket interconnect.</p><p>The virtual address space hides this topology completely: two adjacent virtual pages may be backed by physical frames on different NUMA nodes. 
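</p><p>You can inspect the topology that the flat address space hides directly from sysfs, and apply the placement policies described above from the command line with <code>numactl</code> (the <code>./app</code> invocations below are illustrative placeholders, not commands from this article&#8217;s experiments):</p><div class="highlighted_code_block" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-shell"># List NUMA nodes and the CPUs attached to each (no extra tools needed)
for node in /sys/devices/system/node/node*/; do
  echo "$(basename "$node"): CPUs $(cat "${node}cpulist")"
done

# Illustrative numactl invocations (./app is a placeholder):
#   numactl --cpunodebind=0 --membind=0 ./app   # run on node 0 CPUs, allocate from node 0
#   numactl --interleave=all ./app              # interleave pages across all nodes</code></pre></div><p>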
The NUMA node of a physical frame is primarily determined at allocation time by the kernel&#8217;s memory policy.</p><p>The kernel&#8217;s default policy for anonymous memory is effectively <strong>first-touch</strong>: when a page is first faulted into a real physical frame, it is usually allocated from the NUMA node local to the CPU handling that fault. If initialization and hot access happen on different sockets, most DRAM accesses will pay remote latency.</p><p>Strategies for NUMA-aware operation:</p><ul><li><p><strong>Initialize on the accessing socket</strong>: for partitioned data, the thread that will perform the hot accesses should also touch pages first, placing frames on the local node.</p></li><li><p><strong>Thread pinning</strong>: bind threads to specific CPUs with <code>taskset</code> or <code>pthread_setaffinity_np</code> to prevent cross-socket migration.</p></li><li><p><code>mbind</code><strong> / </strong><code>set_mempolicy</code>: per-range NUMA allocation policy in code.</p></li><li><p><strong><a href="https://linux.die.net/man/8/numactl">numactl</a></strong>: command-line wrapper to set NUMA policy for an entire process.</p></li><li><p><strong>Interleaving</strong>: for heavily shared data accessed across sockets, interleaving pages across nodes distributes bandwidth demand across multiple memory controllers. Each socket sees a mix of local and remote pages, but no single memory controller becomes a bottleneck.</p></li><li><p><strong>Automatic NUMA balancing</strong>: the kernel can be configured to sample memory access patterns at runtime and migrate pages or tasks toward the nodes that use them most (<code>/proc/sys/kernel/numa_balancing</code>). It requires no code changes but is reactive rather than proactive: it adapts after observing bad placement rather than preventing it. 
For latency-sensitive workloads, deliberate placement is more reliable.</p></li></ul><p>For shared data accessed heavily by multiple sockets, no placement is perfect: the trade-off is between locality, bandwidth balance, and sometimes deliberate replication.</p><p>For data-intensive workloads on multi-socket servers, NUMA is often the dominant source of unexplained memory latency once TLB and cache behavior have been addressed.</p></div><div><hr></div><h2>Observing Virtual Memory in Practice</h2><p>Our journey through the virtual memory world with Alloca ends here. We have covered the machinery of the modern Linux kernel from first principles. For this final section, I will switch back to my normal voice and cover the observability and debugging tools that let you actually see what is happening in a running system.</p><p>Understanding the mechanisms is one thing; knowing where to look when something goes wrong is another. Memory problems tend to disguise themselves. A process using more memory than expected, a workload that fits in RAM but still feels sluggish, a system that gradually slows down under load &#8212; each of these points to a different layer of the VM stack. The tools below correspond to those layers. Work through them in order when you are unsure where the problem lives.</p><h3>Step 1: What address ranges does the process have?</h3><p>Before anything else, look at what the process has actually mapped. <code>/proc/&lt;pid&gt;/maps</code> lists every VMA: the virtual address range, the permissions (<code>r</code>, <code>w</code>, <code>x</code>, and <code>p</code>/<code>s</code> for private or shared), the offset into any backing file, and the file name if there is one. You can see the heap, the stack, the shared libraries, and any <code>mmap</code> regions all in one place.</p><p>This is the <em>reservation</em> view. 
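</p><p>For example, dumping the reservation view of the current shell (any process you own works the same way; each line is one VMA, showing the address range, permissions, offset, device, inode, and pathname):</p><div class="highlighted_code_block" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-shell">head -n 5 /proc/self/maps</code></pre></div><p>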
It tells you what address ranges exist and what they are allowed to do, but says nothing about how much physical memory is actually backing them. A region that looks large here might have almost no physical pages behind it; demand paging means pages are only allocated on first touch. <code>pmap -x &lt;pid&gt;</code> presents the same information in a slightly more readable table format.</p><h3>Step 2: How much physical memory is the process actually using?</h3><p><code>smaps</code> is <code>maps</code> extended with a full accounting breakdown for every VMA. It tells us &#8220;what is actually in RAM.&#8221; The key fields to understand:</p><ul><li><p><code>Rss</code> (Resident Set Size): how many kilobytes of that VMA are currently in physical RAM. Pages that have never been touched, clean file-backed pages that have been reclaimed, or anonymous pages that have been swapped out all contribute nothing here.</p></li><li><p><code>Pss</code> (Proportional Set Size): like Rss, but shared pages are divided proportionally among all processes that map them. If ten processes share a 4 KB library page, each is charged 0.4 KB.</p></li><li><p><code>Private_Clean</code><strong> / </strong><code>Private_Dirty</code>: pages private to this process that either still match their backing file (clean) or have been written to and diverged (dirty).</p></li><li><p><code>Shared_Clean</code><strong> / </strong><code>Shared_Dirty</code>: pages shared with other processes. Clean shared pages, like read-only library code, are cheap to reclaim. Dirty shared pages need to be cleaned first: file-backed ones require writeback to disk, while shmem/tmpfs dirty pages go to swap instead.</p></li><li><p><code>AnonHugePages</code>: how many bytes of this VMA are backed by transparent huge pages. If you want to verify that THP is actually working for a particular region, this is the field to check.</p></li></ul><p>For the system-wide picture, <code>/proc/meminfo</code> is the companion. 
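</p><p>A one-liner pulls the interesting fields:</p><div class="highlighted_code_block" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-shell">grep -E 'MemAvailable|^Cached|^Dirty|^Writeback|AnonPages|^Swap' /proc/meminfo</code></pre></div><p>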
The fields worth checking are <code>MemAvailable</code> (the kernel&#8217;s estimate of how much can be freed without touching swap), <code>Cached</code> (page cache, most of which is reclaimable), <code>Dirty</code> and <code>Writeback</code> (pages queued for or actively being written back), <code>AnonPages</code> (anonymous pages currently in RAM), and the swap fields: <code>SwapTotal</code>, <code>SwapFree</code>.</p><h3>Step 3: Is the process triggering disk I/O through page faults?</h3><p>Page faults are the mechanism that connects virtual addresses to physical memory, and they come in two very different varieties.</p><p><strong>Minor faults</strong> (<code>ru_minflt</code> via <code>getrusage</code>) are resolved without any disk I/O. They involve a kernel trap and some bookkeeping, but no waiting for storage. A large number of minor faults during startup is perfectly normal.</p><p><strong>Major faults</strong> (<code>ru_majflt</code> via <code>getrusage</code>, or <code>major-faults</code> in <code>perf stat</code>) are a different story. These required actual disk I/O, either reading a cold file page from storage, or bringing a page back from swap. On spinning disks, a major fault can easily take several milliseconds; on NVMe it might be a few hundred microseconds. Either way, sustained major faults in a steady-state hot path are a warning sign. 
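</p><p>The same counters are visible without <code>perf</code>, via fields 10 (<code>minflt</code>) and 12 (<code>majflt</code>) of <code>/proc/&lt;pid&gt;/stat</code> (the field positions assume the comm field contains no spaces):</p><div class="highlighted_code_block" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-shell"># field 10 = minor faults, field 12 = major faults
awk '{print "minor faults:", $10, " major faults:", $12}' /proc/self/stat</code></pre></div><p>For a trivial command like this the major count is almost always zero; it is sustained major faults in a steady-state workload that matter. 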
They usually point to swap pressure, uncached memory-mapped file I/O, or a working set that is competing with the rest of the system for physical memory.</p><p>To measure fault counts for a single run:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;shell&quot;,&quot;nodeId&quot;:&quot;c591477d-5826-4b08-8a17-59b7bf1e2dd9&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-shell">perf stat -e page-faults,major-faults ./your-program</code></pre></div><p><code>page-faults</code> counts total faults; minor faults are approximately the difference from major.</p><h3>Step 4: Is the whole system under memory pressure?</h3><p>Once you have the process-level picture, zoom out to see whether the kernel itself is struggling.</p><p><code>vmstat 1</code> samples every second. The columns to watch are <code>si</code> and <code>so</code> (swap-in and swap-out in KiB per second). Nonzero <code>so</code> means the kernel is writing pages to swap because reclaim pressure has reached anonymous memory. Nonzero <code>si</code> means pages are being faulted back in. Both at once is the classic thrashing pattern. The <code>b</code> column counts tasks currently blocked on I/O, which includes swap I/O.</p><p>Pressure Stall Information (PSI) at <code>/proc/pressure/memory</code> gives a finer picture. It reports the fraction of time tasks spent stalled waiting for memory: <code>some</code> means at least one task was stalled; <code>full</code> means all non-idle tasks were stalled simultaneously, i.e., the system was making zero forward progress. A machine where the <code>full</code> metric is climbing steadily is one where memory has become a genuine bottleneck: not just busy, but actively blocking work from completing.</p><h3>Step 5: Is translation itself the bottleneck?</h3><p>TLB misses are almost entirely invisible to the kernel. 
The MMU handles them in hardware via page-table walks; the kernel only gets involved if the walk faults because the page isn&#8217;t present. To observe TLB behavior you have to go to the hardware performance counters, which <code>perf</code> exposes.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;shell&quot;,&quot;nodeId&quot;:&quot;2bc0c159-6666-4b8c-a9c7-15925b1c2362&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-shell">perf stat -e dTLB-load-misses,dTLB-store-misses,iTLB-load-misses ./your-program</code></pre></div><p><code>dTLB-load-misses</code> and <code>dTLB-store-misses</code> count data TLB misses on loads and stores respectively. <code>iTLB-load-misses</code> tracks instruction TLB misses, which matters when the code footprint is large or when working with JIT-compiled code. Note that the event names vary by CPU generation; <code>perf list | grep -i tlb</code> shows what your machine exposes.</p><p>As we learned in the article, a high TLB miss count alone doesn&#8217;t tell you much; what matters is whether those misses are triggering expensive page-table walks. A miss that hits the second-level TLB is relatively cheap, but one that requires a full hardware page walk is not. For the actual walk cost, look for events like <code>dtlb_load_misses.walk_active</code> on Intel processors, which counts cycles spent actively walking page tables.</p><p>High TLB miss rates combined with low major-fault counts (data is in RAM but translations are not cached) point to a working set that has outgrown the TLB hierarchy. 
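</p><p>As a back-of-the-envelope sketch, turning raw counters into a miss rate (the numbers below are made up for illustration, not taken from a real run):</p><div class="highlighted_code_block" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-shell"># hypothetical counts from a perf stat run
loads=1200000000
dtlb_misses=18000000
awk -v m="$dtlb_misses" -v l="$loads" 'BEGIN { printf "dTLB miss rate: %.2f%%\n", 100 * m / l }'
# prints: dTLB miss rate: 1.50%</code></pre></div><p>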
The remedies are the ones covered earlier: huge pages to reduce the number of entries needed, or tighter data packing to reduce the number of distinct pages touched.</p><h3>Step 6: Are some threads slower than others on identical work?</h3><p>If some threads consistently take longer than others doing the same computation, and the disparity is stable rather than random, NUMA placement is the first thing to check.</p><p><code>numactl --hardware</code> shows the machine&#8217;s NUMA topology: the number of nodes, memory per node, and the distance matrix between nodes. The distance matrix is a relative latency measure. This tells you the penalty being paid per remote access.</p><p><code>numastat -p &lt;pid&gt;</code> shows where a process&#8217;s pages actually live. If the bulk of the pages are on node 0 but the threads doing the work are running on node 1, that is first-touch misalignment in practice. <code>/proc/&lt;pid&gt;/numa_maps</code> provides the same information per VMA, including which NUMA policy is in effect for each region and how many pages have landed on each node. It is verbose but precise when you need to understand why a specific mapping ended up where it did.</p><h3>Putting it together</h3><p>Virtual memory problems almost always start as a vague symptom. The right approach is to peel back layers in order rather than guessing:</p><ol><li><p><strong>Is memory actually being used, or just reserved?</strong> Compare VMA size in <code>maps</code> to Rss in <code>smaps</code>. Large reserved-but-not-resident regions are normal (lazy allocation). Unexpectedly large Rss is the real signal.</p></li><li><p><strong>Is the process responsible for that memory, or is it shared?</strong> Compare Rss to Pss. 
If Rss is large but Pss is small, you&#8217;re mostly mapping shared libraries or shared regions that other processes are also paying for.</p></li><li><p><strong>Is the process triggering frequent disk I/O through page faults?</strong> Check major fault count via <code>perf stat</code> or <code>getrusage</code>. Sustained major faults in a steady-state workload usually mean swap pressure, uncached mmap/file-backed I/O, or a working set that does not fit in available RAM or page cache.</p></li><li><p><strong>Is the system reclaiming memory aggressively?</strong> Check <code>vmstat</code> for swap-in/out activity and PSI for actual stall time. High <code>si</code>/<code>so</code> with high PSI <code>full</code> is a system in memory distress.</p></li><li><p><strong>Is translation overhead high even with data fully in RAM?</strong> Check TLB miss rates and page-walk cycles via <code>perf stat</code>. High miss rates with low fault counts point to a working set that has outgrown the TLB, a case for huge pages or tighter data packing.</p></li><li><p><strong>Are some threads consistently slower than others on the same work?</strong> Check NUMA placement via <code>numastat -p</code> and <code>/proc/&lt;pid&gt;/numa_maps</code>. Asymmetric slowness with equal work is a NUMA symptom, but confirm it against CPU placement, page placement, and other sources of per-core variation such as thermal throttling, IRQ affinity, or lock contention.</p></li></ol><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.codingconfessions.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading. This was one of the most ambitious pieces I&#8217;ve written for this publication. 
If you found it useful, consider becoming a paid subscriber; it directly supports more long-form systems writing. You can also purchase the <a href="https://codingconfessions.gumroad.com/l/iuqsqm">PDF version</a> of this article to support this work.</p></div></div></div><div><hr></div><h2>What We&#8217;ve Learned</h2><p>In this article, we explored virtual memory through a dialogue between the kernel and a user-space process named Alloca. Along the way, we covered a lot of ground: address spaces, page tables, TLBs, demand paging, memory types, page reclaim, copy-on-write, mmap, huge pages, TLB shootdowns, NUMA, observability, and more.</p><p>Let&#8217;s end this article with a summary of everything we learned.</p><p><strong>Providing memory-level isolation </strong>is the foundational problem that virtual memory solves. Each process gets its own private set of virtual addresses, and the MMU enforces the boundaries between them. No process can directly read or write another&#8217;s memory.</p><p><strong>Giving the address space structure</strong> is the next step. The virtual address space is divided into segments like code, data, heap, and stack, each with different permissions and growth behavior. Code is read-only and executable; the stack grows down on demand; the heap grows up through allocator requests.</p><p><strong>Mapping every byte to a physical location is impractical.</strong> A flat table covering the full 128 TB user address space would itself consume 256 GB. 
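</p><p>The arithmetic checks out: 128 TB of address space divided into 4 KB pages, at 8 bytes per page-table entry:</p><div class="highlighted_code_block" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-shell"># 2^47 bytes / 2^12 bytes per page = 2^35 entries; times 8 bytes each = 2^38 bytes
echo "$(( 128 * 1024 * 1024 * 1024 * 1024 / 4096 * 8 / 1024 / 1024 / 1024 )) GiB"
# prints: 256 GiB</code></pre></div><p>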
The solution is fixed-size pages and frames with hierarchical page tables: memory is divided into 4 KB chunks, any frame can back any page, and the page table hierarchy only allocates levels for address ranges actually in use.</p><p><strong>Walking four levels of page table on every memory access would be too slow.</strong> The TLB caches recent virtual-to-physical translations so that most accesses skip the walk entirely. Hit rate depends on access patterns and how tightly the working set fits within the number of TLB entries available.</p><p><strong>Allocating physical frames at malloc time wastes memory.</strong> Demand paging defers the allocation: when a process reserves memory, the kernel records the promise in a VMA but does not assign physical frames. Frames are allocated only on first access, when a page fault fires.</p><p><strong>Not all pages cost the same to evict.</strong> The kernel distinguishes anonymous memory (heap, stack, and MAP_ANONYMOUS regions), file-backed memory (executables, shared libraries, mmap&#8217;d files), and tmpfs-backed shared memory. Clean file-backed pages can be dropped immediately and reloaded from disk. Dirty file-backed pages must be written back first. Anonymous and tmpfs pages need swap space because there is no file to reload them from.</p><p><strong>Physical memory fills up.</strong> Page reclaim is the kernel&#8217;s mechanism for freeing frames under pressure. It uses hardware-maintained accessed bits to estimate recency without trapping every access, reverse mappings (rmap) to find which page table entries point to a given frame, and active/inactive LRU lists to identify cold pages. The goal is to evict cold pages while keeping hot working sets in RAM. Evicting pages that will soon be needed again causes thrashing.</p><p><strong>Copying all of a process&#8217;s memory on fork is too slow.</strong> Copy-on-write shares physical frames between parent and child after fork. 
Pages are only copied when one side actually writes to them, tracked with per-frame reference counts. This makes fork nearly instantaneous regardless of address space size.</p><p><strong>File I/O through a user buffer requires an extra copy.</strong> mmap maps page cache frames directly into the process address space, allowing the process to read file data without a separate copy from kernel buffer to user buffer. Multiple processes mapping the same file share the same physical frames.</p><p><strong>Random access patterns scatter across too many pages.</strong> Sequential access reuses a small sliding window of TLB entries and benefits from cached translations being reused and from hardware prefetching in the cache. Random access, such as hash table probes and pointer chasing, offers no such guarantees and can suffer from unpredictable performance.</p><p><strong>Large working sets exhaust TLB capacity.</strong> Huge pages (2 MB or 1 GB on x86-64) can allow a single TLB entry to cover orders of magnitude more memory than a standard 4 KB page. The constraint is physical contiguity: huge pages require large, aligned, contiguous blocks of physical memory, which become harder to find as memory fragments over time.</p><p><strong>Unmapping pages on a multi-core machine requires cross-core coordination.</strong> Each CPU core has its own TLB. When the kernel removes or changes a page table mapping, other cores may still hold the old translation cached. A TLB shootdown sends inter-processor interrupts to all relevant cores, forcing them to flush stale entries before the operation can complete. This is why munmap and mprotect on large regions can be expensive on machines with many cores.</p><p><strong>Virtual memory hides the physical topology of memory.</strong> On multi-socket NUMA servers, physical memory is divided into nodes, each attached to one socket. Remote memory accesses (those that cross the inter-socket interconnect) are 1.5&#8211;3&#215; slower than local ones. 
The virtual address space makes both look identical. Correct NUMA placement requires co-locating threads with their data and using first-touch initialization, thread pinning, or explicit mbind policies.</p>]]></content:encoded></item><item><title><![CDATA[How PyTorch Generates Random Numbers in Parallel on the GPU]]></title><description><![CDATA[A deep dive into Philox and counter-based RNGs]]></description><link>https://blog.codingconfessions.com/p/how-pytorch-generates-random-numbers</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/how-pytorch-generates-random-numbers</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Thu, 18 Dec 2025 10:26:29 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/46d5cb05-e44f-40a7-a292-3eda768af57d_1536x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>GPUs power modern deep learning models because these models rely on tensor operations, which can be efficiently parallelized on GPUs with their thousands of cores. 
However, apart from tensor computations, these models also rely on random numbers: to initialize model weights, apply dropout, sample data, drive stochastic gradient descent, and more.</p><p>So the question arises: how do frameworks like PyTorch generate random numbers in parallel on GPU devices? If random number generation becomes a bottleneck, it can significantly slow down the entire training or inference pipeline.</p><p>The answer lies in a clever algorithm called <strong>Philox</strong>, a counter-based parallel random number generator. In this article, we&#8217;ll explore:</p><ol><li><p>Why traditional random number generators don&#8217;t parallelize well</p></li><li><p>How Philox works and what makes it different</p></li><li><p>How to parallelize random number generation using Philox</p></li><li><p>PyTorch&#8217;s implementation of Philox by dissecting its C++ and CUDA code</p></li></ol><p>By the end, you&#8217;ll understand how that simple <code>torch.randn()</code> call efficiently generates millions of random numbers in parallel on your GPU while maintaining perfect reproducibility.</p><div><hr></div><h3>Cut Code Review Time &amp; Bugs in Half (<em>Sponsored</em>)</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://coderabbit.link/abhi" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nW92!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4dec8f6-3909-4efb-b2b4-9d232bc7de59_1600x800.png 424w, https://substackcdn.com/image/fetch/$s_!nW92!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4dec8f6-3909-4efb-b2b4-9d232bc7de59_1600x800.png 848w, 
https://substackcdn.com/image/fetch/$s_!nW92!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4dec8f6-3909-4efb-b2b4-9d232bc7de59_1600x800.png 1272w, https://substackcdn.com/image/fetch/$s_!nW92!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4dec8f6-3909-4efb-b2b4-9d232bc7de59_1600x800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nW92!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4dec8f6-3909-4efb-b2b4-9d232bc7de59_1600x800.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c4dec8f6-3909-4efb-b2b4-9d232bc7de59_1600x800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://coderabbit.link/abhi&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nW92!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4dec8f6-3909-4efb-b2b4-9d232bc7de59_1600x800.png 424w, https://substackcdn.com/image/fetch/$s_!nW92!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4dec8f6-3909-4efb-b2b4-9d232bc7de59_1600x800.png 848w, 
https://substackcdn.com/image/fetch/$s_!nW92!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4dec8f6-3909-4efb-b2b4-9d232bc7de59_1600x800.png 1272w, https://substackcdn.com/image/fetch/$s_!nW92!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4dec8f6-3909-4efb-b2b4-9d232bc7de59_1600x800.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Code reviews are critical but time-consuming. 
CodeRabbit acts as your AI co-pilot, providing instant code review comments and assessing the potential impact of every pull request.</p><p>Beyond just flagging issues, CodeRabbit provides one-click fix suggestions and lets you define custom code quality rules using AST Grep patterns, catching subtle issues that traditional static analysis tools might miss.</p><p>CodeRabbit has so far reviewed more than 10 million PRs, is installed on 2 million repositories, and is used by 100 thousand open-source projects. CodeRabbit is free for all open-source repos.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://coderabbit.link/abhi&quot;,&quot;text&quot;:&quot;Get Started Today&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://coderabbit.link/abhi"><span>Get Started Today</span></a></p><div><hr></div><h2>Problem with Traditional PRNGs</h2><p>Let&#8217;s start by developing an intuition about why traditional pseudorandom number generators (PRNGs) are sequential and thus unsuitable for parallel hardware such as GPUs.</p><p>A PRNG needs to be able to reproduce the same sequence of random numbers when initialized with a specific seed. A natural way of achieving this is through a state transformation function that takes the current state of the generator as input and produces a new state. As long as the function is deterministic, it is guaranteed that we can reproduce the exact same sequence of numbers starting from the same initial state. Mathematically, it can be expressed like this:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{s}_{n+1} = \\text{f} \\text{(s}_n)&quot;,&quot;id&quot;:&quot;NURVVXHCKV&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, the next state is derived by applying the function <code>f</code> to the current state <code>s_n</code>. 
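</p><p>To make the sequential dependence concrete, here is a minimal sketch of a linear congruential generator (LCG) in Python. The constants are the classic Numerical Recipes values, chosen purely for illustration; this is not the generator PyTorch uses.</p>

```python
# A linear congruential generator: every state is derived from the
# previous one, so producing the nth number means computing all
# n-1 states before it.
def lcg_sequence(seed, n):
    a, c, m = 1664525, 1013904223, 2**32  # illustrative LCG constants
    s = seed
    out = []
    for _ in range(n):
        s = (a * s + c) % m  # s_{n+1} = f(s_n): a loop-carried dependence
        out.append(s)
    return out
```

<p>The loop-carried dependence on <code>s</code> is exactly what prevents sharding this kind of generator across thousands of GPU threads.</p><p>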
As you can see, this is a sequential model where you can&#8217;t jump ahead arbitrarily without computing all the previous states, and you can&#8217;t shard the generation of the random numbers by distributing the work across threads.</p><p>To parallelize the generation of random numbers, we need a different model where we can directly generate the nth random number without having to go through the generation of all the previous n-1 numbers. Mathematically, it should look like this:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;x_{n} = b(n)&quot;,&quot;id&quot;:&quot;DNCZCFZDRN&quot;}" data-component-name="LatexBlockToDOM"></div><p>Where <code>x_n</code> is the nth random number we wish to generate by applying a function <code>b</code>. Here, we can think of the input n as an integer counter and as such the PRNGs that follow this model are called counter-based random number generators. One such counter-based PRNG is the Philox PRNG, used widely in frameworks such as PyTorch for parallel random number generation on GPUs. </p><p>Let&#8217;s understand how Philox works.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.codingconfessions.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"><em>Writing these deep dives takes 100+ hours of work. 
If you find this valuable and insightful, please consider upgrading to a paid subscription to keep this work alive.</em></p></div></div></div><div><hr></div><h2>How Philox Works</h2><p>The Philox algorithm, short for &#8220;Product HI, LOw, with XOR&#8221;, is a counter-based PRNG designed specifically for parallel computation. It was introduced by Salmon et al. in 2011 as part of the <a href="https://www.thesalmons.org/john/random123/papers/random123sc11.pdf">Random123 library</a>. The key insight behind Philox is that we can use a cipher-like construction to transform a counter into a pseudorandom number.</p><h3>The Core Idea: Treating RNG as Encryption</h3><p>We can think of the counter-based RNG problem this way: we want to take a sequence of integers (0, 1, 2, 3, &#8230;) and scramble them so thoroughly that they appear random. 
This is conceptually similar to what a <a href="https://en.wikipedia.org/wiki/Block_cipher">block cipher</a> does in cryptography: it takes a plaintext message and a key, then produces a ciphertext that looks random.</p><p>In Philox&#8217;s case:</p><ul><li><p>The <strong>counter</strong> (n) acts like the plaintext</p></li><li><p>The <strong>seed</strong> acts like the encryption key</p></li><li><p>The output is our pseudorandom number</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!67va!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64290860-d8d0-4060-8f8a-c82c3a586563_559x220.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!67va!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64290860-d8d0-4060-8f8a-c82c3a586563_559x220.png 424w, https://substackcdn.com/image/fetch/$s_!67va!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64290860-d8d0-4060-8f8a-c82c3a586563_559x220.png 848w, https://substackcdn.com/image/fetch/$s_!67va!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64290860-d8d0-4060-8f8a-c82c3a586563_559x220.png 1272w, https://substackcdn.com/image/fetch/$s_!67va!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64290860-d8d0-4060-8f8a-c82c3a586563_559x220.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!67va!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64290860-d8d0-4060-8f8a-c82c3a586563_559x220.png" width="559" height="220" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64290860-d8d0-4060-8f8a-c82c3a586563_559x220.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:220,&quot;width&quot;:559,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:13904,&quot;alt&quot;:&quot;Philox takes a counter and a key (derived from the seed) as its input and produces a random number as its output&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/181426624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64290860-d8d0-4060-8f8a-c82c3a586563_559x220.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Philox takes a counter and a key (derived from the seed) as its input and produces a random number as its output" title="Philox takes a counter and a key (derived from the seed) as its input and produces a random number as its output" srcset="https://substackcdn.com/image/fetch/$s_!67va!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64290860-d8d0-4060-8f8a-c82c3a586563_559x220.png 424w, https://substackcdn.com/image/fetch/$s_!67va!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64290860-d8d0-4060-8f8a-c82c3a586563_559x220.png 848w, 
https://substackcdn.com/image/fetch/$s_!67va!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64290860-d8d0-4060-8f8a-c82c3a586563_559x220.png 1272w, https://substackcdn.com/image/fetch/$s_!67va!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64290860-d8d0-4060-8f8a-c82c3a586563_559x220.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Philox takes a counter and a key (derived from the seed) as its input and produces a random number as its output</figcaption></figure></div><p>The beauty of this approach is that any thread can independently compute its random number by knowing just two things: which counter value it needs (its position in the sequence) and the seed. No synchronization or communication with other threads is needed.</p><h3>The Philox Construction</h3><p>Philox operates on fixed-size inputs and outputs. The most common variant is <strong>Philox-4x32</strong>, which means:</p><ul><li><p><strong>4</strong>: Works with 4 32-bit integers at a time</p></li><li><p><strong>32</strong>: Each integer is 32 bits wide</p></li></ul><p>So Philox-4x32 takes a 128-bit counter (represented as four 32-bit integers) and produces a 128-bit output (four 32-bit random numbers). This is perfect for generating multiple random numbers at once, which is common in GPU workloads.</p><p>The algorithm consists of applying multiple <strong>rounds</strong> of a transformation function. 
Each round performs these operations:</p><ol><li><p><strong>Multiplication and splitting</strong>: Multiply pairs of the input integers and split the results into high and low parts</p></li><li><p><strong>XOR with keys</strong>: XOR certain parts with key-derived values</p></li><li><p><strong>Permutation</strong>: Shuffle the positions of the integers</p></li></ol><p>Let&#8217;s break down a single round in detail. Philox-4x32 works with four 32-bit values, which we&#8217;ll call (<em>c</em>0&#8203;,<em>c</em>1&#8203;,<em>c</em>2&#8203;,<em>c</em>3&#8203;). Each round transforms these values through the following steps:</p><h4><strong>Step 1: Multiply and Split</strong></h4><p>Take the first pair (<em>c</em>0&#8203;,<em>c</em>1&#8203;) and the second pair (<em>c</em>2&#8203;,<em>c</em>3&#8203;). Multiply each by a carefully chosen constant:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\\text{prod}_0 &amp;= M_0 \\times c_0 \\\\\n\\text{prod}_1 &amp;= M_1 \\times c_2\n\\end{align}\n&quot;,&quot;id&quot;:&quot;MBYYRCZSMB&quot;}" data-component-name="LatexBlockToDOM"></div><p>For Philox-4x32, these constants are:</p><ul><li><p><em>M</em>0&#8203;=0xD2511F53</p></li><li><p><em>M</em>1&#8203;=0xCD9E8D57</p></li></ul><p>These constants were chosen through careful analysis to ensure good statistical properties. When we multiply two 32-bit numbers, we get a 64-bit result. 
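</p><p>As a quick sketch, this multiply-and-split step can be written in a few lines of Python; the helper name <code>mulhilo32</code> is ours, not PyTorch&#8217;s:</p>

```python
M0 = 0xD2511F53  # Philox-4x32 multiplier for the first pair

def mulhilo32(a, b):
    """Multiply two 32-bit values and split the 64-bit product
    into its high and low 32-bit halves."""
    prod = (a & 0xFFFFFFFF) * (b & 0xFFFFFFFF)  # at most 64 bits
    return prod >> 32, prod & 0xFFFFFFFF        # (hi, lo)
```

<p>On GPUs this maps to a single widening multiply; CUDA exposes the high half directly via the <code>__umulhi</code> intrinsic.</p><p>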
We split this into:</p><ul><li><p><strong>High 32 bits</strong>: hi(prod)</p></li><li><p><strong>Low 32 bits</strong>: lo(prod)</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!esaX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06cdfa72-4c2f-4d77-b65b-81226f1aaecf_223x420.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!esaX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06cdfa72-4c2f-4d77-b65b-81226f1aaecf_223x420.png 424w, https://substackcdn.com/image/fetch/$s_!esaX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06cdfa72-4c2f-4d77-b65b-81226f1aaecf_223x420.png 848w, https://substackcdn.com/image/fetch/$s_!esaX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06cdfa72-4c2f-4d77-b65b-81226f1aaecf_223x420.png 1272w, https://substackcdn.com/image/fetch/$s_!esaX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06cdfa72-4c2f-4d77-b65b-81226f1aaecf_223x420.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!esaX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06cdfa72-4c2f-4d77-b65b-81226f1aaecf_223x420.png" width="223" height="420" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/06cdfa72-4c2f-4d77-b65b-81226f1aaecf_223x420.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:420,&quot;width&quot;:223,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:12749,&quot;alt&quot;:&quot;The multiplication of two 32-bit values c0 and M0 produces a 64-bit result which is split into hi and lo parts&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/181426624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06cdfa72-4c2f-4d77-b65b-81226f1aaecf_223x420.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The multiplication of two 32-bit values c0 and M0 produces a 64-bit result which is split into hi and lo parts" title="The multiplication of two 32-bit values c0 and M0 produces a 64-bit result which is split into hi and lo parts" srcset="https://substackcdn.com/image/fetch/$s_!esaX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06cdfa72-4c2f-4d77-b65b-81226f1aaecf_223x420.png 424w, https://substackcdn.com/image/fetch/$s_!esaX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06cdfa72-4c2f-4d77-b65b-81226f1aaecf_223x420.png 848w, https://substackcdn.com/image/fetch/$s_!esaX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06cdfa72-4c2f-4d77-b65b-81226f1aaecf_223x420.png 1272w, 
https://substackcdn.com/image/fetch/$s_!esaX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06cdfa72-4c2f-4d77-b65b-81226f1aaecf_223x420.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The multiplication of two 32-bit values c0 and M0 produces a 64-bit result which is split into hi and lo parts</figcaption></figure></div><h4><strong>Step 2: XOR with Keys</strong></h4><p>The high parts are XORed with round-specific keys derived from the seed, and with the other input values:</p><div class="latex-rendered" 
data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\nh_0 = \\text{hi}(\\text{prod}_0) \\oplus c_1 \\oplus k_0 \\\\\n\nh_1 = \\text{hi}(\\text{prod}_1) \\oplus c_3 \\oplus k_1\n\\end{align}\n&quot;,&quot;id&quot;:&quot;BXVSGVGXMN&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, <em>k</em>0&#8203; and <em>k</em>1&#8203; are the key values (derived from the seed), and &#8853; represents the XOR operation.</p><h4><strong>Step 3: Permutation</strong></h4><p>Finally, we rearrange the values for the next round. The output of one round becomes:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;(c_0&#8217;, c_1&#8217;, c_2&#8217;, c_3&#8217;) = (\\text{lo}(\\text{prod}_0), h_1, \\text{lo}(\\text{prod}_1), h_0)&quot;,&quot;id&quot;:&quot;YKICCLBFGZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Notice how the values are shuffled: the low parts of the products go to positions 0 and 2, while the XORed high parts are swapped and go to positions 1 and 3.</p><h4><strong>Multiple Rounds</strong></h4><p>To achieve good randomness, Philox-4x32 typically applies <strong>10 rounds</strong> of this transformation. After each round except the last, the keys are also updated:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\nk_0&#8217; = k_0 + w_0 \\\\\n\nk_1&#8217; = k_1 + w_1\n\n\\end{align}&quot;,&quot;id&quot;:&quot;EBSFTVKMRD&quot;}" data-component-name="LatexBlockToDOM"></div><p>Where <em>w</em>0&#8203;=0x9E3779B9 and <em>w</em>1&#8203;=0xBB67AE85 are the &#8220;<a href="https://en.wikipedia.org/wiki/Weyl_sequence">Weyl sequence</a>&#8221; constants derived from the golden ratio. 
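</p><p>Putting the round function and key schedule together, here is a compact Python sketch of 10-round Philox-4x32. It follows the equations in this article literally; the reference Random123 and PyTorch implementations arrange the same operations slightly differently, so treat this as an illustration rather than a drop-in replacement:</p>

```python
M0, M1 = 0xD2511F53, 0xCD9E8D57  # round multipliers
W0, W1 = 0x9E3779B9, 0xBB67AE85  # Weyl key-schedule constants
MASK = 0xFFFFFFFF                # keep values to 32 bits

def philox_round(ctr, key):
    c0, c1, c2, c3 = ctr
    k0, k1 = key
    prod0 = M0 * c0  # 64-bit products
    prod1 = M1 * c2
    h0 = (prod0 >> 32) ^ c1 ^ k0  # XOR high halves with inputs and keys
    h1 = (prod1 >> 32) ^ c3 ^ k1
    # Permute: low parts go to positions 0 and 2, XORed highs swap to 1 and 3
    return (prod0 & MASK, h1, prod1 & MASK, h0)

def philox4x32(ctr, key, rounds=10):
    for _ in range(rounds):
        ctr = philox_round(ctr, key)
        # Bump the keys with the Weyl constants (the final bump is unused,
        # so this matches "after each round except the last")
        key = ((key[0] + W0) & MASK, (key[1] + W1) & MASK)
    return ctr  # four 32-bit pseudorandom integers
```

<p>Because the output depends only on the counter and key, two calls with the same inputs always agree, and any thread can evaluate any counter independently.</p><p>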
This ensures that each round uses different key material, increasing the mixing of the input bits.</p><h3>Visualizing a Complete Philox Transformation</h3><p>The following diagram shows the complete flow through multiple rounds:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WoQ6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e565a9-aa32-4665-980d-09e524512bf3_515x980.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WoQ6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e565a9-aa32-4665-980d-09e524512bf3_515x980.png 424w, https://substackcdn.com/image/fetch/$s_!WoQ6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e565a9-aa32-4665-980d-09e524512bf3_515x980.png 848w, https://substackcdn.com/image/fetch/$s_!WoQ6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e565a9-aa32-4665-980d-09e524512bf3_515x980.png 1272w, https://substackcdn.com/image/fetch/$s_!WoQ6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e565a9-aa32-4665-980d-09e524512bf3_515x980.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WoQ6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e565a9-aa32-4665-980d-09e524512bf3_515x980.png" width="515" height="980" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/83e565a9-aa32-4665-980d-09e524512bf3_515x980.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:980,&quot;width&quot;:515,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:95821,&quot;alt&quot;:&quot;The complete Philox transformation across multiple rounds producing four 32-bit random integers&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/181426624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e565a9-aa32-4665-980d-09e524512bf3_515x980.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The complete Philox transformation across multiple rounds producing four 32-bit random integers" title="The complete Philox transformation across multiple rounds producing four 32-bit random integers" srcset="https://substackcdn.com/image/fetch/$s_!WoQ6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e565a9-aa32-4665-980d-09e524512bf3_515x980.png 424w, https://substackcdn.com/image/fetch/$s_!WoQ6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e565a9-aa32-4665-980d-09e524512bf3_515x980.png 848w, https://substackcdn.com/image/fetch/$s_!WoQ6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e565a9-aa32-4665-980d-09e524512bf3_515x980.png 1272w, 
https://substackcdn.com/image/fetch/$s_!WoQ6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e565a9-aa32-4665-980d-09e524512bf3_515x980.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The complete Philox transformation across multiple rounds producing four 32-bit random integers</figcaption></figure></div><h3>Why This Works</h3><p>The Philox algorithm achieves good randomness through several mechanisms:</p><ol><li><p><strong>Multiplication</strong> is a non-linear operation that mixes bits effectively. 
Small changes in input lead to large changes in output.</p></li><li><p><strong>High-low splitting</strong> ensures we use all 64 bits of the multiplication result, not just the lower 32 bits.</p></li><li><p><strong>XOR operations</strong> combine different data streams (keys, previous values) in a way that&#8217;s invertible but unpredictable without knowing the key.</p></li><li><p><strong>Permutation</strong> ensures that the mixing effect propagates to all output positions across rounds.</p></li><li><p><strong>Multiple rounds</strong> compound these effects, ensuring that every output bit depends on every input bit in a complex way.</p></li></ol><p>The algorithm has been extensively tested and passes standard statistical tests for randomness such as the TestU01 suite, making it suitable for scientific computing and machine learning applications.</p><h3>Properties of Philox</h3><p>Before we dive into PyTorch&#8217;s implementation, let&#8217;s summarize the key properties that make Philox attractive:</p><ul><li><p><strong>Parallel-friendly</strong>: A GPU with thousands of cores can generate thousands of random numbers simultaneously, each using a different counter value.</p></li><li><p><strong>Deterministic</strong>: Given the same seed and counter, you always get the same output.</p></li><li><p><strong>Long period</strong>: With a 128-bit counter, you can generate 2^128 random numbers before the sequence repeats, which is more than enough for any practical application.</p></li><li><p><strong>Fast</strong>: The operations (multiplication, XOR, bit shifting) are cheap primitives that run very efficiently on modern CPUs and GPUs.</p></li><li><p><strong>Memory efficient</strong>: The generator state is just the counter and key, requiring minimal storage per thread.</p></li></ul><p>Next, let&#8217;s understand how Philox can be parallelized.</p><div><hr></div><h2>Parallelizing Philox: Subsequences and Offsets</h2><p>Now that we understand how the Philox algorithm 
works, let&#8217;s explore what makes it particularly powerful for parallel computing: the ability to generate random numbers across thousands of threads simultaneously without any coordination.</p><h3>The Random Number Space</h3><p>Recall that Philox is a counter-based PRNG. At its core, it&#8217;s a function that maps a 128-bit counter to a 128-bit random output:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Philox}(\\text{counter}, \\text{key}) \\rightarrow \\text{random_output}&quot;,&quot;id&quot;:&quot;KUAOYTFBMM&quot;}" data-component-name="LatexBlockToDOM"></div><p>Given a fixed key (derived from the seed), each unique counter value produces a unique set of random numbers. Since we have a 128-bit counter, we have:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;2^{128} \\approx 3.4 \\times 10^{38} \\text{ possible counter values}&quot;,&quot;id&quot;:&quot;HGAJILCZMR&quot;}" data-component-name="LatexBlockToDOM"></div><p>Each counter value produces 4 random 32-bit numbers (since 128 bits = 4 &#215; 32 bits), giving us an enormous space of random numbers. We can visualize this as a huge one-dimensional array:</p><pre><code><code>
Counter: 0 1 2 3 ... 2^128-1

&#8595; &#8595; &#8595; &#8595; &#8595;

Output: [r&#8320;,r&#8321;,r&#8322;,r&#8323;][r&#8324;,r&#8325;,r&#8326;,r&#8327;][r&#8328;,r&#8329;,r&#8321;&#8320;,r&#8321;&#8321;][r&#8321;&#8322;,...]...[...]
</code></code></pre><p>How do we partition this massive space across parallel threads? One approach is to split the counter space between the threads.</p><h3>Partitioning the Counter Space</h3><p>The key insight is that we can split the 128-bit counter into two parts and use them to create a 2D address space. Think of the counter as having 4 components of 32 bits each: (<em>c</em>0&#8203;,<em>c</em>1&#8203;,<em>c</em>2&#8203;,<em>c</em>3&#8203;).</p><p>We can partition this as:</p><ul><li><p><strong>Upper 64 bits</strong>: Which thread&#8217;s region we&#8217;re in</p></li><li><p><strong>Lower 64 bits</strong>: The position within a thread&#8217;s assigned region</p></li></ul><p>This partitioning scheme gives each thread its own &#8220;slice&#8221; of the random number space:</p><ul><li><p><strong>Thread 0</strong> gets counters: (&#8727;,&#8727;,0,0) where each &#8727; can be any value</p></li><li><p>counter = (0,0,0,0) &#8594; first 4 random numbers for thread 0</p></li><li><p>counter = (1,0,0,0) &#8594; next 4 random numbers for thread 0</p></li><li><p>counter = (2,0,0,0) &#8594; next 4 random numbers for thread 0</p></li><li><p>&#8230;</p></li><li><p><strong>Thread 1</strong> gets counters: (&#8727;,&#8727;,1,0)</p></li><li><p>counter = (0,0,1,0) &#8594; first 4 random numbers for thread 1</p></li><li><p>counter = (1,0,1,0) &#8594; next 4 random numbers for thread 1</p></li><li><p>counter = (2,0,1,0) &#8594; next 4 random numbers for thread 1</p></li><li><p>&#8230;</p></li><li><p><strong>Thread 2</strong> gets counters: (&#8727;,&#8727;,2,0)</p></li><li><p>counter = (0,0,2,0) &#8594; first 4 random numbers for thread 2</p></li><li><p>And so on&#8230;</p></li></ul><h3>Terminology: Subsequence and Offset</h3><p>We now give names to these two parts:</p><p><strong>Subsequence</strong>: The upper 64 bits of the counter. This identifies which parallel thread or stream we&#8217;re referring to. 
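</p><p>As a small sketch of how these two halves assemble into the four 32-bit counter words (following the (c0, c1, c2, c3) ordering used in the listings above; the helper name is ours, not PyTorch&#8217;s):</p>

```python
def make_counter(subsequence, offset):
    """Build the 128-bit Philox counter as four 32-bit words:
    the 64-bit offset fills (c0, c1), the 64-bit subsequence (c2, c3)."""
    return (offset & 0xFFFFFFFF, (offset >> 32) & 0xFFFFFFFF,
            subsequence & 0xFFFFFFFF, (subsequence >> 32) & 0xFFFFFFFF)
```

<p>For example, <code>make_counter(1, 0)</code> returns <code>(0, 0, 1, 0)</code>, the first counter of thread 1 in the listing above.</p><p>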
We can have up to 2^64 different subsequences running in parallel.</p><p><strong>Offset</strong>: The lower 64 bits of the counter. This identifies the position within a subsequence. Each subsequence can generate up to 2^64 sets of random numbers.</p><p>Together, they form a coordinate system (<em>s</em>,<em>o</em>) where:</p><ul><li><p><em>s</em> is the subsequence (which parallel stream)</p></li><li><p><em>o</em> is the offset (position in that stream)</p></li></ul><p>The total capacity is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;2^{64} \\text{ subsequences} \\times 2^{64} \\text{ offsets per subsequence} = 2^{128} \\text{ total positions}\n\n&quot;,&quot;id&quot;:&quot;THGEKREXMA&quot;}" data-component-name="LatexBlockToDOM"></div><p>This matches exactly the size of our original counter space; we&#8217;ve simply reorganized it into a 2D structure that&#8217;s easy to partition across threads.</p><h3>How Offsets Increment</h3><p>When a thread generates more random numbers, it increments the offset portion of the counter. Since Philox generates 4 random numbers at once, we typically increment by 1 each time (remembering that each offset value produces 4 numbers):</p><pre><code><code>
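Counter layout used below (per the partitioning above): [c&#8320;, c&#8321;, c&#8322;, c&#8323;], offset in (c&#8320;, c&#8321;), subsequence in (c&#8322;, c&#8323;)
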
Thread 0 subsequence = 0:

offset=0: counter=[0,0,0,0] &#8594; Philox &#8594; [rand&#8320;, rand&#8321;, rand&#8322;, rand&#8323;]

offset=1: counter=[1,0,0,0] &#8594; Philox &#8594; [rand&#8324;, rand&#8325;, rand&#8326;, rand&#8327;]

offset=2: counter=[2,0,0,0] &#8594; Philox &#8594; [rand&#8328;, rand&#8329;, rand&#8321;&#8320;, rand&#8321;&#8321;]

...
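
In general: offset = n &#8594; counter = [n, 0, 0, 0] &#8594; random numbers 4n through 4n+3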
</code></code></pre><p>The offset is really tracking &#8220;which batch of 4&#8221; we&#8217;re on. If we need the 10th random number (index 9, counting from 0):</p><ul><li><p>Offset = &#8970;9/4&#8971; = 2</p></li><li><p>Position within batch = 9 mod 4 = 1</p></li><li><p>So we use counter [2,0,0,0] and take the second output (index 1)</p></li></ul><h3>The Power of Skip-Ahead</h3><p>One powerful consequence of this design is <strong>skip-ahead</strong>: a thread can jump directly to any offset without computing intermediate values.</p><pre><code><code>
Thread 0:

- Jump to offset 1,000,000: counter = [1000000, 0, 0, 0]

- Generate random numbers at this position

- Jump to offset 5,000,000: counter = [5000000, 0, 0, 0]

- No need to compute offsets 1 through 4,999,999!
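
Worked example: to read the random number at index 10,000,000 (counting from 0):

- offset = &#8970;10000000/4&#8971; = 2500000, position within batch = 10000000 mod 4 = 0

- set counter = [2500000, 0, 0, 0], run Philox once, take output index 0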

</code></code></pre><p>This is impossible with traditional sequential PRNGs where state <em>n</em>+1 depends on state <em>n</em>.</p><h3>Setting Up for PyTorch</h3><p>Now that we understand how the counter space is partitioned, we can see how PyTorch uses this:</p><p>When PyTorch generates random numbers on a GPU:</p><ol><li><p>It launches many threads (e.g., 1024 threads)</p></li><li><p>Each thread is assigned a unique <strong>subsequence</strong> number (typically its thread ID)</p></li><li><p>Each thread starts at <strong>offset</strong> 0 within its subsequence</p></li><li><p>As each thread generates random numbers, it increments its offset</p></li><li><p>PyTorch tracks the global offset to ensure future operations don&#8217;t reuse the same counters</p></li></ol><p>With this foundation, let&#8217;s now explore how PyTorch implements these concepts in its Philox engine.</p><div><hr></div><h2>Philox Implementation in PyTorch</h2><p>PyTorch uses Philox-4x32-10 (4 values of 32 bits, 10 rounds) as its primary PRNG for CUDA operations. The implementation lives in <a href="https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/core/PhiloxRNGEngine.h">aten/src/ATen/core/PhiloxRNGEngine.h</a> and is designed to work on both CPU and GPU (via CUDA). Let&#8217;s dissect this implementation to understand how the theoretical concepts we discussed earlier translate into actual code.</p><h3>Core Data Structures</h3><p>The implementation starts by defining some type aliases for clarity:</p><pre><code><code>
typedef std::array&lt;uint32_t, 4&gt; UINT4; // Four 32-bit integers

typedef std::array&lt;uint32_t, 2&gt; UINT2; // Two 32-bit integers

typedef std::array&lt;double, 2&gt; DOUBLE2; // Two doubles

typedef std::array&lt;float, 2&gt; FLOAT2; // Two floats

</code></code></pre><p>These typedefs make the code more readable. <code>UINT4</code> represents the 128-bit counter or output (4 &#215; 32 bits = 128 bits), while <code>UINT2</code> represents the 64-bit key (2 &#215; 32 bits = 64 bits).</p><h3>The PhiloxEngine Class Structure</h3><p>The <code>philox_engine</code> class maintains four critical pieces of state:</p><pre><code><code>
private:

detail::UINT4 counter_; // 128-bit counter (c&#8320;, c&#8321;, c&#8322;, c&#8323;)
detail::UINT4 output_; // Cached output from last round
detail::UINT2 key_; // 64-bit key derived from seed (k&#8320;, k&#8321;)
uint32_t STATE; // Position in current output (0-3)
</code></code></pre><p>Let&#8217;s understand each field:</p><p><code>counter_</code>: This is the 128-bit counter that gets incremented and transformed through the Philox rounds. It&#8217;s divided into four 32-bit components:</p><ul><li><p><code>counter_[0]</code> and <code>counter_[1]</code>: Lower 64 bits represent the <strong>offset</strong> (which random number in the subsequence)</p></li><li><p><code>counter_[2]</code> and <code>counter_[3]</code>: Upper 64 bits represent the <strong>subsequence</strong> (which parallel stream)</p></li></ul><p><code>key_</code>: The 64-bit key derived from the seed. This remains constant for a given seed and is used in the XOR operations during each round.</p><p><code>output_</code>: Philox generates 4 random 32-bit numbers at once. This field caches those numbers so we don&#8217;t have to recompute them for every call.</p><p><code>STATE</code>: A simple counter (0-3) that tracks which of the four cached output values to return next. This is an optimization to avoid regenerating when we have unused random numbers.</p><h3>Initialization and State Management</h3><p>The constructor initializes the engine with a seed, subsequence, and offset:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Tizd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b98853b-f9b0-4e54-9d41-0b8c01ca24f6_888x220.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Tizd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b98853b-f9b0-4e54-9d41-0b8c01ca24f6_888x220.png 424w, 
https://substackcdn.com/image/fetch/$s_!Tizd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b98853b-f9b0-4e54-9d41-0b8c01ca24f6_888x220.png 848w, https://substackcdn.com/image/fetch/$s_!Tizd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b98853b-f9b0-4e54-9d41-0b8c01ca24f6_888x220.png 1272w, https://substackcdn.com/image/fetch/$s_!Tizd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b98853b-f9b0-4e54-9d41-0b8c01ca24f6_888x220.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Tizd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b98853b-f9b0-4e54-9d41-0b8c01ca24f6_888x220.png" width="888" height="220" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3b98853b-f9b0-4e54-9d41-0b8c01ca24f6_888x220.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:220,&quot;width&quot;:888,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:34316,&quot;alt&quot;:&quot;The philox_engine constructor definition&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/181426624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b98853b-f9b0-4e54-9d41-0b8c01ca24f6_888x220.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The philox_engine constructor definition" title="The philox_engine constructor definition" 
srcset="https://substackcdn.com/image/fetch/$s_!Tizd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b98853b-f9b0-4e54-9d41-0b8c01ca24f6_888x220.png 424w, https://substackcdn.com/image/fetch/$s_!Tizd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b98853b-f9b0-4e54-9d41-0b8c01ca24f6_888x220.png 848w, https://substackcdn.com/image/fetch/$s_!Tizd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b98853b-f9b0-4e54-9d41-0b8c01ca24f6_888x220.png 1272w, https://substackcdn.com/image/fetch/$s_!Tizd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b98853b-f9b0-4e54-9d41-0b8c01ca24f6_888x220.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The philox_engine constructor definition</figcaption></figure></div><p>The <code>C10_HOST_DEVICE</code> macro is crucial here: it tells the compiler that this function can run on both the CPU (host) and GPU (device). 
This allows the same code to be used in both contexts.</p><p>Let&#8217;s look at how <code>reset_state</code> sets up the initial state:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Us5U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a2d726-2b10-4d76-aacc-d1d764a78467_1302x245.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Us5U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a2d726-2b10-4d76-aacc-d1d764a78467_1302x245.png 424w, https://substackcdn.com/image/fetch/$s_!Us5U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a2d726-2b10-4d76-aacc-d1d764a78467_1302x245.png 848w, https://substackcdn.com/image/fetch/$s_!Us5U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a2d726-2b10-4d76-aacc-d1d764a78467_1302x245.png 1272w, https://substackcdn.com/image/fetch/$s_!Us5U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a2d726-2b10-4d76-aacc-d1d764a78467_1302x245.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Us5U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a2d726-2b10-4d76-aacc-d1d764a78467_1302x245.png" width="1302" height="245" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e1a2d726-2b10-4d76-aacc-d1d764a78467_1302x245.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:245,&quot;width&quot;:1302,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:90887,&quot;alt&quot;:&quot;The reset_state function that resets the state of the philox_engine&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/181426624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a2d726-2b10-4d76-aacc-d1d764a78467_1302x245.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The reset_state function that resets the state of the philox_engine" title="The reset_state function that resets the state of the philox_engine" srcset="https://substackcdn.com/image/fetch/$s_!Us5U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a2d726-2b10-4d76-aacc-d1d764a78467_1302x245.png 424w, https://substackcdn.com/image/fetch/$s_!Us5U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a2d726-2b10-4d76-aacc-d1d764a78467_1302x245.png 848w, https://substackcdn.com/image/fetch/$s_!Us5U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a2d726-2b10-4d76-aacc-d1d764a78467_1302x245.png 1272w, https://substackcdn.com/image/fetch/$s_!Us5U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a2d726-2b10-4d76-aacc-d1d764a78467_1302x245.png 1456w" sizes="100vw" 
loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The reset_state function that resets the state of the philox_engine</figcaption></figure></div><p>This initialization strategy is clever:</p><ol><li><p>The <strong>seed</strong> is split into the two key components <code>key_[0]</code> and <code>key_[1]</code></p></li><li><p>The <strong>subsequence</strong> goes into the upper half of the counter (<code>counter_[2]</code> and <code>counter_[3]</code>)</p></li><li><p>The <strong>offset</strong> (lower half of counter) starts at zero but can be set later via <code>incr_n(offset)</code></p></li></ol><p>This design allows for massive parallelism. Imagine running 1024 CUDA threads simultaneously:</p><pre><code><code>
Thread 0: subsequence=0, offset=0 &#8594; counter = [0, 0, 0, 0]

Thread 1: subsequence=1, offset=0 &#8594; counter = [0, 0, 1, 0]

Thread 2: subsequence=2, offset=0 &#8594; counter = [0, 0, 2, 0]

...

Thread 1023: subsequence=1023, offset=0 &#8594; counter = [0, 0, 1023, 0]
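
After each thread consumes its first batch of 4 numbers, only its offset half advances:

Thread 0: counter = [1, 0, 0, 0]

Thread 1: counter = [1, 0, 1, 0]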

</code></code></pre><p>Each thread has a unique counter value from the start, so they all generate independent random sequences without any coordination.</p><h3>The Core Algorithm: Single Round</h3><p>Now let&#8217;s examine the heart of the Philox algorithm&#8212;the <code>single_round</code> function:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VusU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd1ca64-b049-4283-a360-82fd0bd2bc36_1031x320.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VusU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd1ca64-b049-4283-a360-82fd0bd2bc36_1031x320.png 424w, https://substackcdn.com/image/fetch/$s_!VusU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd1ca64-b049-4283-a360-82fd0bd2bc36_1031x320.png 848w, https://substackcdn.com/image/fetch/$s_!VusU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd1ca64-b049-4283-a360-82fd0bd2bc36_1031x320.png 1272w, https://substackcdn.com/image/fetch/$s_!VusU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd1ca64-b049-4283-a360-82fd0bd2bc36_1031x320.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VusU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd1ca64-b049-4283-a360-82fd0bd2bc36_1031x320.png" width="1031" height="320" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3bd1ca64-b049-4283-a360-82fd0bd2bc36_1031x320.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:320,&quot;width&quot;:1031,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:63721,&quot;alt&quot;:&quot;The single_round function that implements one round of Philox&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/181426624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd1ca64-b049-4283-a360-82fd0bd2bc36_1031x320.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The single_round function that implements one round of Philox" title="The single_round function that implements one round of Philox" srcset="https://substackcdn.com/image/fetch/$s_!VusU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd1ca64-b049-4283-a360-82fd0bd2bc36_1031x320.png 424w, https://substackcdn.com/image/fetch/$s_!VusU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd1ca64-b049-4283-a360-82fd0bd2bc36_1031x320.png 848w, https://substackcdn.com/image/fetch/$s_!VusU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd1ca64-b049-4283-a360-82fd0bd2bc36_1031x320.png 1272w, https://substackcdn.com/image/fetch/$s_!VusU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd1ca64-b049-4283-a360-82fd0bd2bc36_1031x320.png 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"></div></div></a><figcaption class="image-caption">The single_round function that implements one round of Philox</figcaption></figure></div><p>Let&#8217;s break this down step by step, mapping it to our earlier theoretical description:</p><h4><strong>Step 1: Multiply and Split</strong></h4><pre><code><code>uint32_t lo0 = mulhilo32(kPhiloxSA, ctr[0], &amp;hi0);
uint32_t lo1 = mulhilo32(kPhiloxSB, ctr[2], &amp;hi1);</code></code></pre><p>Here we multiply:</p><ul><li><p><code>ctr[0]</code> by <code>kPhiloxSA</code> (the constant 0xD2511F53)</p></li><li><p><code>ctr[2]</code> by <code>kPhiloxSB</code> (the constant 0xCD9E8D57)</p></li></ul><p>The <code>mulhilo32</code> function performs the multiplication and splits the 64-bit result:</p><ul><li><p>Returns the low 32 bits (<code>lo0</code> or <code>lo1</code>)</p></li><li><p>Stores the high 32 bits in the passed pointer (<code>hi0</code> or <code>hi1</code>)</p></li></ul><p>Let&#8217;s look at <code>mulhilo32</code> itself:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gxWg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03b1c24-95c0-4332-a1b9-3b8a70684da8_756x320.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gxWg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03b1c24-95c0-4332-a1b9-3b8a70684da8_756x320.png 424w, https://substackcdn.com/image/fetch/$s_!gxWg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03b1c24-95c0-4332-a1b9-3b8a70684da8_756x320.png 848w, https://substackcdn.com/image/fetch/$s_!gxWg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03b1c24-95c0-4332-a1b9-3b8a70684da8_756x320.png 1272w, https://substackcdn.com/image/fetch/$s_!gxWg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03b1c24-95c0-4332-a1b9-3b8a70684da8_756x320.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!gxWg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03b1c24-95c0-4332-a1b9-3b8a70684da8_756x320.png" width="756" height="320" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a03b1c24-95c0-4332-a1b9-3b8a70684da8_756x320.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:320,&quot;width&quot;:756,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:55814,&quot;alt&quot;:&quot;The definition of the mulhilo32 function&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/181426624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03b1c24-95c0-4332-a1b9-3b8a70684da8_756x320.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The definition of the mulhilo32 function" title="The definition of the mulhilo32 function" srcset="https://substackcdn.com/image/fetch/$s_!gxWg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03b1c24-95c0-4332-a1b9-3b8a70684da8_756x320.png 424w, https://substackcdn.com/image/fetch/$s_!gxWg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03b1c24-95c0-4332-a1b9-3b8a70684da8_756x320.png 848w, https://substackcdn.com/image/fetch/$s_!gxWg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03b1c24-95c0-4332-a1b9-3b8a70684da8_756x320.png 1272w, 
https://substackcdn.com/image/fetch/$s_!gxWg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03b1c24-95c0-4332-a1b9-3b8a70684da8_756x320.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The definition of the mulhilo32 function</figcaption></figure></div><p>This function has two implementations:</p><p><strong>On CUDA (GPU)</strong>: Uses the intrinsic <code>__umulhi</code> which directly computes the high 32 bits of a multiplication. 
This is extremely fast on GPU hardware.</p><p><strong>On CPU</strong>: Promotes both operands to 64 bits, multiplies them, then extracts high and low parts manually via shifting and casting.</p><p>Here&#8217;s what happens mathematically:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\n\\text{prod}_0 &amp;= \\text{kPhiloxSA} \\times \\text{ctr}[0] = \\text{0xD2511F53} \\times \\text{ctr}[0] \\\\\n\n\\text{lo}_0 &amp;= \\text{prod}_0\\text{ } \\And \\text{ 0xFFFFFFFF} \\quad \\text{(lower 32 bits)} \\\\\n\n\\text{hi}_0 &amp;= \\text{prod}_0 \\gg 32 \\quad \\text{(upper 32 bits)}\n\n\\end{align}\n\n&quot;,&quot;id&quot;:&quot;QLOQSBLHON&quot;}" data-component-name="LatexBlockToDOM"></div><h4><strong>Step 2: XOR and Permute</strong></h4><pre><code><code>ret[0] = hi1 ^ ctr[1] ^ in_key[0];
ret[1] = lo1;
ret[2] = hi0 ^ ctr[3] ^ in_key[1];
ret[3] = lo0;</code></code></pre><p>Notice the pattern:</p><ul><li><p><code>ret[0]</code>: Takes <code>hi1</code> (high bits from second multiplication), XORs with <code>ctr[1]</code> and <code>in_key[0]</code></p></li><li><p><code>ret[1]</code>: Simply uses <code>lo1</code> (low bits from second multiplication)</p></li><li><p><code>ret[2]</code>: Takes <code>hi0</code> (high bits from first multiplication), XORs with <code>ctr[3]</code> and <code>in_key[1]</code></p></li><li><p><code>ret[3]</code>: Simply uses <code>lo0</code> (low bits from first multiplication)</p></li></ul><p>Let us visualize this transformation:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9Tmh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2715325-738d-4d65-b2cc-0b77958801b2_679x595.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9Tmh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2715325-738d-4d65-b2cc-0b77958801b2_679x595.png 424w, https://substackcdn.com/image/fetch/$s_!9Tmh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2715325-738d-4d65-b2cc-0b77958801b2_679x595.png 848w, https://substackcdn.com/image/fetch/$s_!9Tmh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2715325-738d-4d65-b2cc-0b77958801b2_679x595.png 1272w, https://substackcdn.com/image/fetch/$s_!9Tmh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2715325-738d-4d65-b2cc-0b77958801b2_679x595.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!9Tmh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2715325-738d-4d65-b2cc-0b77958801b2_679x595.png" width="679" height="595" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a2715325-738d-4d65-b2cc-0b77958801b2_679x595.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:595,&quot;width&quot;:679,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:46641,&quot;alt&quot;:&quot;Visualization of the operations performed during a single round of Philox&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/181426624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2715325-738d-4d65-b2cc-0b77958801b2_679x595.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Visualization of the operations performed during a single round of Philox" title="Visualization of the operations performed during a single round of Philox" srcset="https://substackcdn.com/image/fetch/$s_!9Tmh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2715325-738d-4d65-b2cc-0b77958801b2_679x595.png 424w, https://substackcdn.com/image/fetch/$s_!9Tmh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2715325-738d-4d65-b2cc-0b77958801b2_679x595.png 848w, https://substackcdn.com/image/fetch/$s_!9Tmh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2715325-738d-4d65-b2cc-0b77958801b2_679x595.png 1272w, 
https://substackcdn.com/image/fetch/$s_!9Tmh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2715325-738d-4d65-b2cc-0b77958801b2_679x595.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Visualization of the operations performed during a single round of Philox</figcaption></figure></div><p>This permutation ensures that bits from different positions get mixed together in subsequent rounds.</p><h3>Constants: The Magic Numbers</h3><p>You might wonder where these constants come from:</p><pre><code><code>
static const uint32_t kPhilox10A = 0x9E3779B9; // Weyl sequence
static const uint32_t kPhilox10B = 0xBB67AE85; // Weyl sequence
static const uint32_t kPhiloxSA = 0xD2511F53; // Multiplier
static const uint32_t kPhiloxSB = 0xCD9E8D57; // Multiplier
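
// Since the code listings in this article are images, here is a compact,
// illustrative sketch of one Philox-4x32 round built on the two multiplier
// constants above. It follows the round structure described in the text
// (multiply and split, XOR with the key, permute the lanes); the names
// philox4, u32pair, and mulhilo32 are ours, not PyTorch's exact code.

struct philox4 { uint32_t v[4]; };
struct u32pair { uint32_t hi, lo; };

// 32x32 -> 64-bit multiply, split into high and low 32-bit halves.
static inline u32pair mulhilo32(uint32_t a, uint32_t b) {
    uint64_t product = (uint64_t)a * b;
    u32pair r;
    r.hi = (uint32_t)(product >> 32);
    r.lo = (uint32_t)product;
    return r;
}

static inline philox4 philox_single_round(philox4 ctr, uint32_t k0, uint32_t k1) {
    u32pair p0 = mulhilo32(kPhiloxSA, ctr.v[0]);
    u32pair p1 = mulhilo32(kPhiloxSB, ctr.v[2]);
    philox4 out;
    out.v[0] = p1.hi ^ ctr.v[1] ^ k0;  // high half mixed with key word 0
    out.v[1] = p1.lo;
    out.v[2] = p0.hi ^ ctr.v[3] ^ k1;  // high half mixed with key word 1
    out.v[3] = p0.lo;
    return out;
}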

</code></code></pre><p><strong>Weyl sequence constants</strong> (<code>kPhilox10A</code> and <code>kPhilox10B</code>): These are the 32-bit fractional parts of two irrational numbers, the golden ratio and the square root of 3:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\n\\text{kPhilox10A} &amp;= \\lfloor 2^{32} / \\phi \\rfloor = \\text{0x9E3779B9} \\\\\n\n\\text{kPhilox10B} &amp;= \\lfloor 2^{32}\\,(\\sqrt{3} - 1) \\rfloor = \\text{0xBB67AE85}\n\n\\end{align}\n\n&quot;,&quot;id&quot;:&quot;ZZGYZVPQKU&quot;}" data-component-name="LatexBlockToDOM"></div><p>Irrational constants like these distribute the key increments uniformly, which is exactly what a Weyl sequence needs. These constants are added to the key after each round to ensure different key material is used.</p><p><strong>Multiplier constants</strong> (<code>kPhiloxSA</code> and <code>kPhiloxSB</code>): These were carefully chosen through empirical testing to maximize statistical quality. They need to have good bit-mixing properties when multiplied with typical counter values.</p><h3>Running Multiple Rounds</h3><p>The <code>rand</code> function orchestrates running all rounds:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4l43!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7760085-67fc-4b51-ace0-ba4dde21f671_1207x220.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4l43!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7760085-67fc-4b51-ace0-ba4dde21f671_1207x220.png 424w, https://substackcdn.com/image/fetch/$s_!4l43!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7760085-67fc-4b51-ace0-ba4dde21f671_1207x220.png 848w, 
https://substackcdn.com/image/fetch/$s_!4l43!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7760085-67fc-4b51-ace0-ba4dde21f671_1207x220.png 1272w, https://substackcdn.com/image/fetch/$s_!4l43!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7760085-67fc-4b51-ace0-ba4dde21f671_1207x220.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4l43!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7760085-67fc-4b51-ace0-ba4dde21f671_1207x220.png" width="1207" height="220" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f7760085-67fc-4b51-ace0-ba4dde21f671_1207x220.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:220,&quot;width&quot;:1207,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:44022,&quot;alt&quot;:&quot;Definition of the rand function that applies multiple rounds of Philox to produce random numbers&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/181426624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7760085-67fc-4b51-ace0-ba4dde21f671_1207x220.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Definition of the rand function that applies multiple rounds of Philox to produce random numbers" title="Definition of the rand function that applies multiple rounds of Philox to produce random numbers" 
srcset="https://substackcdn.com/image/fetch/$s_!4l43!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7760085-67fc-4b51-ace0-ba4dde21f671_1207x220.png 424w, https://substackcdn.com/image/fetch/$s_!4l43!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7760085-67fc-4b51-ace0-ba4dde21f671_1207x220.png 848w, https://substackcdn.com/image/fetch/$s_!4l43!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7760085-67fc-4b51-ace0-ba4dde21f671_1207x220.png 1272w, https://substackcdn.com/image/fetch/$s_!4l43!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7760085-67fc-4b51-ace0-ba4dde21f671_1207x220.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Definition of the rand function that applies multiple rounds of Philox to produce random numbers</figcaption></figure></div><p>This is straightforward:</p><ol><li><p>Run <code>n_rounds - 1</code> iterations where we:</p><ol><li><p>Apply <code>single_round</code> to transform the counter</p></li><li><p>Update the key by adding the Weyl constants</p></li></ol></li><li><p>Apply one final round without updating the key</p></li></ol><p>By default, PyTorch uses 10 rounds (<code>n_rounds = 10</code>), which provides a good balance between performance and statistical quality.</p><h3>Generating Random Numbers: The Operator</h3><p>The operator <code>()</code> is what users call to get random numbers:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!Q0Tp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9847528b-a753-4441-9f78-af680fa5b649_1141x395.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q0Tp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9847528b-a753-4441-9f78-af680fa5b649_1141x395.png 424w, https://substackcdn.com/image/fetch/$s_!Q0Tp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9847528b-a753-4441-9f78-af680fa5b649_1141x395.png 848w, https://substackcdn.com/image/fetch/$s_!Q0Tp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9847528b-a753-4441-9f78-af680fa5b649_1141x395.png 1272w, https://substackcdn.com/image/fetch/$s_!Q0Tp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9847528b-a753-4441-9f78-af680fa5b649_1141x395.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q0Tp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9847528b-a753-4441-9f78-af680fa5b649_1141x395.png" width="1141" height="395" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9847528b-a753-4441-9f78-af680fa5b649_1141x395.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:395,&quot;width&quot;:1141,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:66418,&quot;alt&quot;:&quot;Definition of the operator() that is called by users to generate random 
numbers&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/181426624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9847528b-a753-4441-9f78-af680fa5b649_1141x395.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Definition of the operator() that is called by users to generate random numbers" title="Definition of the operator() that is called by users to generate random numbers" srcset="https://substackcdn.com/image/fetch/$s_!Q0Tp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9847528b-a753-4441-9f78-af680fa5b649_1141x395.png 424w, https://substackcdn.com/image/fetch/$s_!Q0Tp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9847528b-a753-4441-9f78-af680fa5b649_1141x395.png 848w, https://substackcdn.com/image/fetch/$s_!Q0Tp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9847528b-a753-4441-9f78-af680fa5b649_1141x395.png 1272w, https://substackcdn.com/image/fetch/$s_!Q0Tp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9847528b-a753-4441-9f78-af680fa5b649_1141x395.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Definition of the operator() that is called by users to generate random numbers</figcaption></figure></div><p>This function is clever in its efficiency:</p><p><strong>Check if we need new random numbers</strong>: <code>if(STATE == 0)</code> checks if we&#8217;ve exhausted the previous batch. 
Remember, <code>STATE</code> cycles through 0, 1, 2, 3.</p><p><strong>Generate a batch</strong>: When needed, it:</p><ul><li><p>Runs the full Philox algorithm via <code>rand(counter, key, n_rounds)</code></p></li><li><p>Stores the result in <code>output_</code> (four 32-bit random numbers)</p></li><li><p>Increments the counter for next time via <code>incr()</code></p></li></ul><p><strong>Return next value</strong>: Grab the current position from <code>output_</code>, then advance <code>STATE</code>.</p><p>The line <code>STATE = (STATE + 1) &amp; 3</code> is a bit trick equivalent to <code>STATE = (STATE + 1) % 4</code>, using bitwise AND since 3 is binary <code>11</code>.</p><p>This batching strategy is a significant performance optimization. Instead of running Philox for every random number, we run it once per four random numbers.</p><h3>Counter Increment Logic</h3><p>The counter increment operations deserve special attention because they handle the 128-bit arithmetic correctly. Let&#8217;s start with the simple case:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bN-H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416d677c-6f56-4432-a083-b5eede24fa16_679x345.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bN-H!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416d677c-6f56-4432-a083-b5eede24fa16_679x345.png 424w, https://substackcdn.com/image/fetch/$s_!bN-H!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416d677c-6f56-4432-a083-b5eede24fa16_679x345.png 848w, 
https://substackcdn.com/image/fetch/$s_!bN-H!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416d677c-6f56-4432-a083-b5eede24fa16_679x345.png 1272w, https://substackcdn.com/image/fetch/$s_!bN-H!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416d677c-6f56-4432-a083-b5eede24fa16_679x345.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bN-H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416d677c-6f56-4432-a083-b5eede24fa16_679x345.png" width="679" height="345" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/416d677c-6f56-4432-a083-b5eede24fa16_679x345.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:345,&quot;width&quot;:679,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43264,&quot;alt&quot;:&quot;Definition of the incr function that increments the counter&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/181426624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416d677c-6f56-4432-a083-b5eede24fa16_679x345.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Definition of the incr function that increments the counter" title="Definition of the incr function that increments the counter" srcset="https://substackcdn.com/image/fetch/$s_!bN-H!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416d677c-6f56-4432-a083-b5eede24fa16_679x345.png 424w, 
https://substackcdn.com/image/fetch/$s_!bN-H!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416d677c-6f56-4432-a083-b5eede24fa16_679x345.png 848w, https://substackcdn.com/image/fetch/$s_!bN-H!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416d677c-6f56-4432-a083-b5eede24fa16_679x345.png 1272w, https://substackcdn.com/image/fetch/$s_!bN-H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416d677c-6f56-4432-a083-b5eede24fa16_679x345.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Definition of the incr function that increments the counter</figcaption></figure></div><p>This increments the 128-bit counter by 1. The logic is:</p><ol><li><p>Increment <code>counter_[0]</code> (least significant 32 bits)</p></li><li><p>If it&#8217;s non-zero after increment, we&#8217;re done (no overflow)</p></li><li><p>If it overflowed to zero, carry to <code>counter_[1]</code></p></li><li><p>Continue propagating carries until we find a non-zero result</p></li></ol><p>The more complex function is <code>incr_n</code>, which increments by an arbitrary 64-bit value:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aoqZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff086c0b6-5824-4427-8780-d6cf28397830_690x845.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aoqZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff086c0b6-5824-4427-8780-d6cf28397830_690x845.png 424w, https://substackcdn.com/image/fetch/$s_!aoqZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff086c0b6-5824-4427-8780-d6cf28397830_690x845.png 848w, https://substackcdn.com/image/fetch/$s_!aoqZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff086c0b6-5824-4427-8780-d6cf28397830_690x845.png 1272w, 
https://substackcdn.com/image/fetch/$s_!aoqZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff086c0b6-5824-4427-8780-d6cf28397830_690x845.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aoqZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff086c0b6-5824-4427-8780-d6cf28397830_690x845.png" width="690" height="845" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f086c0b6-5824-4427-8780-d6cf28397830_690x845.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:845,&quot;width&quot;:690,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:125482,&quot;alt&quot;:&quot;Definition of incr_n function that increments the counter by an arbitrary 64-bit value&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/181426624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff086c0b6-5824-4427-8780-d6cf28397830_690x845.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Definition of incr_n function that increments the counter by an arbitrary 64-bit value" title="Definition of incr_n function that increments the counter by an arbitrary 64-bit value" srcset="https://substackcdn.com/image/fetch/$s_!aoqZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff086c0b6-5824-4427-8780-d6cf28397830_690x845.png 424w, 
https://substackcdn.com/image/fetch/$s_!aoqZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff086c0b6-5824-4427-8780-d6cf28397830_690x845.png 848w, https://substackcdn.com/image/fetch/$s_!aoqZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff086c0b6-5824-4427-8780-d6cf28397830_690x845.png 1272w, https://substackcdn.com/image/fetch/$s_!aoqZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff086c0b6-5824-4427-8780-d6cf28397830_690x845.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Definition of incr_n function that increments the counter by an arbitrary 64-bit value</figcaption></figure></div><p>This function is more intricate because it needs to:</p><ol><li><p>Split the 64-bit increment <code>n</code> into <code>nlo</code> and <code>nhi</code></p></li><li><p>Add <code>nlo</code> to <code>counter_[0]</code></p></li><li><p>Detect overflow by checking if <code>counter_[0] &lt; nlo</code> (if the result is less than what we added, overflow occurred)</p></li><li><p>If overflow, increment <code>nhi</code> to carry over</p></li><li><p>Add <code>nhi</code> to <code>counter_[1]</code> and check for overflow again</p></li><li><p>If still overflowing, propagate to the upper 64 bits</p></li></ol><p>The overflow detection <code>counter_[0] &lt; nlo</code> is a standard technique in multi-precision arithmetic. After adding, if the result is less than one of the operands, an overflow must have occurred since we&#8217;re working with unsigned integers.</p><h3>Converting to Floating Point</h3><p>For machine learning applications, we often need floating-point random numbers in the range [0, 1), while Philox gives us integers. 
So, PyTorch applies a conversion function:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wsvE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23aebf6-cf3c-446c-bf27-0947d92c1221_877x145.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wsvE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23aebf6-cf3c-446c-bf27-0947d92c1221_877x145.png 424w, https://substackcdn.com/image/fetch/$s_!wsvE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23aebf6-cf3c-446c-bf27-0947d92c1221_877x145.png 848w, https://substackcdn.com/image/fetch/$s_!wsvE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23aebf6-cf3c-446c-bf27-0947d92c1221_877x145.png 1272w, https://substackcdn.com/image/fetch/$s_!wsvE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23aebf6-cf3c-446c-bf27-0947d92c1221_877x145.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wsvE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23aebf6-cf3c-446c-bf27-0947d92c1221_877x145.png" width="877" height="145" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f23aebf6-cf3c-446c-bf27-0947d92c1221_877x145.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:145,&quot;width&quot;:877,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:39753,&quot;alt&quot;:&quot;Definition 
of the uint32_to_uniform_float function that converts a 32-bit integer to a float value in the range [0,1)&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/181426624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23aebf6-cf3c-446c-bf27-0947d92c1221_877x145.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Definition of the uint32_to_uniform_float function that converts a 32-bit integer to a float value in the range [0,1)" title="Definition of the uint32_to_uniform_float function that converts a 32-bit integer to a float value in the range [0,1)" srcset="https://substackcdn.com/image/fetch/$s_!wsvE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23aebf6-cf3c-446c-bf27-0947d92c1221_877x145.png 424w, https://substackcdn.com/image/fetch/$s_!wsvE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23aebf6-cf3c-446c-bf27-0947d92c1221_877x145.png 848w, https://substackcdn.com/image/fetch/$s_!wsvE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23aebf6-cf3c-446c-bf27-0947d92c1221_877x145.png 1272w, https://substackcdn.com/image/fetch/$s_!wsvE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23aebf6-cf3c-446c-bf27-0947d92c1221_877x145.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Definition of the uint32_to_uniform_float function that converts a 32-bit integer to a float value in the range [0,1)</figcaption></figure></div><p>This function is 
carefully designed:</p><p><strong>Mask off sign bit</strong>: <code>value &amp; 0x7FFFFFFF</code> clears the highest bit, giving us values from 0 to 2^31&#8722;1</p><p><strong>Scale down</strong>: Multiplying by <code>scale = 4.6566127342e-10</code> maps these integers to floats in [0, 1).</p><p>The scale factor is calculated as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{scale} = \\frac{1}{2^{31}} \\approx 4.6566127342 \\times 10^{-10}&quot;,&quot;id&quot;:&quot;CXCXOBPBDG&quot;}" data-component-name="LatexBlockToDOM"></div><p>Why use only 31 bits instead of all 32? Because:</p><ol><li><p>We want only positive values (for [0, 1) range)</p></li><li><p>The highest representable float less than 1.0 needs careful handling</p></li><li><p>Using 31 bits avoids potential rounding issues near 1.0</p></li></ol><h3>Normal Distribution Generation</h3><p>The <code>randn</code> function generates normally distributed random numbers using the Box-Muller transform:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vzbk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2e2b8d-6d46-4c4f-a199-14bb538d230f_1009x419.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vzbk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2e2b8d-6d46-4c4f-a199-14bb538d230f_1009x419.png 424w, https://substackcdn.com/image/fetch/$s_!vzbk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2e2b8d-6d46-4c4f-a199-14bb538d230f_1009x419.png 848w, 
https://substackcdn.com/image/fetch/$s_!vzbk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2e2b8d-6d46-4c4f-a199-14bb538d230f_1009x419.png 1272w, https://substackcdn.com/image/fetch/$s_!vzbk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2e2b8d-6d46-4c4f-a199-14bb538d230f_1009x419.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vzbk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2e2b8d-6d46-4c4f-a199-14bb538d230f_1009x419.png" width="1009" height="419" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ef2e2b8d-6d46-4c4f-a199-14bb538d230f_1009x419.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:419,&quot;width&quot;:1009,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:84602,&quot;alt&quot;:&quot;Definition of the randn function that generates random numbers from a normal distribution&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/181426624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2e2b8d-6d46-4c4f-a199-14bb538d230f_1009x419.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Definition of the randn function that generates random numbers from a normal distribution" title="Definition of the randn function that generates random numbers from a normal distribution" 
srcset="https://substackcdn.com/image/fetch/$s_!vzbk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2e2b8d-6d46-4c4f-a199-14bb538d230f_1009x419.png 424w, https://substackcdn.com/image/fetch/$s_!vzbk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2e2b8d-6d46-4c4f-a199-14bb538d230f_1009x419.png 848w, https://substackcdn.com/image/fetch/$s_!vzbk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2e2b8d-6d46-4c4f-a199-14bb538d230f_1009x419.png 1272w, https://substackcdn.com/image/fetch/$s_!vzbk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2e2b8d-6d46-4c4f-a199-14bb538d230f_1009x419.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Definition of the randn function that generates random numbers from a normal distribution</figcaption></figure></div><p>The <a href="https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform">Box-Muller transform</a> converts two uniform random variables <em>U</em>1&#8203;,<em>U</em>2&#8203;&#8764;Uniform(0,1) into a normal random variable <em>Z</em>&#8764;N(0,1):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Z = \\sqrt{-2 \\ln U_1} \\cos(2\\pi U_2)&quot;,&quot;id&quot;:&quot;KHHARVVYDI&quot;}" data-component-name="LatexBlockToDOM"></div><h3>Memory Layout and Efficiency</h3><p>One of the beauties of this implementation is how compact the state is. Each <code>philox_engine</code> instance requires:</p><pre><code><code>
counter_: 4 &#215; 4 bytes = 16 bytes

output_: 4 &#215; 4 bytes = 16 bytes

key_: 2 &#215; 4 bytes = 8 bytes

STATE: 4 bytes = 4 bytes

Total = 44 bytes</code></code></pre><p>This is tiny! On a GPU, you could have millions of these generators running in parallel, each consuming only 44 bytes. In comparison, traditional RNGs can take kilobytes of state per instance.</p><div><hr></div><h2>Summary</h2><p>In this article, we explored Philox, a counter-based PRNG designed for parallel computing environments. We learned:</p><ol><li><p><strong>Why traditional PRNGs don&#8217;t parallelize well</strong>: Sequential state dependencies create bottlenecks on parallel hardware like GPUs.</p></li><li><p><strong>How Philox works</strong>: By treating random number generation as a function <code>f(counter, key)</code>, Philox allows direct computation of any random number without computing predecessors.</p></li><li><p><strong>The algorithm&#8217;s core operations</strong>: Multiplication with carefully chosen constants, high-low splitting, XOR with key material, and permutation, repeated for 10 rounds to ensure statistical quality.</p></li><li><p><strong>Parallelization through counter partitioning</strong>: The 128-bit counter space is split into subsequences (upper 64 bits) and offsets (lower 64 bits), allowing up to 2^64 parallel threads, each generating 2^64 random numbers.</p></li><li><p><strong>PyTorch&#8217;s implementation</strong>: A compact 44-byte state per engine instance, efficient batching of 4 numbers at a time, and careful handling of counter arithmetic for both CPU and GPU execution.</p></li></ol><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.codingconfessions.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"><em>Articles like this take time and research to get right.
If you&#8217;d like to support more deep dives into CPU internals and performance engineering, you can upgrade to a paid subscription and help keep this work sustainable.</em></p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.codingconfessions.com/p/how-pytorch-generates-random-numbers?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.codingconfessions.com/p/how-pytorch-generates-random-numbers?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p>]]></content:encoded></item><item><title><![CDATA[x86 Addressing Modes, Part 1 — Immediate and Direct Access]]></title><description><![CDATA[The foundations of memory access: static allocation, addressing modes, and the first steps toward low-level thinking.]]></description><link>https://blog.codingconfessions.com/p/x86-addressing-modes-part-1-immediate</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/x86-addressing-modes-part-1-immediate</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Wed, 12 Nov 2025 16:15:23 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/352d073f-214a-42e8-8c75-795aef67a908_1536x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome back to our series on x86 assembly programming. 
If you are new, you can check out the series overview.</em></p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;903595d2-e4a6-487d-8e40-76f7e8dede5c&quot;,&quot;caption&quot;:&quot;Welcome to my ongoing series on x86-64 assembly programming, designed for programmers who want to peel back the abstraction and understand how code really runs at the machine level.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;A Programmer&#8217;s Guide to x86-64 Assembly (Series Overview)&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-07-16T05:14:34.997Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!pFGm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a94b5af-aec2-4b17-b011-5c128c67be8d_1536x1024.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/a-programmers-guide-to-x86-64-assembly&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:168445561,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:10,&quot;comment_count&quot;:2,&quot;publication_id&quot;:1611829,&quot;publication_name&quot;:&quot;Confessions of a Code 
Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!lstI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><p>So far, we have learned the fundamentals of instructions and registers in x86 assembly. But writing real-world programs requires memory access, so we must learn how to deal with memory. If you can master this topic, you level up as a programmer.</p><p>There are two kinds of memory where we can keep our program&#8217;s data: registers and main memory. We have already learned about <a href="https://blog.codingconfessions.com/p/x86-registers">using registers</a>; they are the fastest possible memory units in the hardware. But they are very limited in number, while real-world code needs much more memory than that.</p><p>Apart from that, registers can only hold primitive type values. The integer registers (the 16 general-purpose ones we learned about) handle integers, while separate floating-point registers exist in x86 for floating-point operations. However, we need a way to store and access composite types, such as arrays and structs, which is only possible using main memory.</p><p>Accessing memory in assembly is a big topic: there are several memory addressing modes, and learning to use each of them effectively is crucial for reading and writing assembly code. So, we&#8217;ll split this topic into a multipart series, covering each addressing mode step by step.
In this first part, we will cover the following topics:</p><ul><li><p>Regions of memory in a process&#8217;s address space: stack, heap, and data</p></li><li><p>Immediate addressing mode</p></li><li><p>Direct addressing mode</p></li></ul><p>In future parts, we will cover the following:</p><ul><li><p>Indirect addressing mode</p></li><li><p>Offset-based addressing mode</p></li><li><p>Indexed addressing mode</p></li></ul><p>We&#8217;ll start by understanding how data is organized in memory before we explore addressing modes. Now, let&#8217;s dive in!</p><p><em>I&#8217;m also publishing this in the form of an ebook (PDF). If you don&#8217;t wish to upgrade to a subscription, you can purchase the PDF using the following link. If you are a paid subscriber, you can get it at a discount (monthly subs: 20% and annual subs: 50%). Please email me for the discounted link.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://codingconfessions.gumroad.com/l/ychdk&quot;,&quot;text&quot;:&quot;Purchase PDF&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://codingconfessions.gumroad.com/l/ychdk"><span>Purchase PDF</span></a></p><div><hr></div><h2>Regions of Memory in Process Address Space</h2><p>When programming in high-level languages, you have likely learned about the concept of scope, or the lifetime of a variable. For example, a global variable lives for the duration of the program; local variables are automatically destroyed when the function returns. And you can dynamically allocate memory on the heap that lives until it is freed.</p><p>When programming in assembly, we need similar scopes. However, there is no compiler to help us out, so we must implement them ourselves by storing data in different regions of the process&#8217;s address space.
So, we must start there.</p><p>There are three main regions in the process&#8217;s address space where you can decide to store your program&#8217;s data, as shown in the following diagram.  </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!D0tD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe352c8d9-efc1-4ae2-b388-1bc4dcc8a8b4_1062x586.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D0tD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe352c8d9-efc1-4ae2-b388-1bc4dcc8a8b4_1062x586.png 424w, https://substackcdn.com/image/fetch/$s_!D0tD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe352c8d9-efc1-4ae2-b388-1bc4dcc8a8b4_1062x586.png 848w, https://substackcdn.com/image/fetch/$s_!D0tD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe352c8d9-efc1-4ae2-b388-1bc4dcc8a8b4_1062x586.png 1272w, https://substackcdn.com/image/fetch/$s_!D0tD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe352c8d9-efc1-4ae2-b388-1bc4dcc8a8b4_1062x586.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!D0tD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe352c8d9-efc1-4ae2-b388-1bc4dcc8a8b4_1062x586.png" width="1062" height="586" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e352c8d9-efc1-4ae2-b388-1bc4dcc8a8b4_1062x586.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:586,&quot;width&quot;:1062,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:55007,&quot;alt&quot;:&quot;Key regions in the address space of a process: stack, heap, and data&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/161941599?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe352c8d9-efc1-4ae2-b388-1bc4dcc8a8b4_1062x586.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Key regions in the address space of a process: stack, heap, and data" title="Key regions in the address space of a process: stack, heap, and data" srcset="https://substackcdn.com/image/fetch/$s_!D0tD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe352c8d9-efc1-4ae2-b388-1bc4dcc8a8b4_1062x586.png 424w, https://substackcdn.com/image/fetch/$s_!D0tD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe352c8d9-efc1-4ae2-b388-1bc4dcc8a8b4_1062x586.png 848w, https://substackcdn.com/image/fetch/$s_!D0tD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe352c8d9-efc1-4ae2-b388-1bc4dcc8a8b4_1062x586.png 1272w, https://substackcdn.com/image/fetch/$s_!D0tD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe352c8d9-efc1-4ae2-b388-1bc4dcc8a8b4_1062x586.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption">Key regions in the address space of a process: stack, heap, and data</figcaption></figure></div><ul><li><p><strong>Stack segment:</strong> The <code>stack</code> segment is primarily used to implement function calls and to store function local data, such as variables and arguments. We will learn to use the stack when we talk about functions in assembly.</p></li><li><p><strong>Data Segment:</strong> The <code>data</code> segment is used to store static data. For example, whenever you create global variables or constants in your programs, the compiler may put them in the data segment.
The advantage of the data segment is that it is baked into the program binary and loaded during startup. As a result, there is no memory allocation overhead at runtime.</p></li><li><p><strong>Heap Segment:</strong> The <code>heap</code> segment is used for dynamic memory allocation at runtime, for example, when growing an array or creating nodes for a tree or a linked list.</p></li></ul><p>In this article, we will mostly use the data segment; we&#8217;ll cover the heap and the stack in future articles on dynamic memory allocation and function calls.</p><p>But before jumping to memory access modes, we should spend a few minutes learning how to do static memory allocation in the <code>.data</code> section, as we will be using static memory throughout the rest of this article.</p><h3>Static Memory Allocation in the .data section</h3><p>The data segment in the process&#8217;s address space is populated from the contents of the <code>.data</code> section of the executable binary. When we want to create static data in our program, such as global variables or constants, we put it in the <code>.data</code> section.</p><p>To create a static value in the <code>.data</code> section, we need to do three things:</p><ul><li><p><strong>Create a label</strong>: At the time of writing assembly, we don&#8217;t know the exact memory addresses of values or instructions, so we must use labels. At link time, the linker replaces labels with the final addresses in the object code it generates. Creating a label for the value thus gives us a way to refer to its address.</p></li><li><p><strong>Declare the size</strong>: We need to tell the assembler the size of the value, so that it can reserve that much space in the <code>.data</code> section.
If you read the <a href="https://blog.codingconfessions.com/p/x86-registers">article on registers</a>, you may recall that we have the following sizes:</p><ul><li><p><code>.quad</code>: For 8-byte values</p></li><li><p><code>.long</code>: For 4-byte values</p></li><li><p><code>.word</code>: For 2-byte values</p></li><li><p><code>.byte</code>: For single-byte values</p></li><li><p>Apart from these, we also have the <code>.asciz</code> directive to create a NUL-terminated ASCII string.</p></li></ul></li><li><p><strong>Declare the value</strong>: Finally, provide the value.</p></li></ul><p>The following example shows how we can create an 8-byte integer value in the <code>.data</code> section with the label <code>ANSWER_TO_LIFE</code>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wKRW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbac9321c-34f5-455b-899a-1e242d3a76ab_633x245.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wKRW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbac9321c-34f5-455b-899a-1e242d3a76ab_633x245.png 424w, https://substackcdn.com/image/fetch/$s_!wKRW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbac9321c-34f5-455b-899a-1e242d3a76ab_633x245.png 848w, https://substackcdn.com/image/fetch/$s_!wKRW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbac9321c-34f5-455b-899a-1e242d3a76ab_633x245.png 1272w, 
https://substackcdn.com/image/fetch/$s_!wKRW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbac9321c-34f5-455b-899a-1e242d3a76ab_633x245.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wKRW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbac9321c-34f5-455b-899a-1e242d3a76ab_633x245.png" width="633" height="245" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bac9321c-34f5-455b-899a-1e242d3a76ab_633x245.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:245,&quot;width&quot;:633,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43536,&quot;alt&quot;:&quot;Syntax for allocating data in the .data section&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/161941599?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbac9321c-34f5-455b-899a-1e242d3a76ab_633x245.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Syntax for allocating data in the .data section" title="Syntax for allocating data in the .data section" srcset="https://substackcdn.com/image/fetch/$s_!wKRW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbac9321c-34f5-455b-899a-1e242d3a76ab_633x245.png 424w, https://substackcdn.com/image/fetch/$s_!wKRW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbac9321c-34f5-455b-899a-1e242d3a76ab_633x245.png 848w, 
https://substackcdn.com/image/fetch/$s_!wKRW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbac9321c-34f5-455b-899a-1e242d3a76ab_633x245.png 1272w, https://substackcdn.com/image/fetch/$s_!wKRW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbac9321c-34f5-455b-899a-1e242d3a76ab_633x245.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Syntax for allocating data in the .data section</figcaption></figure></div><p>This example allocates a single 64-bit 
value, but it is also possible to create more complex structures. For instance, we can create a struct-like object as shown in the example below:</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.codingconfessions.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"><em>This series is exclusively for paid subscribers. Their support keeps this publication sustainable. To access this series and other exclusive content, please consider upgrading to a paid subscription.</em></p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>
      <p>
          <a href="https://blog.codingconfessions.com/p/x86-addressing-modes-part-1-immediate">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[A Systems Engineer’s Guide to Benchmarking with RDTSC]]></title><description><![CDATA[A deep dive into rdtsc, instruction stream serialization, and memory fences for precise cycle-level performance measurement.]]></description><link>https://blog.codingconfessions.com/p/rdtsc</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/rdtsc</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Thu, 23 Oct 2025 11:31:05 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!4Ex0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d1cd0-ef2a-4076-a0d1-cefae8e1206d_1536x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4Ex0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d1cd0-ef2a-4076-a0d1-cefae8e1206d_1536x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4Ex0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d1cd0-ef2a-4076-a0d1-cefae8e1206d_1536x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!4Ex0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d1cd0-ef2a-4076-a0d1-cefae8e1206d_1536x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!4Ex0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d1cd0-ef2a-4076-a0d1-cefae8e1206d_1536x1024.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!4Ex0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d1cd0-ef2a-4076-a0d1-cefae8e1206d_1536x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4Ex0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d1cd0-ef2a-4076-a0d1-cefae8e1206d_1536x1024.jpeg" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c47d1cd0-ef2a-4076-a0d1-cefae8e1206d_1536x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:245669,&quot;alt&quot;:&quot;Cover image: Depicting rdtsc, lfence and rdtscp instruction with a backdrop of a clock&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/173003537?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d1cd0-ef2a-4076-a0d1-cefae8e1206d_1536x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Cover image: Depicting rdtsc, lfence and rdtscp instruction with a backdrop of a clock" title="Cover image: Depicting rdtsc, lfence and rdtscp instruction with a backdrop of a clock" srcset="https://substackcdn.com/image/fetch/$s_!4Ex0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d1cd0-ef2a-4076-a0d1-cefae8e1206d_1536x1024.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!4Ex0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d1cd0-ef2a-4076-a0d1-cefae8e1206d_1536x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!4Ex0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d1cd0-ef2a-4076-a0d1-cefae8e1206d_1536x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!4Ex0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d1cd0-ef2a-4076-a0d1-cefae8e1206d_1536x1024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Cover image: Depicting rdtsc, lfence and rdtscp instruction with a backdrop of a clock</figcaption></figure></div><p>Performance is critical for systems programmers, and accurate benchmarking is the foundation of meaningful optimization. To truly understand where your code spends time, you need precise and low-overhead measurements, especially when a piece of code may execute in just a few hundred CPU cycles.</p><p>Most developers reach for familiar high-level timers, such as Python&#8217;s <a href="https://docs.python.org/3/library/time.html#time.perf_counter">time.perf_counter()</a> or Java&#8217;s <a href="https://docs.oracle.com/javase/8/docs/api/java/lang/System.html#currentTimeMillis--">System.currentTimeMillis()</a>. These are convenient but rely on system calls like <a href="https://man7.org/linux/man-pages/man3/clock_gettime.3.html">clock_gettime</a>, which introduce hundreds of cycles of overhead. In certain situations, that is too much, and when profiling production systems, you want the overhead to be as small as possible.</p><p>We need a way to read time directly from the hardware, without leaving user space. On x86 systems, that mechanism is the <code>rdtsc</code> instruction. It gives us near-zero-overhead access to the CPU&#8217;s internal timestamp counter, but using it correctly requires an understanding of how modern processors execute instructions.</p><p>In this article, we&#8217;ll learn how to use <code>rdtsc</code> to do benchmarking.
Specifically, we will cover the following topics in detail:</p><ul><li><p><strong>What </strong><code>rdtsc</code><strong> does:</strong> How it reads the CPU&#8217;s internal timestamp counter and why it provides near-zero-overhead timing.</p></li><li><p><strong>Understanding CPU behavior:</strong> How out-of-order execution can distort timing results and why instruction ordering matters.</p></li><li><p><strong>Instruction stream serialization:</strong> What it means, how the CPU reorders instructions, and how serializing instructions (like <code>cpuid</code>) enforce strict ordering.</p></li><li><p><strong>Memory fences:</strong> How <code>lfence</code>, <code>sfence</code>, and <code>mfence</code> provide lighter-weight ordering guarantees that help isolate measurement code.</p></li><li><p><strong>Combining it all:</strong> A practical example of using these mechanisms together to obtain stable and reproducible timing measurements.</p></li></ul><p>By the end, you&#8217;ll know not only how to use <code>rdtsc</code> safely and accurately but also <em>why</em> these extra steps are essential for meaningful microbenchmarking.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.codingconfessions.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"><em>Writing these deep dives takes 100+ hours of work. 
If you find this valuable and insightful, please consider upgrading to a paid subscription to keep this work alive.</em></p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>Understanding The Timestamp Counter in the CPU</h2><p>In the x86 architecture, every CPU comes with a special 64-bit counter, called the <em>timestamp counter</em> (TSC), that is incremented at a fixed frequency. If you can read the value of the counter before and after the execution of a block of code, you can accurately tell how many cycles that code took to execute.</p><p>When the counter overflows, it resets to 0. However, because it is a 64-bit counter, it takes an extremely long time to overflow. For instance, if the counter increments at a 1 GHz frequency, it will take about 585 years to overflow (2<sup>64</sup> nanoseconds is roughly 585 years).</p><p>The frequency at which the timestamp counter increments is not the same as the real CPU frequency. In the past, it used to be tied to the CPU frequency, but as recent CPUs started to support dynamic frequency scaling, the timestamp counter was made to tick at a fixed constant frequency to keep measurements stable. For example, some of the cores on my laptop have a frequency range of 800 MHz to 4800 MHz, but the TSC ticks at 2.3 GHz.</p><p>So, how do we read the TSC? The x86 instruction set provides two instructions for doing this: <code>rdtsc</code> and <code>rdtscp</code>. But measuring the timing of a block of code with these is not as simple as slapping <code>rdtsc</code> before and after the code block. In practice, it looks like the following code snippet:</p><pre><code>#include &lt;x86intrin.h&gt;

#include &lt;stdint.h&gt;

uint32_t cpuid;               /* receives the IA32_TSC_AUX value written by rdtscp */
_mm_lfence();                 /* earlier instructions must finish before the start read */
uint64_t start = __rdtsc();

for (int i = 0; i &lt; ITERS; i++) {
  // expensive loop body
}

uint64_t end = __rdtscp(&amp;cpuid); /* waits for all earlier instructions before reading the TSC */
_mm_lfence();                 /* later instructions cannot start before the end read */
uint64_t ncycles = end - start;</code></pre><p>In this snippet, I have used the GCC compiler intrinsics <code>__rdtsc</code> and <code>__rdtscp</code> to invoke the <code>rdtsc</code> and <code>rdtscp</code> instructions respectively. But you may ask: what is the significance of calling <code>_mm_lfence()</code> before and after the measurement? And why did we use <code>rdtsc</code> to read the starting value of the TSC but <code>rdtscp</code> for the ending measurement? To answer these questions, we have to go deeper and think about how the processor executes instructions.</p><h2>Out of Order Execution and Serializing Instructions</h2><p>Let&#8217;s step back a bit and talk about how the CPU executes instructions.</p><p>Modern x86 CPUs execute the instruction stream out of order so that multiple instructions can run in parallel. They do this by looking at a window of instructions in the stream, identifying independent ones, and executing them simultaneously. As a result, an instruction that appears later in program order may execute much earlier than its predecessors.</p><p>For example, imagine an instruction stream as shown in the snippet below. Here, we are interested in measuring the time taken to execute instructions <code>I4</code> to <code>I6</code>, so we have inserted an <code>rdtsc</code> instruction after <code>I3</code> and <code>I6</code>.</p><pre><code>I0, I1, I2, I3, rdtsc, I4, I5, I6, rdtsc,...</code></pre><p>Due to the out-of-order nature of instruction execution, we cannot guarantee that the <code>rdtsc</code> instructions will execute exactly where they appear in program order. It is possible that the CPU executes the first <code>rdtsc</code> right after <code>I1</code>. 
In that case, our measurement will include the timing of <code>I2</code> and <code>I3</code> as well, which is not what we want.</p><p>We need a way to force the CPU not to execute <code>rdtsc</code> out of order, and also to ensure that all the previous instructions have finished executing by the time it executes <code>rdtsc</code>. This can be achieved by forcing serialization of the instruction stream right before <code>rdtsc</code>. Let&#8217;s understand what that means.</p><h3>Serializing the Instruction Stream</h3><p>There are certain instructions in the x86 architecture that force serialization of the instruction stream. Essentially, a serializing instruction acts like a barrier: the CPU cannot execute it until all the instructions appearing before it in the program have finished, and it cannot begin executing any instruction appearing after the serializing instruction until the serializing instruction has finished.</p><blockquote><p><em>To be precise, a serializing instruction also requires that all flag, register and memory modifications finish before it executes, and that all the CPU buffers are drained.</em> </p></blockquote><p>So, if we insert such a serializing instruction before <code>rdtsc</code>, then we can guarantee that the <code>rdtsc</code> instruction will <em>not</em> be executed by the processor out of its actual order.</p><p>There are a few such serializing instructions available in the x86 architecture, such as:</p><ul><li><p><strong>serialize</strong>: serializes the instruction stream</p></li><li><p><strong>cpuid</strong>: used to identify the CPU model and features</p></li><li><p><strong>iret</strong>: returns control from an interrupt handler back to the interrupted application</p></li><li><p><strong>rsm</strong>: resumes from system management mode</p></li></ul><p>Out of these, <code>iret</code> and <code>rsm</code> are control flow modifying instructions, so you cannot use them solely for the purpose of 
serializing the instruction stream. In the past, <code>cpuid</code> was the recommended instruction for use in combination with <code>rdtsc</code>, and it is still an option today. However, it adds a slight overhead because the CPU must do extra work to execute it beyond serializing the instruction stream. A much lighter-weight alternative is the <code>lfence</code> instruction that we saw in the snippet above. <code>lfence</code> is not a proper serializing instruction, but a memory ordering instruction. However, it serves the purpose. Let&#8217;s understand what it does.</p><blockquote><p>We didn&#8217;t consider the <code>serialize</code> instruction because it is only available on Intel processors and missing on AMD. The instruction exists purely to serialize the instruction stream, so it would otherwise be a good option. Alas, it is not portable.</p></blockquote><h3>The <code>lfence</code> instruction</h3><p>An alternative to using serializing instructions with <code>rdtsc</code> is using memory ordering instructions, such as <code>lfence</code>, <code>sfence</code>, or <code>mfence</code>. These instructions add less overhead than pure serializing instructions, such as <code>cpuid</code>. Let&#8217;s understand how.</p>
      <p>
          <a href="https://blog.codingconfessions.com/p/rdtsc">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[My Top 5 Favourite Features in Python 3.14]]></title><description><![CDATA[Exploring the concurrency, debugging, and performance upgrades that make Python 3.14 special.]]></description><link>https://blog.codingconfessions.com/p/python-3-14-whats-new</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/python-3-14-whats-new</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Sat, 11 Oct 2025 08:45:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!YLWy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45241f2c-9f42-468b-be7a-cdebb6e79969_1536x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YLWy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45241f2c-9f42-468b-be7a-cdebb6e79969_1536x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YLWy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45241f2c-9f42-468b-be7a-cdebb6e79969_1536x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!YLWy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45241f2c-9f42-468b-be7a-cdebb6e79969_1536x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!YLWy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45241f2c-9f42-468b-be7a-cdebb6e79969_1536x1024.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!YLWy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45241f2c-9f42-468b-be7a-cdebb6e79969_1536x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YLWy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45241f2c-9f42-468b-be7a-cdebb6e79969_1536x1024.jpeg" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/45241f2c-9f42-468b-be7a-cdebb6e79969_1536x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:161192,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/175782380?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45241f2c-9f42-468b-be7a-cdebb6e79969_1536x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YLWy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45241f2c-9f42-468b-be7a-cdebb6e79969_1536x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!YLWy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45241f2c-9f42-468b-be7a-cdebb6e79969_1536x1024.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!YLWy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45241f2c-9f42-468b-be7a-cdebb6e79969_1536x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!YLWy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45241f2c-9f42-468b-be7a-cdebb6e79969_1536x1024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The Pi release of Python (so named because it is version 3.14, matching the digits of &#960;) is finally here. 
You can go through the full list of new features and major changes yourself in the <a href="https://docs.python.org/3.14/whatsnew/3.14.html">release notes</a>. In this post, I want to go through my top 5 favorite features of this release, which I find exciting both as a Python programmer and as an engineer who loves studying system internals.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.codingconfessions.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"><em>I usually write long, in-depth explainers, but today&#8217;s piece is a shorter look at what&#8217;s new in Python 3.14. If you enjoy this mix of quick takes and deep dives, you can support my work by upgrading to a paid plan.</em></p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>The Free-Threaded Python</h2><p>In practical terms, the free&#8209;threaded build allows Python programs to take advantage of multiple CPU cores concurrently, enabling true parallel execution of threads for compute&#8209;intensive workloads.</p><p>Until Python 3.13, it was not possible to run multiple threads in parallel in Python due to the global interpreter lock (GIL), which is a global mutex inside the Python interpreter. A thread needs to acquire this lock before it can run on the CPU. This meant that even if you had a large multicore machine, your Python process was still only using a single core. 
Solutions like <a href="https://docs.python.org/3/library/multiprocessing.html">multiprocessing</a> were created as a workaround for this limitation.</p><p>Prior to Python 3.13, <a href="https://peps.python.org/pep-0703/">PEP-703</a> was proposed to make the GIL optional. The PEP proposed a plan to introduce changes so that it would be possible to build a version of Python without the GIL by specifying a build flag.</p><p>These changes first shipped as an experimental build in Python 3.13 and are officially supported in the 3.14 release, which therefore comes in two versions: one with the GIL and one without it. If you use <a href="https://docs.astral.sh/uv/">uv</a>, you can install the two versions using these commands:</p><pre><code>uv python install cpython-3.14.0 # with the GIL
uv python install cpython-3.14.0t # without the GIL</code></pre><blockquote><p><em>Note that the free-threaded build of Python breaks the ABI, and all third-party packages that use the C API of CPython need to be recompiled, so not all scientific computing packages may be immediately available for use with it.</em> </p></blockquote><h3>Reference Reading</h3><p><a href="https://peps.python.org/pep-0703/">PEP-703</a>, which describes the work behind removing the GIL, is a great read to understand the challenges involved and how the work was done.</p><div><hr></div><h2>Concurrent Interpreters</h2><p>A very exciting new feature in the 3.14 release is the introduction of the <a href="https://docs.python.org/3/library/concurrent.interpreters.html">concurrent.interpreters</a> module in the standard library. It allows you to run multiple Python interpreters in parallel within the same Python process, enabling yet another kind of parallelism in Python despite the GIL.</p><p>The actual implementation details behind this are tricky to explain; I will do that in another post. But if you have read my article on <a href="https://blog.codingconfessions.com/p/cpython-runtime-internals">CPython runtime bootstrapping</a>, you might be able to put the pieces together. Here is the executive summary.</p><p>By default, the Python process has one main interpreter and one main thread. But now, you can create multiple interpreters on demand at runtime using the <code>concurrent.interpreters</code> module. These additional interpreters created at runtime are also referred to as <em>subinterpreters</em>. Creating a subinterpreter is as easy as calling the <code>create()</code> function of <code>concurrent.interpreters</code>. </p><pre><code>import concurrent.interpreters
interp1 = concurrent.interpreters.create()</code></pre><p>After the above call, the Python process has two interpreters inside it. Internally, the runtime tracks these using a linked list of <em>interpreter state</em> objects. An interpreter state represents the internal execution state of an interpreter. By providing each interpreter its own interpreter state, the runtime isolates them at Python code execution level.</p><p>To execute code on this new interpreter, we can invoke its <code>call()</code> method. For example:</p><pre><code>&gt;&gt;&gt; def sum(a,b):
...     return a + b
...
&gt;&gt;&gt; interp1.call(sum, 10, 2)
12</code></pre><p>However, this isn&#8217;t parallel execution because there is only one thread running in the Python process. So, the runtime simply switches the thread from executing the code inside the main interpreter to executing code inside the subinterpreter.</p><p>To execute code on the interpreter in its own thread, we can use the <code>call_in_thread()</code> method. Internally, this creates a new thread that executes the code in its own context. This is a non-blocking call and we cannot get the result back. So, to communicate data between interpreters, we have to create a queue using <code>concurrent.interpreters.create_queue()</code> method. Here is an example that puts all of this together.</p><pre><code>&gt;&gt;&gt; def add(q, a, b):
...   q.put(a+b)
...
&gt;&gt;&gt; interp1 = concurrent.interpreters.create()
&gt;&gt;&gt; queue = concurrent.interpreters.create_queue()
&gt;&gt;&gt; t = interp1.call_in_thread(add, queue, 10, 2)
&gt;&gt;&gt; result = queue.get()
&gt;&gt;&gt; print(result)
12
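&gt;&gt;&gt; # call_in_thread returns a threading.Thread, so we can
&gt;&gt;&gt; # join it to be sure the worker thread has fully exited
&gt;&gt;&gt; t.join()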
</code></pre><p>Here, we have created a queue, and passed it to the <code>add</code> function. The <code>add</code> function puts the result in the queue. In the main interpreter, we poll the queue for the result using its <code>get()</code> method, which blocks until there is some data in the queue.</p><p>If you are curious about how all of this works under the hood, let me know and we can cover the internals in a future post.</p><h3>Reference Reading</h3><p>If you want to learn more about the runtime data structures behind this, I recommend the following article:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;665679b3-abeb-4056-a0dd-a53e4b39f7ac&quot;,&quot;caption&quot;:&quot;While this article is freely available to read online, I am also making a PDF of this article available. If you enjoy reading in that format, you can purchase it at the below link. If you are a paid subscriber you can find a 100% discount code in the header of the email, or just reach out to me via email or DM and I will give you the PDF.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;CPython Runtime Internals: Key Data Structures &amp; Runtime Bootstrapping&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. 
I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2024-04-26T15:08:37.790Z&quot;,&quot;cover_image&quot;:&quot;https://images.unsplash.com/photo-1599837565318-67429bde7162?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNXx8cHl0aG9uJTIwcHJvY2Vzc3xlbnwwfHx8fDE3MTQxMzk0Nzl8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/cpython-runtime-internals&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:143895035,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:36,&quot;comment_count&quot;:0,&quot;publication_id&quot;:1611829,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!lstI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><h2>Remote Debugging Support</h2><p>Beyond concurrency, Python 3.14 also introduces major improvements in tooling.</p><p>Debugging running Python processes has always been a pain. In order to debug it using a debugger, such as pdb, you need to manually add breakpoints in the code, then restart the process and wait for them to be hit again. 
In production systems, this can be infeasible.</p><p>The motivation for the new feature is to simplify this experience: with Python 3.14, you can attach to a running process using <code>python -m pdb -p &lt;pid&gt;</code>, eliminating the need to restart it.</p><p>Technically, the CPython interpreter already had provisions to allow remote processes to connect to it and navigate its runtime state. This is how remote profilers, such as <a href="https://github.com/plasma-umass/scalene">scalene</a>, <a href="https://github.com/benfred/py-spy">py-spy</a> and others work. As part of <a href="https://peps.python.org/pep-0768/">PEP-768</a>, this framework has been extended to allow debuggers to connect and debug the Python interpreter.</p><p>A debugger can now attach to a Python process and update specific fields in its runtime data structures to signal that it wants to begin debugging. When the interpreter detects this, it provides a debug prompt where you can set breakpoints and debug as usual.</p><p>While pdb has already been updated to support remote debugging, this framework also exposes an API, <a href="https://docs.python.org/3.14/library/sys.html#sys.remote_exec">sys.remote_exec</a>, so external debuggers can leverage this functionality without needing low-level C integration. </p><h3>Reference Video</h3><p>In a past live session, I talked about how remote profilers work, which is essentially how the remote debugger has been implemented as well. So, if you are curious, give it a watch.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;fdee18cf-9df8-4cd2-a9b7-ee2d7698e2a8&quot;,&quot;caption&quot;:&quot;Yesterday, we did the live session on the internals of remote sampling profilers. We learned the internals that are required to build such tools. 
Building these tools is probably one of the most interesting systems programming projects that you can do to not only learn the internals of a programming language, but also learn the ELF file format.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Recording: CPython and ELF Essentials for Building a Basic Remote Profiler&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2024-06-03T05:23:22.416Z&quot;,&quot;cover_image&quot;:&quot;https://substack-video.s3.amazonaws.com/video_upload/post/145244402/e6506f7e-23b9-4ec7-b8de-4dc5fd85571f/transcoded-00001.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/recording-cpython-and-elf-essentials&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:&quot;e6506f7e-23b9-4ec7-b8de-4dc5fd85571f&quot;,&quot;id&quot;:145244402,&quot;type&quot;:&quot;podcast&quot;,&quot;reaction_count&quot;:18,&quot;comment_count&quot;:0,&quot;publication_id&quot;:1611829,&quot;publication_name&quot;:&quot;Confessions of a Code 
Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!lstI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><h2>Incremental Garbage Collection</h2><p>Complementing the concurrency and debugging improvements discussed earlier, this feature enhances runtime stability and responsiveness by addressing garbage collection performance.</p><p>In a past article, I explained in detail the <a href="https://blog.codingconfessions.com/p/connecting-cpythons-gc-internals">cost of a full heap scan by the garbage collector in CPython</a>. Needless to say it is expensive, and moreover, it also introduces unpredictable latency delays in the performance of your APIs, because when the GC is running, the interpreter does not execute any Python code. Incremental garbage collection makes the GC overhead predictable, resulting in smoother performance for latency-sensitive workloads.</p><p>Let&#8217;s first understand how the GC used to work before this change. There were three collectable generations: young generation, old generation, and the oldest generation. There were configurable thresholds for each generation that would define when the GC would scan each of those generations. For example, the young generation would be scanned once the number of objects in it exceeds 10,000.</p><p>Any object that survives a scan of the young generation gets promoted to the first old generation. The first old generation gets scanned when the young generation has been scanned a configured number of times, such as 10 times. When that happens, the GC scans both the young gen and the first old gen. 
Any object that survives a scan of the first old generation gets promoted to the 2nd old generation (also known as the oldest generation).</p><p>The oldest generation is scanned when the first old generation has been scanned a configured number of times. When that threshold is reached, the GC performs a full heap scan, i.e. all the three generations. Naturally, this gets expensive.</p><p>Incremental garbage collection improves this. It reduces the number of GC generations to just two: young and old. On each GC cycle, the collector scans the young generation and a fraction of the old generation. This way, the amount of work that the GC does on each cycle becomes consistent and it eliminates those long pauses and latency spikes that were there due to a full heap scan.</p><h3>Reference Reading</h3><p>If you want to read more about CPython&#8217;s garbage collector, I recommend the following articles:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;92aaed4c-771c-4103-99b7-e88cfcf7a469&quot;,&quot;caption&quot;:&quot;We&#8217;ve been talking about CPython internals and in the last article I went quite deep in CPython&#8217;s runtime. One of the crucial services that the runtime provides is that of managing a program&#8217;s memory during execution.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;CPython Garbage Collection: The Internal Mechanics and Algorithms&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. 
I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2024-06-11T13:23:25.751Z&quot;,&quot;cover_image&quot;:&quot;https://images.unsplash.com/photo-1503596476-1c12a8ba09a9?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHxnYXJiYWdlJTIwY29sbGVjdGlvbnxlbnwwfHx8fDE3MTgxMDU0OTh8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/cpython-garbage-collection-internals&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:144615668,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:42,&quot;comment_count&quot;:1,&quot;publication_id&quot;:1611829,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!lstI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;a218af0c-0055-485e-b72e-28dfaab77d99&quot;,&quot;caption&quot;:&quot;A while back I published a detailed code walkthrough of CPython's GC implementation, but there was a need for a higher level explanation of the overall memory management mechanism of CPython without discussing the code. This article fills that gap. 
It provides a detailed overview of the overall memory management mechanism in CPython. The main focus is o&#8230;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;CPython's Garbage Collector and its Impact on Application Performance&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2024-10-02T15:58:15.592Z&quot;,&quot;cover_image&quot;:&quot;https://images.unsplash.com/photo-1620043823875-ccda6ea05e78?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw4NXx8Z2FyYmFnZXxlbnwwfHx8fDE3Mjc4NzY3NjJ8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/connecting-cpythons-gc-internals&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:149651253,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:22,&quot;comment_count&quot;:0,&quot;publication_id&quot;:1611829,&quot;publication_name&quot;:&quot;Confessions of a Code 
Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!lstI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><h2>Tail Calling Interpreter</h2><p>Finally, my favorite change in this release is the tail calling interpreter. It is a rewrite of the bytecode dispatch loop in the CPython virtual machine and improves the performance of Python code execution by ~5%. </p><p>The bytecode dispatch loop is the heart of the interpreter, where the bytecode instructions of your compiled Python program are evaluated. The faster this loop runs, the faster your Python program executes, so performance improvements in this area are always exciting to understand. I have already written a very detailed article on <a href="https://blog.codingconfessions.com/p/cpython-vm-internals">the design and implementation of the dispatch loop in CPython</a>, and I have another article in progress to explain the tail calling interpreter. So, I will be brief here.</p><p>Your Python program gets compiled to a sequence of bytecode instructions. For example, the following snippet shows the bytecode instructions for a single line of code: <code>a + b</code>. The bytecode dispatch loop iterates over these instructions one by one and executes them. </p><pre><code>&gt;&gt;&gt; import dis
&gt;&gt;&gt; dis.dis("a + b")
  0           0 RESUME                   0

  1           2 LOAD_NAME                0 (a)
              4 LOAD_NAME                1 (b)
              6 BINARY_OP                0 (+)
             10 RETURN_VALUE
</code></pre><p>The most obvious way of writing this loop is using a switch case. The problem is that Python has hundreds of bytecode instructions, so this switch case becomes huge. Optimizing such large functions is hard for compilers. For example, the compiler cannot allocate registers optimally, and some of the key variables can get spilled onto the stack, resulting in poor performance.</p><blockquote><p><em>CPython also has a computed goto based implementation of the dispatch loop, but it suffers from the same problem. If you are not familiar with computed goto based dispatch loops, read my article on <a href="https://blog.codingconfessions.com/p/cpython-vm-internals">the design and implementation of the CPython dispatch loop</a>.</em></p></blockquote><p>The tail calling interpreter solves this by separating the implementation of each bytecode instruction into an individual function. For example, there is one function for handling LOAD_NAME, another for BINARY_OP, and so on.</p><p>This implementation is called the tail calling interpreter because of the way these functions are written. At their end, instead of returning, these functions call the function for the next bytecode instruction. They do this by looking up a function pointer table using the next bytecode instruction as an index. The signature and return value of each of these functions is identical, and because these calls occur at the end of the function, they are tail calls. The compiler can optimize these tail calls into jumps, which avoids the overhead of function calls.</p><p>This implementation improves performance for one fundamental reason:</p><div class="pullquote"><p><em>It results in small functions for handling each bytecode instruction, which the compiler can optimize much better, with optimal register allocation.</em></p></div><p>Overall, this has shown improvement over the previous switch case and computed goto based implementations. 
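</p><p><em>To make this concrete, here is a toy sketch in Python of dispatching through a table of small per-instruction handler functions instead of one big switch. The opcodes and handlers below are made up for illustration; they are not CPython&#8217;s real instruction set, and since this is Python, the calls are not actually converted into jumps the way a C compiler would do it.</em></p>

```python
# Toy dispatch through a table of per-opcode handler functions.
# Hypothetical opcodes: not CPython's real instruction set.

def op_load_const(stack, consts, arg):
    stack.append(consts[arg])

def op_binary_add(stack, consts, arg):
    right = stack.pop()
    left = stack.pop()
    stack.append(left + right)

# The function table, analogous to the function pointer table that
# the tail calling interpreter indexes with the next opcode.
HANDLERS = [op_load_const, op_binary_add]

def run(bytecode, consts):
    stack = []
    for opcode, arg in bytecode:
        # One small, separately optimizable function per instruction.
        HANDLERS[opcode](stack, consts, arg)
    return stack.pop()

# Computes 2 + 3: load consts[0], load consts[1], then add.
print(run([(0, 0), (0, 1), (1, 0)], [2, 3]))  # prints 5
```

<p>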
However, it requires compiler support for performing tail call optimization which is not present in all compilers. As a result, right now the feature is opt-in only and you need to build CPython from source using a supported compiler, such as clang 19. </p><h3>Reference Reading</h3><p>If you want to understand the internals of the CPython bytecode interpreter and the dispatch loop, read the following article:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;afa0fbf7-86d6-4fa9-8b22-b1e387565694&quot;,&quot;caption&quot;:&quot;For every bytecode compiled language, the most interesting part of its implementation is its virtual machine (also referred to as the bytecode interpreter) where the bytecode execution takes place. Because this is such a crucial part of the language machinery, its implementation has to be highly performant. Even if you are not a compiler engineer, learning about such internal implementation can give you new performance tricks and insights that you may be able to use in other places of your job. And, if you are a compiler engineer then you should always look around how other languages are implemented to pickup implementation details that you may not be aware of.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The Design &amp; Implementation of the CPython Virtual Machine&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. 
I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2024-08-31T14:35:14.115Z&quot;,&quot;cover_image&quot;:&quot;https://images.unsplash.com/photo-1504639725590-34d0984388bd?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxOXx8dmlydHVhbCUyMG1hY2hpbmV8ZW58MHx8fHwxNzI1MDI0MzE1fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/cpython-vm-internals&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:143567425,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:46,&quot;comment_count&quot;:0,&quot;publication_id&quot;:1611829,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!lstI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><h2>Wrapping Up</h2><p>Although there are many other new features and improvements in this release of Python, I picked these because of my interest in Python internals and performance. Apart from that, changes such as remote debugger and GIL removal are also very exciting to understand from an engineering point of view. Studying these can give you insights that can help you improve as an engineer. 
</p><p>I have plans to write about some of these in future posts. But if you would like me to cover something specific, let me know.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.codingconfessions.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"><em>If you enjoyed this dive into Python 3.14, consider becoming a paid subscriber, it helps me keep sharing more of these focused explorations of Python internals.</em></p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.codingconfessions.com/p/python-3-14-whats-new?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.codingconfessions.com/p/python-3-14-whats-new?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><div class="poll-embed" data-attrs="{&quot;id&quot;:388745}" data-component-name="PollToDOM"></div><p></p>]]></content:encoded></item><item><title><![CDATA[Understanding Weak References in Python]]></title><description><![CDATA[Understanding Python&#8217;s memory management with weak references]]></description><link>https://blog.codingconfessions.com/p/a-strong-reference-to-weak-references</link><guid 
isPermaLink="false">https://blog.codingconfessions.com/p/a-strong-reference-to-weak-references</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Tue, 30 Sep 2025 15:01:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!_lZX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe8ea5e-8fd1-494e-aeae-ba0d0c961b3f_1536x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_lZX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe8ea5e-8fd1-494e-aeae-ba0d0c961b3f_1536x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_lZX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe8ea5e-8fd1-494e-aeae-ba0d0c961b3f_1536x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!_lZX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe8ea5e-8fd1-494e-aeae-ba0d0c961b3f_1536x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!_lZX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe8ea5e-8fd1-494e-aeae-ba0d0c961b3f_1536x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!_lZX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe8ea5e-8fd1-494e-aeae-ba0d0c961b3f_1536x1024.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!_lZX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe8ea5e-8fd1-494e-aeae-ba0d0c961b3f_1536x1024.jpeg" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cbe8ea5e-8fd1-494e-aeae-ba0d0c961b3f_1536x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:120448,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/172320360?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe8ea5e-8fd1-494e-aeae-ba0d0c961b3f_1536x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_lZX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe8ea5e-8fd1-494e-aeae-ba0d0c961b3f_1536x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!_lZX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe8ea5e-8fd1-494e-aeae-ba0d0c961b3f_1536x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!_lZX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe8ea5e-8fd1-494e-aeae-ba0d0c961b3f_1536x1024.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!_lZX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe8ea5e-8fd1-494e-aeae-ba0d0c961b3f_1536x1024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Cover: Strong reference vs Weak Reference</figcaption></figure></div><p>When working with Python (and many other languages), you often rely on the runtime to manage memory for you. 
Most of the time this works invisibly, but certain patterns such as objects that reference each other in cycles, long lived caches, or subscriber lists can create memory leaks if not handled carefully.</p><p>This happens because Python always creates strong references to objects, which means an object will be kept alive as long as any strong references to it exist in the program. But when used in cyclic data structures or in caches, these strong references can unnecessarily delay the deallocation of those objects.</p><p>Weak references provide a way to refer to objects without preventing them from being garbage collected. They let you build caches that automatically empty, subscriber lists that clean themselves up, and other data structures that will not accidentally extend object lifetimes.</p><p>In this article we will explore what weak references are, why they matter, and how to use them in Python. We will start with a review of reference counting, look at its limitations, and then dive into weak references and their practical uses.</p><p></p><div><hr></div><h3><strong>CodeRabbit: Free AI Code Reviews in CLI (</strong><em><strong>Sponsored</strong></em><strong>)</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://coderabbit.link/fIVg8LI" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hJCC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png 424w, https://substackcdn.com/image/fetch/$s_!hJCC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png 848w, 
https://substackcdn.com/image/fetch/$s_!hJCC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png 1272w, https://substackcdn.com/image/fetch/$s_!hJCC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hJCC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png" width="1456" height="879" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:879,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;CodeRabbit CLI: A Code Review Agent to Review your AI Generated Code&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://coderabbit.link/fIVg8LI&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="CodeRabbit CLI: A Code Review Agent to Review your AI Generated Code" title="CodeRabbit CLI: A Code Review Agent to Review your AI Generated Code" srcset="https://substackcdn.com/image/fetch/$s_!hJCC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png 424w, 
https://substackcdn.com/image/fetch/$s_!hJCC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png 848w, https://substackcdn.com/image/fetch/$s_!hJCC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png 1272w, https://substackcdn.com/image/fetch/$s_!hJCC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" 
y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">CodeRabbit CLI: A Code Review Agent to Review your AI Generated Code</figcaption></figure></div><p>As developers increasingly turn to CLI coding agents like Claude Code for rapid development, a critical gap emerges: who reviews the AI-generated code? CodeRabbit CLI fills this void by delivering senior-level code reviews directly in your terminal, creating a seamless workflow where code generation flows directly into automated validation. Review uncommitted changes, catch AI hallucinations, and get one-click fixes - all without leaving your command line. It&#8217;s the quality gate that makes autonomous coding truly possible, ensuring every line of AI-generated code meets production standards before it ships.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://coderabbit.link/fIVg8LI&quot;,&quot;text&quot;:&quot;Get Started Today&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://coderabbit.link/fIVg8LI"><span>Get Started Today</span></a></p><div><hr></div><h2>A Review of Reference Counting</h2><p>Many languages either use reference counting as a mechanism to manage runtime memory or they provide first class primitives for using reference counting. </p><p>In this scheme, every object has an associated reference count: the number of places it is being referenced. For example, when you create an object and assign it to a variable, it will have a reference count of 1. When you assign it to another variable or pass it to another function, its reference count will go up by 1.</p><p>Similarly, when a variable goes out of scope or a function call returns, the reference count gets decremented. 
If the reference count of the object reaches 0, it gets deallocated or garbage collected.</p><p>CPython uses reference counting for managing the memory of its runtime. Other languages offer it as well. For example, in C++ or Rust, when you use a smart pointer, it uses reference counting under the hood: the compiler generates code that increments and decrements the reference count of the objects.</p><p><em>If you want to understand how CPython implements reference counting internally, you can check out my article on that topic:</em></p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;d6b010ff-833b-4c87-b257-d6c3dd451b8d&quot;,&quot;caption&quot;:&quot;This week we are diverting from AI and machine learning to discuss a more intense CS topic &#8212; memory management in Python. Memory management refers to the techniques used by the programming language runtime to allocate and free memory as programs execute. Understanding how memory management functions in a language is crucial to writing efficient and high&#8230;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How CPython Implements Reference Counting: Dissecting CPython Internals &quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. 
I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2023-08-16T17:57:20.848Z&quot;,&quot;cover_image&quot;:&quot;https://images.unsplash.com/photo-1624953587687-daf255b6b80a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwyfHxweXRob258ZW58MHx8fHwxNjkyMjU0ODk0fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/cpython-reference-counting-internals&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:135935087,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:35,&quot;comment_count&quot;:12,&quot;publication_id&quot;:1611829,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!lstI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><h2>Limitations of Reference Counting</h2><p>Reference counting works well for most cases, but it is not a complete solution. Its simplicity comes with trade&#8209;offs, and understanding these limitations helps motivate why Python also offers weak references.</p><p>One of those limitations is cyclic references. Cyclic references exist when objects hold references to each other in a cycle, e.g. in a graph data structure. 
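</p><p><em>A minimal (and deliberately artificial) illustration: two objects that point at each other stay alive after all outside references are gone, until the cyclic collector runs.</em></p>

```python
import gc

class Node:
    def __init__(self):
        self.other = None

a, b = Node(), Node()
a.other = b
b.other = a      # the two objects now form a reference cycle

del a, b         # their reference counts stay above zero because of
                 # the cycle, so neither object is freed here
unreachable = gc.collect()  # the cyclic GC detects and frees them
print(unreachable >= 2)     # prints True: at least the two Node objects
```

<p>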
But you can also end up creating cyclic references accidentally in complex systems. In such cases, the objects that are part of the cycle will never get freed until the cycle is broken. This is why CPython also implements a cycle breaking garbage collector (GC) that runs periodically, scans objects for cycles, and breaks any cycles that are no longer referenced from anywhere else so that those objects can be freed. </p><p>Cyclic references can be problematic for performance because memory usage remains high until the GC runs, and the GC scan itself can be expensive (depending on the number of objects it needs to scan). </p><p>We can understand this with the help of an example. Consider the following code:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Sw9R!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa61bff9c-0723-4dee-b874-8a2d0f66dfc4_1373x720.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Sw9R!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa61bff9c-0723-4dee-b874-8a2d0f66dfc4_1373x720.png 424w, https://substackcdn.com/image/fetch/$s_!Sw9R!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa61bff9c-0723-4dee-b874-8a2d0f66dfc4_1373x720.png 848w, https://substackcdn.com/image/fetch/$s_!Sw9R!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa61bff9c-0723-4dee-b874-8a2d0f66dfc4_1373x720.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Sw9R!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa61bff9c-0723-4dee-b874-8a2d0f66dfc4_1373x720.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Sw9R!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa61bff9c-0723-4dee-b874-8a2d0f66dfc4_1373x720.png" width="1200" height="629.278951201748" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a61bff9c-0723-4dee-b874-8a2d0f66dfc4_1373x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:720,&quot;width&quot;:1373,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:160107,&quot;alt&quot;:&quot;import gc import sys   class MyNode:     def __init__(self, name: str):         self.name: str = name         self.next = None      def __del__(self):         print(f\&quot;{self.name} is being deleted\&quot;)  def print_node_objects():     obj_count = 0     for o in gc.get_objects():         if type(o) is MyNode:             obj_count += 1             print(f\&quot;{o.name} exists with referrers:                 {[n.name for n in gc.get_referrers(o) if type(n) is MyNode]}\&quot;)     if obj_count == 0:         print(\&quot;No MyNode objects found\&quot;)  def test1():     n1 = MyNode(\&quot;n1\&quot;)     n2 = MyNode(\&quot;n2\&quot;)     print(f\&quot;n1 refcount: {sys.getrefcount(n1)}\&quot;)     print(f\&quot;n2 refcount: {sys.getrefcount(n2)}\&quot;)   if __name__ == '__main__':     test1()     
print_node_objects()&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/172320360?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa61bff9c-0723-4dee-b874-8a2d0f66dfc4_1373x720.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="import gc import sys   class MyNode:     def __init__(self, name: str):         self.name: str = name         self.next = None      def __del__(self):         print(f&quot;{self.name} is being deleted&quot;)  def print_node_objects():     obj_count = 0     for o in gc.get_objects():         if type(o) is MyNode:             obj_count += 1             print(f&quot;{o.name} exists with referrers:                 {[n.name for n in gc.get_referrers(o) if type(n) is MyNode]}&quot;)     if obj_count == 0:         print(&quot;No MyNode objects found&quot;)  def test1():     n1 = MyNode(&quot;n1&quot;)     n2 = MyNode(&quot;n2&quot;)     print(f&quot;n1 refcount: {sys.getrefcount(n1)}&quot;)     print(f&quot;n2 refcount: {sys.getrefcount(n2)}&quot;)   if __name__ == '__main__':     test1()     print_node_objects()" title="import gc import sys   class MyNode:     def __init__(self, name: str):         self.name: str = name         self.next = None      def __del__(self):         print(f&quot;{self.name} is being deleted&quot;)  def print_node_objects():     obj_count = 0     for o in gc.get_objects():         if type(o) is MyNode:             obj_count += 1             print(f&quot;{o.name} exists with referrers:                 {[n.name for n in gc.get_referrers(o) if type(n) is MyNode]}&quot;)     if obj_count == 0:         print(&quot;No MyNode objects found&quot;)  def test1():     n1 = MyNode(&quot;n1&quot;)     n2 = MyNode(&quot;n2&quot;)     print(f&quot;n1 refcount: 
{sys.getrefcount(n1)}&quot;)     print(f&quot;n2 refcount: {sys.getrefcount(n2)}&quot;)   if __name__ == '__main__':     test1()     print_node_objects()" srcset="https://substackcdn.com/image/fetch/$s_!Sw9R!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa61bff9c-0723-4dee-b874-8a2d0f66dfc4_1373x720.png 424w, https://substackcdn.com/image/fetch/$s_!Sw9R!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa61bff9c-0723-4dee-b874-8a2d0f66dfc4_1373x720.png 848w, https://substackcdn.com/image/fetch/$s_!Sw9R!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa61bff9c-0723-4dee-b874-8a2d0f66dfc4_1373x720.png 1272w, https://substackcdn.com/image/fetch/$s_!Sw9R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa61bff9c-0723-4dee-b874-8a2d0f66dfc4_1373x720.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Demonstration of how reference counting works in Python</figcaption></figure></div><p></p><p>Let&#8217;s break it down:</p><ul><li><p>The <code>MyNode</code> class implements a linked list node with a <code>next</code> field.</p></li><li><p><code>print_node_objects</code> is a utility function. It finds all the <code>MyNode</code> objects that are currently alive and then prints their referrers, i.e., who is holding a reference to them.</p><ul><li><p>It uses <code>gc.get_objects()</code> to get the list of all the currently alive objects in the Python interpreter and filters it down by type, selecting only <code>MyNode</code> objects.</p></li><li><p>It finds the referrers to an object using the <code>gc.get_referrers()</code> function, which returns a list of referrer objects. We filter this list by type because, during the call, the <code>gc</code> module itself becomes a referrer and we want to filter it out.</p></li></ul></li><li><p>In the main block we call <code>test1()</code>, which creates two <code>MyNode</code> objects, prints their reference counts, and returns. After returning from <code>test1</code>, we call <code>print_node_objects()</code> to see whether any <code>MyNode</code> objects are still alive.</p></li></ul><p>If you run this program, you should see output like the following:</p><pre><code>&#10140; uv run --python 3.13 --  cycles.py
n1 refcount: 2
n2 refcount: 2
n1 is being deleted
n2 is being deleted
No MyNode objects found
</code></pre><p>This is pretty much the expected output, but let&#8217;s spend a moment to make sure we don&#8217;t miss anything.</p><ul><li><p>We see that the reference count for both <code>n1</code> and <code>n2</code> is 2. You might expect it to be 1, but it is 2 because the call to <code>sys.getrefcount</code> itself temporarily increments the object&#8217;s reference count. </p></li><li><p>We see that the <code>__del__</code> method of both objects gets called and prints a message. This happens because <code>n1</code> and <code>n2</code> are local variables inside <code>test1()</code>, and when it returns, its stack frame gets destroyed, which decrements the reference counts of all of its local objects (parameters and locally created variables). In this case, because <code>n1</code> and <code>n2</code> reached reference count 0, they were deallocated and their <code>__del__</code> method was called.</p></li><li><p>Finally, in the main block, when <code>print_node_objects()</code> is called, it does not find any <code>MyNode</code> objects on the heap that are still alive.</p></li></ul><p>Next, we can do another test that creates a cycle between <code>n1</code> and <code>n2</code> and see that the objects stay alive after the return from the test function. 
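<p>Concretely, the cycle consists of the two nodes&#8217; <code>next</code> fields pointing at each other. Here is a minimal, self-contained sketch (not the article&#8217;s exact script; names are illustrative) showing the objects surviving the function&#8217;s return:</p>

```python
import gc
import sys


class MyNode:
    def __init__(self, name):
        self.name = name
        self.next = None

    def __del__(self):
        print(f"{self.name} is being deleted")


def make_cycle():
    n1 = MyNode("n1")
    n2 = MyNode("n2")
    n1.next = n2  # n2 gains a strong reference
    n2.next = n1  # n1 gains a strong reference: a cycle
    # 3 = the local variable + the other node's next field
    #   + getrefcount's own argument
    print(f"n1 refcount: {sys.getrefcount(n1)}")


make_cycle()
# The stack frame is gone, but the cycle keeps both nodes alive:
survivors = sorted(o.name for o in gc.get_objects() if type(o) is MyNode)
print(survivors)  # ['n1', 'n2']
```

<p>Because neither node&#8217;s reference count can reach 0 while the other still points at it, reference counting alone never reclaims them.</p>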
The following figure shows the updated code where I&#8217;ve added a new function <code>test2()</code> and called it from main.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8ocV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04a29c40-1783-4914-8226-a15496232c18_1081x920.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8ocV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04a29c40-1783-4914-8226-a15496232c18_1081x920.png 424w, https://substackcdn.com/image/fetch/$s_!8ocV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04a29c40-1783-4914-8226-a15496232c18_1081x920.png 848w, https://substackcdn.com/image/fetch/$s_!8ocV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04a29c40-1783-4914-8226-a15496232c18_1081x920.png 1272w, https://substackcdn.com/image/fetch/$s_!8ocV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04a29c40-1783-4914-8226-a15496232c18_1081x920.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8ocV!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04a29c40-1783-4914-8226-a15496232c18_1081x920.png" width="1200" height="1021.2765957446809" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/04a29c40-1783-4914-8226-a15496232c18_1081x920.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:920,&quot;width&quot;:1081,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:135513,&quot;alt&quot;:&quot;#!/usr/bin/env python  import gc import sys  class MyNode:     def __init__(self, name: str):         self.name: str = name         self.next = None      def __del__(self):         print(f\&quot;{self.name} is being deleted\&quot;)  def print_node_objects():     obj_count = 0     for o in gc.get_objects():         if type(o) is MyNode:             obj_count += 1             print(f\&quot;{o.name} exists with referrers: {[n.name for n in gc.get_referrers(o) if type(n) is MyNode]}\&quot;)     if obj_count == 0:         print(\&quot;No MyNode objects found\&quot;)  def test2():     n1 = MyNode(\&quot;n1\&quot;)     n2 = MyNode(\&quot;n2\&quot;)      n1.next = n2     n2.next = n1     print(f\&quot;n1 refcount: {sys.getrefcount(n1)}\&quot;)     print(f\&quot;n2 refcount: {sys.getrefcount(n2)}\&quot;)  def test1():     n1 = MyNode(\&quot;n1\&quot;)     n2 = MyNode(\&quot;n2\&quot;)     print(f\&quot;n1 refcount: {sys.getrefcount(n1)}\&quot;)     print(f\&quot;n2 refcount: {sys.getrefcount(n2)}\&quot;)   if __name__ == '__main__':     test1()     print_node_objects()     print(\&quot;---------------------\&quot;)     test2()     print_node_objects()&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/172320360?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04a29c40-1783-4914-8226-a15496232c18_1081x920.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" 
class="sizing-large" alt="#!/usr/bin/env python  import gc import sys  class MyNode:     def __init__(self, name: str):         self.name: str = name         self.next = None      def __del__(self):         print(f&quot;{self.name} is being deleted&quot;)  def print_node_objects():     obj_count = 0     for o in gc.get_objects():         if type(o) is MyNode:             obj_count += 1             print(f&quot;{o.name} exists with referrers: {[n.name for n in gc.get_referrers(o) if type(n) is MyNode]}&quot;)     if obj_count == 0:         print(&quot;No MyNode objects found&quot;)  def test2():     n1 = MyNode(&quot;n1&quot;)     n2 = MyNode(&quot;n2&quot;)      n1.next = n2     n2.next = n1     print(f&quot;n1 refcount: {sys.getrefcount(n1)}&quot;)     print(f&quot;n2 refcount: {sys.getrefcount(n2)}&quot;)  def test1():     n1 = MyNode(&quot;n1&quot;)     n2 = MyNode(&quot;n2&quot;)     print(f&quot;n1 refcount: {sys.getrefcount(n1)}&quot;)     print(f&quot;n2 refcount: {sys.getrefcount(n2)}&quot;)   if __name__ == '__main__':     test1()     print_node_objects()     print(&quot;---------------------&quot;)     test2()     print_node_objects()" title="#!/usr/bin/env python  import gc import sys  class MyNode:     def __init__(self, name: str):         self.name: str = name         self.next = None      def __del__(self):         print(f&quot;{self.name} is being deleted&quot;)  def print_node_objects():     obj_count = 0     for o in gc.get_objects():         if type(o) is MyNode:             obj_count += 1             print(f&quot;{o.name} exists with referrers: {[n.name for n in gc.get_referrers(o) if type(n) is MyNode]}&quot;)     if obj_count == 0:         print(&quot;No MyNode objects found&quot;)  def test2():     n1 = MyNode(&quot;n1&quot;)     n2 = MyNode(&quot;n2&quot;)      n1.next = n2     n2.next = n1     print(f&quot;n1 refcount: {sys.getrefcount(n1)}&quot;)     print(f&quot;n2 refcount: {sys.getrefcount(n2)}&quot;)  def test1():     n1 = 
MyNode(&quot;n1&quot;)     n2 = MyNode(&quot;n2&quot;)     print(f&quot;n1 refcount: {sys.getrefcount(n1)}&quot;)     print(f&quot;n2 refcount: {sys.getrefcount(n2)}&quot;)   if __name__ == '__main__':     test1()     print_node_objects()     print(&quot;---------------------&quot;)     test2()     print_node_objects()" srcset="https://substackcdn.com/image/fetch/$s_!8ocV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04a29c40-1783-4914-8226-a15496232c18_1081x920.png 424w, https://substackcdn.com/image/fetch/$s_!8ocV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04a29c40-1783-4914-8226-a15496232c18_1081x920.png 848w, https://substackcdn.com/image/fetch/$s_!8ocV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04a29c40-1783-4914-8226-a15496232c18_1081x920.png 1272w, https://substackcdn.com/image/fetch/$s_!8ocV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04a29c40-1783-4914-8226-a15496232c18_1081x920.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 
11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Demonstration of cyclic references. In the test2() function we create a cycle between n1 and n2 and see that they are left alive even after test2 returns.</figcaption></figure></div><p>If we run this program, we should see the following output:</p><pre><code>&#10140; uv run --python 3.13 --  cycles.py
n1 refcount: 2
n2 refcount: 2
n1 is being deleted
n2 is being deleted
No MyNode objects found
---------------------
n1 refcount: 3
n2 refcount: 3
n1 exists with referrers: ['n2']
n2 exists with referrers: ['n1']
n1 is being deleted
n2 is being deleted
</code></pre><p>Let&#8217;s focus on the output after the call to <code>test2()</code>.</p><ul><li><p>We see that in <code>test2()</code>, the reference count for <code>n1</code> and <code>n2</code> is 3, one higher than it was in <code>test1()</code>. This is due to <code>n1.next</code> creating a reference to <code>n2</code> and <code>n2.next</code> creating a reference to <code>n1</code>.</p></li><li><p>We also see that when <code>test2()</code> returns, the <code>__del__</code> method of <code>n1</code> and <code>n2</code> is not called, which means that those objects were not deallocated and are still alive. This happens because, during the return, the interpreter decrements their reference counts, but this time the counts do not reach 0.</p></li><li><p>After returning from <code>test2()</code>, when we call <code>print_node_objects()</code>, it tells us that the <code>MyNode</code> objects we created for <code>n1</code> and <code>n2</code> are still alive. We can also see that they are alive because they hold cyclic references to each other.</p></li><li><p><code>n1</code> and <code>n2</code> finally get destroyed as the program ends, because the CPython interpreter runs the GC before shutting down.</p></li></ul><p>To prevent such cyclic references from leaking memory, CPython includes a garbage collector that periodically runs, detects cycles that are no longer reachable from anywhere else, and breaks them so that the objects in the cycle can get deallocated. 
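<p>Forcing a collection makes this easy to see. Here is a minimal, self-contained sketch (illustrative names, not the article&#8217;s exact script):</p>

```python
import gc


class MyNode:
    def __init__(self, name):
        self.name = name
        self.next = None

    def __del__(self):
        print(f"{self.name} is being deleted")


def make_cycle():
    n1, n2 = MyNode("n1"), MyNode("n2")
    n1.next, n2.next = n2, n1  # cyclic references


make_cycle()
# The cycle keeps both nodes alive after the function returns...
print(any(type(o) is MyNode for o in gc.get_objects()))  # True

# ...until the cycle collector finds the unreachable cycle and breaks it,
# at which point both __del__ methods run:
gc.collect()
print(any(type(o) is MyNode for o in gc.get_objects()))  # False
```

<p>Since Python 3.4, even cycles whose objects define <code>__del__</code> are collected this way.</p>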
You can verify it yourself by inserting a <code>gc.collect()</code> call after the call to <code>test2()</code> in the above program.</p><p><em>If you want to understand how the CPython garbage collector detects and breaks cycles, read my article on its internals:</em></p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;1fbc716c-7adb-4138-b552-ca1b125dab36&quot;,&quot;caption&quot;:&quot;We&#8217;ve been talking about CPython internals and in the last article I went quite deep in CPython&#8217;s runtime. One of the crucial services that the runtime provides is that of managing a program&#8217;s memory during execution.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;CPython Garbage Collection: The Internal Mechanics and Algorithms&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. 
I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2024-06-11T13:23:25.751Z&quot;,&quot;cover_image&quot;:&quot;https://images.unsplash.com/photo-1503596476-1c12a8ba09a9?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHxnYXJiYWdlJTIwY29sbGVjdGlvbnxlbnwwfHx8fDE3MTgxMDU0OTh8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/cpython-garbage-collection-internals&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:144615668,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:41,&quot;comment_count&quot;:1,&quot;publication_id&quot;:1611829,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!lstI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p></p><p>However, there are other ways to avoid such pitfalls of reference counting, and weak references are one of them. Let&#8217;s understand what they are and how they work.</p><div><hr></div><h2>Understanding Weak References</h2><p>Weak references are at the opposite end of the spectrum from strong references. 
A weak reference does not increase the reference count of the underlying object, so it enables you to use an object without prolonging its lifetime.</p><p>When the object&#8217;s reference count goes to 0, it can get deallocated even if there are weak references to it that are still in use. Naturally, this means that when using a weak reference, we always need to check whether the underlying object is still alive.</p><p>In Python, to create a weak reference, we use the <code>weakref.ref()</code> function from the <a href="https://docs.python.org/3/library/weakref.html">weakref module</a> and pass it the object for which we want to create a weak reference. For example:</p><pre><code>n1_weakref = weakref.ref(n1)</code></pre><p><code>weakref.ref()</code> creates a weak reference to the given object and returns a callable. To access the underlying object, we need to invoke this callable every time. If the object is still alive, it returns the object; otherwise it returns <code>None</code>. For example:</p><pre><code>if n1_weakref():
  print(f"name: {n1_weakref().name}")
else:
  print("n1 no longer exists")</code></pre><p>The following figure shows a full example of creating a weak reference and accessing it in our running linked list example.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CW5b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb53de04e-14c5-4829-bfd0-96b427ce2113_1412x1109.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CW5b!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb53de04e-14c5-4829-bfd0-96b427ce2113_1412x1109.png 424w, https://substackcdn.com/image/fetch/$s_!CW5b!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb53de04e-14c5-4829-bfd0-96b427ce2113_1412x1109.png 848w, https://substackcdn.com/image/fetch/$s_!CW5b!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb53de04e-14c5-4829-bfd0-96b427ce2113_1412x1109.png 1272w, https://substackcdn.com/image/fetch/$s_!CW5b!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb53de04e-14c5-4829-bfd0-96b427ce2113_1412x1109.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CW5b!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb53de04e-14c5-4829-bfd0-96b427ce2113_1412x1109.png" width="1200" height="942.4929178470255" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b53de04e-14c5-4829-bfd0-96b427ce2113_1412x1109.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1109,&quot;width&quot;:1412,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:209868,&quot;alt&quot;:&quot;#!/usr/bin/env python  import gc import sys import weakref  class MyNode:     def __init__(self, name: str):         self.name: str = name         self.next = None      def __del__(self):         print(f\&quot;{self.name} is being deleted\&quot;)  def print_node_objects():     obj_count = 0     for o in gc.get_objects():         if type(o) is MyNode:             obj_count += 1             print(f\&quot;{o.name} exists with referrers: {[n.name for n in gc.get_referrers(o) if type(n) is MyNode]}\&quot;)     if obj_count == 0:         print(\&quot;No MyNode objects found\&quot;)  def weakref_demo():     n1 = MyNode(\&quot;n1\&quot;)     print(f\&quot;n1 refcount: {sys.getrefcount(n1)}\&quot;)     # use the ref function to create a weak reference to n1     # it gives us a callable that when called will try     # to access the underying object     n1_weakref = weakref.ref(n1)       # notice how n1's reference count remains unchanged     print(f\&quot;n1 refcount: {sys.getrefcount(n1)}\&quot;)      # to access n1 using weakref we need to call it     if n1_weakref():         print(f\&quot;n1's name: {n1_weakref().name}\&quot;)      # let's delete n1 and see if weakref still works     del n1     if n1_weakref():         print(f\&quot;n1's name: {n1_weakref()}\&quot;)     else:         print(\&quot;n1 no longer exists\&quot;)     if __name__ == '__main__':     weakref_demo()     print(\&quot;---------------------\&quot;)     
print_node_objects()&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/172320360?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb53de04e-14c5-4829-bfd0-96b427ce2113_1412x1109.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="#!/usr/bin/env python  import gc import sys import weakref  class MyNode:     def __init__(self, name: str):         self.name: str = name         self.next = None      def __del__(self):         print(f&quot;{self.name} is being deleted&quot;)  def print_node_objects():     obj_count = 0     for o in gc.get_objects():         if type(o) is MyNode:             obj_count += 1             print(f&quot;{o.name} exists with referrers: {[n.name for n in gc.get_referrers(o) if type(n) is MyNode]}&quot;)     if obj_count == 0:         print(&quot;No MyNode objects found&quot;)  def weakref_demo():     n1 = MyNode(&quot;n1&quot;)     print(f&quot;n1 refcount: {sys.getrefcount(n1)}&quot;)     # use the ref function to create a weak reference to n1     # it gives us a callable that when called will try     # to access the underying object     n1_weakref = weakref.ref(n1)       # notice how n1's reference count remains unchanged     print(f&quot;n1 refcount: {sys.getrefcount(n1)}&quot;)      # to access n1 using weakref we need to call it     if n1_weakref():         print(f&quot;n1's name: {n1_weakref().name}&quot;)      # let's delete n1 and see if weakref still works     del n1     if n1_weakref():         print(f&quot;n1's name: {n1_weakref()}&quot;)     else:         print(&quot;n1 no longer exists&quot;)     if __name__ == '__main__':     weakref_demo()     print(&quot;---------------------&quot;)     print_node_objects()" title="#!/usr/bin/env python  import gc import sys 
import weakref  class MyNode:     def __init__(self, name: str):         self.name: str = name         self.next = None      def __del__(self):         print(f&quot;{self.name} is being deleted&quot;)  def print_node_objects():     obj_count = 0     for o in gc.get_objects():         if type(o) is MyNode:             obj_count += 1             print(f&quot;{o.name} exists with referrers: {[n.name for n in gc.get_referrers(o) if type(n) is MyNode]}&quot;)     if obj_count == 0:         print(&quot;No MyNode objects found&quot;)  def weakref_demo():     n1 = MyNode(&quot;n1&quot;)     print(f&quot;n1 refcount: {sys.getrefcount(n1)}&quot;)     # use the ref function to create a weak reference to n1     # it gives us a callable that when called will try     # to access the underying object     n1_weakref = weakref.ref(n1)       # notice how n1's reference count remains unchanged     print(f&quot;n1 refcount: {sys.getrefcount(n1)}&quot;)      # to access n1 using weakref we need to call it     if n1_weakref():         print(f&quot;n1's name: {n1_weakref().name}&quot;)      # let's delete n1 and see if weakref still works     del n1     if n1_weakref():         print(f&quot;n1's name: {n1_weakref()}&quot;)     else:         print(&quot;n1 no longer exists&quot;)     if __name__ == '__main__':     weakref_demo()     print(&quot;---------------------&quot;)     print_node_objects()" srcset="https://substackcdn.com/image/fetch/$s_!CW5b!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb53de04e-14c5-4829-bfd0-96b427ce2113_1412x1109.png 424w, https://substackcdn.com/image/fetch/$s_!CW5b!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb53de04e-14c5-4829-bfd0-96b427ce2113_1412x1109.png 848w, 
https://substackcdn.com/image/fetch/$s_!CW5b!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb53de04e-14c5-4829-bfd0-96b427ce2113_1412x1109.png 1272w, https://substackcdn.com/image/fetch/$s_!CW5b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb53de04e-14c5-4829-bfd0-96b427ce2113_1412x1109.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Demonstration of creating weak reference and using 
it</figcaption></figure></div><p><strong>Output:</strong></p><pre><code>&#10140; uv run --python 3.13 --  weakref_cycles.py
n1 refcount: 2
n1 refcount: 2
n1's name: n1
n1 is being deleted
n1 no longer exists
---------------------
No MyNode objects found
</code></pre><p>From the output we can confirm a few things:</p><ul><li><p>Creating a weak reference does not increase the object&#8217;s reference count.</p></li><li><p>A weak reference does not prevent the object from being deallocated if its reference count goes to 0 (in the example, after we deleted <code>n1</code> we could no longer access it through the weak reference).</p></li></ul><p>I leave the problem of fixing the cyclic reference that we created in <code>test2()</code> as an exercise for you.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.codingconfessions.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"><em>Writing these deep dives takes 100+ hours of work. If you find this valuable and insightful, please consider upgrading to a paid subscription to keep this work alive.</em></p></div></div></div><div><hr></div><h2>Other Use Cases of Weak References</h2><p>So far we&#8217;ve seen weak references as a tool for avoiding cycles, but their utility goes well beyond that. The <code>weakref</code> module also provides ready-made containers built on top of weak references. These containers, <code>WeakValueDictionary</code> and <code>WeakSet</code>, help you manage auxiliary data structures that should not extend the lifetimes of their contents. 
They solve practical problems such as caching, registries, and subscriber lists, where automatic cleanup is not just convenient but essential for avoiding leaks.</p><h3>WeakValueDictionary</h3><p>The <code>weakref</code> module provides <code>WeakValueDictionary</code>, which looks and behaves like a normal dictionary but with an important twist: its values are held only through weak references. If a value is no longer strongly referenced anywhere else, the dictionary entry disappears automatically.</p><p>This makes <code>WeakValueDictionary</code> a natural fit for <strong>caching and memoization</strong>. Imagine you compute expensive results or load large data structures and want to reuse them if they are still in memory. At the same time, you don&#8217;t want the cache itself to keep them alive forever. A <code>WeakValueDictionary</code> strikes that balance: it holds onto results <em>only as long as the rest of the program does</em>.</p><p>Another classic application is <strong>object interning</strong> or registries. For example, you may want to ensure there is only one canonical object representing a resource (like a symbol table entry, database connection, or parsed schema). By using a <code>WeakValueDictionary</code>, you avoid artificially extending the lifetimes of those resources.</p><p>Here&#8217;s a simple illustration:</p><pre><code>import weakref

class Data:
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return f"Data({self.name})"

cache = weakref.WeakValueDictionary()
obj = Data("expensive_result")
cache["key"] = obj

print("Before deletion:", dict(cache))

# Drop the strong reference
obj = None
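# Assumption (CPython-specific): obj is not part of a reference cycle, so
# refcounting frees it the moment the strong reference above is dropped, and
# the weakref callback removes the cache entry immediately -- no gc.collect()
# is needed here:
assert "key" not in cache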

print("After deletion:", dict(cache))</code></pre><p><strong>Output:</strong></p><pre><code>Before deletion: {'key': Data(expensive_result)}
After deletion: {}</code></pre><p>Notice how the cache entry vanishes automatically once the last strong reference goes away. There is no need for manual cleanup. Under the hood, this is implemented with weakref callbacks&#8212;the same mechanism we&#8217;ll see in the callback section.</p><h3>WeakSet</h3><p>Another container provided by the <code>weakref</code> module is <code>WeakSet</code>. This is similar to a regular <code>set</code>, except that it holds weak references to its elements. If an object is garbage collected, it will automatically vanish from the set.</p><p>One scenario where this is very handy is when you want to keep track of <em>subscribers</em>, <em>observers</em>, or <em>listeners</em>. These are objects that register interest in events produced by another object (often called the <em>publisher</em>). For instance:</p><ul><li><p><strong>GUI frameworks</strong>: widgets listen to events such as theme changes or window resizes.</p></li><li><p><strong>Event buses</strong>: services subscribe to log events, metrics, or domain events.</p></li><li><p><strong>Plugin systems</strong>: plugins register callbacks at load time to respond to hooks.</p></li><li><p><strong>Background services</strong>: transient sessions (e.g., WebSocket connections) listen for updates from a long&#8209;lived manager.</p></li></ul><p>In all these cases, subscribers are often short&#8209;lived, while the publisher lives much longer. Using a regular <code>set</code> to hold them risks memory leaks, because a strong reference in the set will keep the subscriber alive even when the rest of the program has forgotten it. With a <code>WeakSet</code>, the garbage collector automatically removes subscribers that are no longer strongly referenced anywhere else, so you don&#8217;t need explicit unsubscribe logic in every shutdown path.</p><p>Here&#8217;s a simple example:</p><pre><code>import weakref

class Listener:
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return f"Listener({self.name})"

listeners = weakref.WeakSet()

l1 = Listener("A")
l2 = Listener("B")
listeners.add(l1)
listeners.add(l2)
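# While both listeners are still strongly referenced, the WeakSet supports
# the usual set operations (membership tests, len(), iteration):
assert l1 in listeners and len(listeners) == 2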

print("Before deletion:", list(listeners))

# Remove one listener
l1 = None
import gc; gc.collect()

print("After deletion:", list(listeners))</code></pre><p><strong>Output:</strong></p><pre><code>Before deletion: [Listener(A), Listener(B)]
After deletion: [Listener(B)]</code></pre><p>This pattern is often extended into a publisher&#8211;subscriber model:</p><pre><code>class Publisher:
    def __init__(self):
        self._subs = weakref.WeakSet()
    def subscribe(self, sub):
        self._subs.add(sub)
    def notify(self, payload):
        for s in list(self._subs):
            s.handle(payload)

class Subscriber:
    def __init__(self, name):
        self.name = name
    def handle(self, payload):
        print(self.name, "got:", payload)

pub = Publisher()
sub = Subscriber("one")
pub.subscribe(sub)

pub.notify({"event": 1})  # delivered
sub = None                  # drop last strong ref
import gc; gc.collect()
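# The __slots__ caveat in practice: a slotted class must list "__weakref__"
# explicitly, otherwise WeakSet.add() raises TypeError because no weak
# reference can be created. SlottedSubscriber is a hypothetical class added
# for illustration; its handler is silent so the demo output is unchanged.
class SlottedSubscriber:
    __slots__ = ("name", "__weakref__")
    def __init__(self, name):
        self.name = name
    def handle(self, payload):
        pass

slotted = SlottedSubscriber("slotted")
pub.subscribe(slotted)  # accepted because "__weakref__" is in __slots__
slotted = None          # dropped from the WeakSet once the last strong ref goes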

pub.notify({"event": 2})  # nothing printed; WeakSet cleaned itself</code></pre><p>Using <code>WeakSet</code> here avoids leaks and simplifies lifecycle management. A caveat is that only weak&#8209;referenceable objects (i.e., user&#8209;defined classes) can be added; built&#8209;ins like <code>int</code> or <code>tuple</code> won&#8217;t work. If your class uses <code>__slots__</code>, include <code>__weakref__</code> to allow weak references.</p><h3>Callbacks on Weak References</h3><p>Another useful feature of <code>weakref.ref</code> is the ability to attach a <strong>callback</strong>. A callback is a function that gets invoked automatically when the referent object is about to be finalized. This can be handy if you want to clean up auxiliary data structures or release resources when an object goes away.</p><pre><code>import weakref

class Resource:
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return f"Resource({self.name})"

def on_finalize(wr):
    print("Resource has been garbage collected:", wr)

obj = Resource("temp")
wr = weakref.ref(obj, on_finalize)
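# While the referent is still alive, calling the weak reference returns it
# (and the callback has not fired yet):
assert wr() is obj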

print("Created weak reference:", wr)

# Drop strong reference
obj = None

# Force GC for demo purposes
import gc; gc.collect()</code></pre><p><strong>Output:</strong></p><pre><code>Created weak reference: &lt;weakref at 0x75f6773870b0; to &#39;Resource&#39; at 0x75f677c4ee40&gt;
Resource has been garbage collected: &lt;weakref at 0x75f6773870b0; dead&gt;
</code></pre><p>Here, the <code>on_finalize</code> callback is called once the <code>Resource</code> instance is about to be collected. The weak reference itself becomes dead afterwards. This pattern is useful when you want to implement custom cleanup logic tied to an object&#8217;s lifecycle.</p><p>It&#8217;s also worth noting that containers like <code>WeakValueDictionary</code> and <code>WeakSet</code> use this same mechanism internally: they attach callbacks to their weak references so that entries are automatically removed when the referent objects are finalized.</p><h2>Conclusion</h2><p>Weak references are not a tool you&#8217;ll reach for every day, but when you need them they solve very real problems. At the lowest level, <code>weakref.ref</code> lets you point to an object without affecting its lifetime, and you can even attach a callback to run cleanup code at the moment it is collected. Building on that primitive, Python&#8217;s <code>WeakValueDictionary</code> and <code>WeakSet</code> give you higher level containers for caches, registries, and subscriber lists that automatically clean themselves up when their contents go away.</p><p>To summarize the differences:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jaFp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fda1c6b-43e6-4308-bd64-95fb9f503848_1367x212.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jaFp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fda1c6b-43e6-4308-bd64-95fb9f503848_1367x212.png 424w, 
https://substackcdn.com/image/fetch/$s_!jaFp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fda1c6b-43e6-4308-bd64-95fb9f503848_1367x212.png 848w, https://substackcdn.com/image/fetch/$s_!jaFp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fda1c6b-43e6-4308-bd64-95fb9f503848_1367x212.png 1272w, https://substackcdn.com/image/fetch/$s_!jaFp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fda1c6b-43e6-4308-bd64-95fb9f503848_1367x212.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jaFp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fda1c6b-43e6-4308-bd64-95fb9f503848_1367x212.png" width="1367" height="212" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8fda1c6b-43e6-4308-bd64-95fb9f503848_1367x212.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:212,&quot;width&quot;:1367,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43498,&quot;alt&quot;:&quot;A summary of the key APIs from the weakref module in Python and in which situation to use them&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/172320360?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fda1c6b-43e6-4308-bd64-95fb9f503848_1367x212.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A summary of the key APIs from the weakref module in Python and in which situation to use them" title="A summary of the 
key APIs from the weakref module in Python and in which situation to use them" srcset="https://substackcdn.com/image/fetch/$s_!jaFp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fda1c6b-43e6-4308-bd64-95fb9f503848_1367x212.png 424w, https://substackcdn.com/image/fetch/$s_!jaFp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fda1c6b-43e6-4308-bd64-95fb9f503848_1367x212.png 848w, https://substackcdn.com/image/fetch/$s_!jaFp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fda1c6b-43e6-4308-bd64-95fb9f503848_1367x212.png 1272w, https://substackcdn.com/image/fetch/$s_!jaFp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fda1c6b-43e6-4308-bd64-95fb9f503848_1367x212.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">A summary of the key APIs from the weakref module in Python and in which situation to use them</figcaption></figure></div><p>Together, these features make it possible to build memory&#8209;friendly systems that avoid leaks, reduce bookkeeping, and respect the natural lifetimes of your objects. 
Understanding weak references and knowing when to apply them will help you write code that is both safer and more efficient.</p><h2>Further Reading</h2><ul><li><p><a href="https://docs.python.org/3/library/weakref.html">Python documentation on </a><code>weakref</code></p></li><li><p><a href="https://docs.python.org/3/library/gc.html">Python garbage collector documentation</a></p></li><li><p>&#8220;Fluent Python&#8221; by Luciano Ramalho &#8211; includes in depth coverage of weak references and how to use them</p></li></ul><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.codingconfessions.com/p/a-strong-reference-to-weak-references?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.codingconfessions.com/p/a-strong-reference-to-weak-references?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.codingconfessions.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Confessions of a Code Addict is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Compiling Python to Run Anywhere]]></title><description><![CDATA[A guest post on building a Python compiler that generates optimized kernels while preserving the language&#8217;s simplicity.]]></description><link>https://blog.codingconfessions.com/p/compiling-python-to-run-anywhere</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/compiling-python-to-run-anywhere</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Tue, 23 Sep 2025 17:29:38 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8897f5b3-cb0b-488f-865d-33b682e1c282_1536x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Foreword</h2><p>A recurring theme of this newsletter is going under the hood: how interpreters, compilers, and runtimes actually work, and what performance trade&#8209;offs they force on us. 
Python is a perfect case study: it&#8217;s beloved for its simplicity, but that same simplicity often means poor performance when the workloads get serious.</p><p>That&#8217;s why I&#8217;m really excited to share this guest post by <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Yusuf Olokoba&quot;,&quot;id&quot;:103669569,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!8ju6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24cbc42d-ea2e-4535-af53-ade33bca9fbb_1024x1024.png&quot;,&quot;uuid&quot;:&quot;ff103787-1689-481f-9f11-4180909a0eff&quot;}" data-component-name="MentionToDOM"></span>, founder of <a href="https://muna.ai">Muna</a>. In this piece, he looks at how Python could be pushed beyond its usual limits of speed and portability, laying out a compiler that turns ordinary code into fast, portable executables.</p><p>Instead of building another JIT or rewriting everything in C++, his approach generates optimized kernels while keeping the Python source unchanged. This ties directly to themes I&#8217;ve written about before, such as CPython internals, and performance engineering. All of those pieces showed why understanding systems at the lowest level matters. Yusuf&#8217;s work demonstrates the payoff of that mindset: the ability to design and build new systems on top of that knowledge.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.codingconfessions.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Confessions of a Code Addict is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>Introduction</h2><p>I first met Abhinav at the start of 2024, trying to learn more about how the Python interpreter worked under the hood. I had reached out to Abhinav with a singular goal in mind: to build something that could compile pristine Python code into cross-platform machine code.</p><p>This idea has been attempted in many forms before: runtimes (<a href="https://www.jython.org/">Jython</a>, <a href="https://github.com/RustPython/RustPython">RustPython</a>), DSLs (<a href="https://numba.pydata.org/">Numba</a>, <a href="https://pytorch.org/">PyTorch</a>), and even entirely new programming languages (<a href="https://www.modular.com/mojo">Mojo</a>). But for reasons we will explore later in this article, we needed something that could:</p><ol><li><p>Compile Python entirely ahead-of-time, with no modifications.</p></li><li><p>Run without a Python interpreter, or any other interpreter.</p></li><li><p>Run with minimal overhead compared to a raw C or C++ program.</p></li><li><p>Most importantly, run anywhere&#8212;server, desktop, mobile, and web.</p></li></ol><p>In this article, I will walk through how this seemingly crazy idea came about, how we began building a solution, how AI happened to be the missing piece, and how we&#8217;ve grown to serve thousands of unique devices each month with these compiled Python functions.</p><div><hr></div><h2><strong>Containers Are the Wrong Way to Distribute AI</strong></h2><p>I got my start in AI research around 2018, back when we called it &#8220;deep learning&#8221;. 
I had taken a year off from college and was coming off my first startup experience as co-founder of a venture-backed proptech startup that would later get acquired. One very interesting problem I had encountered in this journey was image editing for residential listings. Each month, a real estate photographer would outsource thousands of photos of homes to be hand-edited in Photoshop and Lightroom, before being posted on the regional MLS or on Zillow.</p><p>I teamed up with an old friend and we set out to build a fully automated image editor, using a new class of vision AI models called Generative Adversarial Networks (GANs). We would train our custom model architectures on our datasets, then test rigorously to ensure that the models worked correctly. But when it came time to get these AI models into the hands of our design partners, we simply got stuck. I spent the majority of my time trying to get our models into something we could distribute very easily. But after months of wrangling with Dockerfiles and third-party services, it became crystal clear to me: <strong>containers are the wrong unit of distribution for AI workloads</strong>.</p><p>To understand why, we need to look into the container. <a href="https://www.redhat.com/en/topics/containers/whats-a-linux-container">Containers</a> are simply self-contained Linux filesystems with runtime isolation and resource management. 
So when deploying our AI model as a container, we would package up the inference code, the model weights, all the Python package dependencies, the Python interpreter itself, and other required software into what was effectively a snapshot of a full Linux operating system.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GnTf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb880164d-5b82-42d1-a43f-28f091c21275_3937x2835.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GnTf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb880164d-5b82-42d1-a43f-28f091c21275_3937x2835.png 424w, https://substackcdn.com/image/fetch/$s_!GnTf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb880164d-5b82-42d1-a43f-28f091c21275_3937x2835.png 848w, https://substackcdn.com/image/fetch/$s_!GnTf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb880164d-5b82-42d1-a43f-28f091c21275_3937x2835.png 1272w, https://substackcdn.com/image/fetch/$s_!GnTf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb880164d-5b82-42d1-a43f-28f091c21275_3937x2835.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GnTf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb880164d-5b82-42d1-a43f-28f091c21275_3937x2835.png" width="1456" height="1048" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b880164d-5b82-42d1-a43f-28f091c21275_3937x2835.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4272357,&quot;alt&quot;:&quot;Distributing AI inference as self-contained executables as opposed to doing so as a container.&quot;,&quot;title&quot;:&quot;Distributing AI inference as self-contained executables as opposed to doing so as a container.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.muna.ai/i/171980795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb880164d-5b82-42d1-a43f-28f091c21275_3937x2835.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Distributing AI inference as self-contained executables as opposed to doing so as a container." title="Distributing AI inference as self-contained executables as opposed to doing so as a container." 
srcset="https://substackcdn.com/image/fetch/$s_!GnTf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb880164d-5b82-42d1-a43f-28f091c21275_3937x2835.png 424w, https://substackcdn.com/image/fetch/$s_!GnTf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb880164d-5b82-42d1-a43f-28f091c21275_3937x2835.png 848w, https://substackcdn.com/image/fetch/$s_!GnTf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb880164d-5b82-42d1-a43f-28f091c21275_3937x2835.png 1272w, https://substackcdn.com/image/fetch/$s_!GnTf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb880164d-5b82-42d1-a43f-28f091c21275_3937x2835.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">AI is better distributed in self-contained executables as opposed to containers.</figcaption></figure></div><p>But what if instead of making a self-contained operating system, we made a self-contained executable that ran only our AI model and nothing else? The benefits here would be significant: We could ship much smaller containers that started up much faster, because we wouldn&#8217;t have to include unnecessary Python packages, the Python interpreter itself, or any of the other unnecessary cruft that gets bundled into the container. But even more importantly, not only could we run these executables on our Linux servers&#8212;<em>we could run them anywhere</em>.</p><div><hr></div><h2><strong>Arm64, Apple, and Unity: How It All Began</strong></h2><p>I started programming at the age of eleven, thanks to my dad who vehemently refused to buy me a PlayStation 2 out of fear that my grades would drop. Out of an extreme stubbornness, inherited from him and my mom, I had decided that if he was not going to buy me a game console, then I would simply build the games myself<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. I was lucky enough to find a game engine that was intuitive, allowed developers to build once and deploy everywhere, and most importantly, was free to use: Unity Engine.</p><p>In late 2013 Apple debuted the iPhone 5S, its first device featuring the relatively new <code>armv8-a</code> instruction set architecture. Unlike prior devices, this was a 64-bit architecture running on ARM. 
With it, apps could address much more memory, and benefit from a myriad of performance gains. As such, Apple quickly mandated that all new apps be compiled for <code>arm64</code>.</p><p>Unity, with its massive developer ecosystem, was thrown into a tailspin. To understand why, we need some context on how Unity works: Because Unity is a game engine, objects within the game can be scripted to have custom behaviors. C# was Unity&#8217;s chosen scripting language for these behaviors. But C# does not compile to object code, so it needs a virtual machine to execute at runtime (sound familiar?). Unity used <a href="https://www.mono-project.com/">Mono</a> for this purpose, but Mono did not support <code>arm64</code>.</p><p>Unity embarked on a journey to build what I still consider to be its greatest engineering feat: <a href="https://unity.com/blog/engine-platform/an-introduction-to-ilcpp-internals">IL2CPP</a>. As its name implies, the IL2CPP compiler would take in Common Intermediate Language bytecode (i.e. the intermediate representation generated by the C# compiler); then emit equivalent C++ source code. 
Once you had C++ source code, you could compile that code to run just about anywhere: from Nvidia GPUs and WebAssembly; to Apple Silicon and everything in-between.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-aNF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc09e1ee0-a580-44ee-bf21-00caae643741_800x300.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-aNF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc09e1ee0-a580-44ee-bf21-00caae643741_800x300.png 424w, https://substackcdn.com/image/fetch/$s_!-aNF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc09e1ee0-a580-44ee-bf21-00caae643741_800x300.png 848w, https://substackcdn.com/image/fetch/$s_!-aNF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc09e1ee0-a580-44ee-bf21-00caae643741_800x300.png 1272w, https://substackcdn.com/image/fetch/$s_!-aNF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc09e1ee0-a580-44ee-bf21-00caae643741_800x300.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-aNF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc09e1ee0-a580-44ee-bf21-00caae643741_800x300.png" width="800" height="300" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c09e1ee0-a580-44ee-bf21-00caae643741_800x300.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:300,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Unity's IL2CPP compiler converts C# code into C++ code. The C++ code can then be compiled to run anywhere.&quot;,&quot;title&quot;:&quot;Unity's IL2CPP compiler converts C# code into C++ code. The C++ code can then be compiled to run anywhere.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Unity's IL2CPP compiler converts C# code into C++ code. The C++ code can then be compiled to run anywhere." title="Unity's IL2CPP compiler converts C# code into C++ code. The C++ code can then be compiled to run anywhere." 
srcset="https://substackcdn.com/image/fetch/$s_!-aNF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc09e1ee0-a580-44ee-bf21-00caae643741_800x300.png 424w, https://substackcdn.com/image/fetch/$s_!-aNF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc09e1ee0-a580-44ee-bf21-00caae643741_800x300.png 848w, https://substackcdn.com/image/fetch/$s_!-aNF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc09e1ee0-a580-44ee-bf21-00caae643741_800x300.png 1272w, https://substackcdn.com/image/fetch/$s_!-aNF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc09e1ee0-a580-44ee-bf21-00caae643741_800x300.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">How the IL2CPP compiler allows Unity to run anywhere. Source: <a href="https://unity.com/blog/engine-platform/an-introduction-to-ilcpp-internals">Unity.</a></figcaption></figure></div><p>We set out to build the exact same, for Python.</p><div><hr></div><h2><strong>Sketching Out a Python Compiler</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-Plg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ad8998-2fcf-4ccd-a11a-abda50a6b3fa_1427x959.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-Plg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ad8998-2fcf-4ccd-a11a-abda50a6b3fa_1427x959.png 424w, https://substackcdn.com/image/fetch/$s_!-Plg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ad8998-2fcf-4ccd-a11a-abda50a6b3fa_1427x959.png 848w, https://substackcdn.com/image/fetch/$s_!-Plg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ad8998-2fcf-4ccd-a11a-abda50a6b3fa_1427x959.png 1272w, https://substackcdn.com/image/fetch/$s_!-Plg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ad8998-2fcf-4ccd-a11a-abda50a6b3fa_1427x959.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-Plg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ad8998-2fcf-4ccd-a11a-abda50a6b3fa_1427x959.png" width="1427" height="959" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/18ad8998-2fcf-4ccd-a11a-abda50a6b3fa_1427x959.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:959,&quot;width&quot;:1427,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:107996,&quot;alt&quot;:&quot;Python compiler in three steps: tracing, lowering, and compiling.&quot;,&quot;title&quot;:&quot;Python compiler in three steps: tracing, lowering, and compiling.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.muna.ai/i/171980795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ad8998-2fcf-4ccd-a11a-abda50a6b3fa_1427x959.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Python compiler in three steps: tracing, lowering, and compiling." title="Python compiler in three steps: tracing, lowering, and compiling." 
srcset="https://substackcdn.com/image/fetch/$s_!-Plg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ad8998-2fcf-4ccd-a11a-abda50a6b3fa_1427x959.png 424w, https://substackcdn.com/image/fetch/$s_!-Plg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ad8998-2fcf-4ccd-a11a-abda50a6b3fa_1427x959.png 848w, https://substackcdn.com/image/fetch/$s_!-Plg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ad8998-2fcf-4ccd-a11a-abda50a6b3fa_1427x959.png 1272w, https://substackcdn.com/image/fetch/$s_!-Plg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ad8998-2fcf-4ccd-a11a-abda50a6b3fa_1427x959.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Sketching out a Python compiler.</figcaption></figure></div><p>At a high level, the compiler would:</p><ol><li><p>Ingest plain Python code, with no modifications.</p></li><li><p>Trace it to generate an intermediate representation (IR) graph.</p></li><li><p>Lower the IR to C++ source code.</p></li><li><p>Compile the C++ source code to run across different platforms and architectures.</p></li></ol><p>Before jumping in, you might be wondering: why bother generating C++ first? Why not just go from IR to object code?</p><p>Going back to why we started on this journey, our main focus with Muna has been on compute-intensive applications, especially AI inference. If you&#8217;ve spent time in this space, you&#8217;re familiar with technologies like CUDA, MLX, TensorRT, and so on. But there are so many more frameworks, libraries, and even <a href="https://github.com/corsix/amx">undocumented ISAs</a> that applications can leverage to accelerate everything from matrix multiplication to computer vision.</p><p>We wanted to design a system that would let us leverage as many ways of performing a given computation as are available on the target hardware. We&#8217;ll show you how we achieved this, and how this design gives us a novel, data-driven approach to performance optimization.</p><div><hr></div><h2><strong>Building a Symbolic Tracer for Python</strong></h2><p>The first step in building our compiler is to build a symbolic tracer. 
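</p><p>To make that concrete, here is a stdlib-only sketch of the front half of such a tracer: parse a function&#8217;s source into an AST and flatten its operations into IR-like nodes. The <code>trace</code> helper and the opcode names below are purely illustrative, not Muna&#8217;s actual IR.</p>

```python
import ast

# Source of a toy function to trace, kept as a string so the sketch is
# self-contained.
SOURCE = """
def compute_area(width, height):
    return width * height
"""

def trace(source):
    """Parse Python source and emit one IR-like node per binary operation."""
    tree = ast.parse(source)
    ops = []
    for node in ast.walk(tree):
        if isinstance(node, ast.BinOp):
            # The AST operator's class name doubles as our toy opcode.
            ops.append(type(node.op).__name__)
    return ops

print(trace(SOURCE))  # ['Mult']
```

<p>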
The tracer&#8217;s job is to take in a Python function and emit an intermediate representation (IR) graph that fully captures control flow through the function.</p><p>Our very first prototypes were built upon the <a href="https://docs.pytorch.org/docs/stable/fx.html">PyTorch FX</a> symbolic tracer, introduced in PyTorch 2.0. Their symbolic tracer was built on <a href="https://peps.python.org/pep-0523/">PEP 523</a>, a CPython feature that lets developers override, from C, how bytecode frames are evaluated by the interpreter. I won&#8217;t go into too much detail here, as it is a marvel of engineering in its own right, but in summary, PEP 523 enabled the PyTorch team to <a href="https://dev-discuss.pytorch.org/t/supporting-dynamo-in-python-3-12/2320">register a hook</a> that could record every single function call as it was being evaluated by the interpreter:</p><div class="embedded-post-wrap" data-attrs="{&quot;id&quot;:143567425,&quot;url&quot;:&quot;https://blog.codingconfessions.com/p/cpython-vm-internals&quot;,&quot;publication_id&quot;:1611829,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!lstI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;title&quot;:&quot;The Design &amp; Implementation of the CPython Virtual Machine&quot;,&quot;truncated_body_text&quot;:&quot;For every bytecode compiled language, the most interesting part of its implementation is its virtual machine (also referred to as the bytecode interpreter) where the bytecode execution takes place. Because this is such a crucial part of the language machinery, its implementation has to be highly performant. 
Even if you are not a compiler engineer, learning about such internal implementation can give you new performance tricks and insights that you may be able to use in other places of your job. And, if you are a compiler engineer then you should always look around how other languages are implemented to pickup implementation details that you may not be aware of.&quot;,&quot;date&quot;:&quot;2024-08-31T14:35:14.115Z&quot;,&quot;like_count&quot;:46,&quot;comment_count&quot;:0,&quot;bylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;handle&quot;:&quot;abhinavupadhyay&quot;,&quot;previous_name&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;profile_set_up_at&quot;:&quot;2022-11-21T06:38:20.718Z&quot;,&quot;reader_installed_at&quot;:&quot;2023-04-13T10:28:54.373Z&quot;,&quot;publicationUsers&quot;:[{&quot;id&quot;:1583741,&quot;user_id&quot;:14520974,&quot;publication_id&quot;:1611829,&quot;role&quot;:&quot;admin&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:true,&quot;publication&quot;:{&quot;id&quot;:1611829,&quot;name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;subdomain&quot;:&quot;codeconfessions&quot;,&quot;custom_domain&quot;:&quot;blog.codingconfessions.com&quot;,&quot;custom_domain_optional&quot;:false,&quot;hero_text&quot;:&quot;Deep dives into compilers, performance optimization, Linux internals, and low-level programming. 
For engineers who love understanding systems at a fundamental level.&quot;,&quot;logo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;author_id&quot;:14520974,&quot;primary_user_id&quot;:14520974,&quot;theme_var_background_pop&quot;:&quot;#121BFA&quot;,&quot;created_at&quot;:&quot;2023-04-24T10:44:31.435Z&quot;,&quot;email_from_name&quot;:&quot;Abhinav from Coding Confessions&quot;,&quot;copyright&quot;:&quot;Abhinav Upadhyay&quot;,&quot;founding_plan_name&quot;:&quot;Founding Member&quot;,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;enabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false,&quot;homepage_type&quot;:&quot;newspaper&quot;,&quot;is_personal_mode&quot;:false}}],&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100,&quot;status&quot;:{&quot;bestsellerTier&quot;:100,&quot;subscriberTier&quot;:null,&quot;leaderboard&quot;:{&quot;ranking&quot;:&quot;paid&quot;,&quot;rank&quot;:241,&quot;publicationName&quot;:&quot;Confessions of a Code Addict&quot;,&quot;label&quot;:&quot;Technology&quot;,&quot;categoryId&quot;:4},&quot;vip&quot;:false,&quot;badge&quot;:{&quot;type&quot;:&quot;bestseller&quot;,&quot;tier&quot;:100}}}],&quot;utm_campaign&quot;:null,&quot;belowTheFold&quot;:true,&quot;type&quot;:&quot;newsletter&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="EmbeddedPostToDOM"><a class="embedded-post" native="true" href="https://blog.codingconfessions.com/p/cpython-vm-internals?utm_source=substack&amp;utm_campaign=post_embed&amp;utm_medium=web"><div class="embedded-post-header"><img class="embedded-post-publication-logo" src="https://substackcdn.com/image/fetch/$s_!lstI!,w_56,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png" loading="lazy"><span 
class="embedded-post-publication-name">Confessions of a Code Addict</span></div><div class="embedded-post-title-wrapper"><div class="embedded-post-title">The Design &amp; Implementation of the CPython Virtual Machine</div></div><div class="embedded-post-body">For every bytecode compiled language, the most interesting part of its implementation is its virtual machine (also referred to as the bytecode interpreter) where the bytecode execution takes place. Because this is such a crucial part of the language machinery, its implementation has to be highly performant. Even if you are not a compiler engineer, learning about such internal implementation can give you new performance tricks and insights that you may be able to use in other places of your job. And, if you are a compiler engineer then you should always look around how other languages are implemented to pickup implementation details that you may not be aware of&#8230;</div><div class="embedded-post-cta-wrapper"><span class="embedded-post-cta">Read more</span></div><div class="embedded-post-meta">2 years ago &#183; 46 likes &#183; Abhinav Upadhyay</div></a></div><p>Unfortunately, TorchFX had two significant drawbacks that required us to build a custom tracer. The first is that once you hook into the CPython interpreter to record your PyTorch function, <em>you have to actually run said function</em>. For PyTorch, this was not an issue because you could invoke your function with so-called &#8220;<a href="https://docs.pytorch.org/docs/stable/torch.compiler_fake_tensor.html">fake tensors</a>&#8221; that had the right data types, shapes, and devices, but allocated no memory. 
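</p><p>The mechanics are easier to see with a toy version of trace-by-execution: instead of real data, the function is invoked with proxy objects that record every operation applied to them. This is a deliberately simplified sketch of the general technique, not PyTorch&#8217;s actual tracer.</p>

```python
class Proxy:
    """A stand-in value that records operations instead of computing them."""

    def __init__(self, name, graph):
        self.name = name
        self.graph = graph

    def __mul__(self, other):
        out = Proxy(f"tmp_{len(self.graph)}", self.graph)
        rhs = other.name if isinstance(other, Proxy) else other
        self.graph.append(("mul", self.name, rhs, out.name))
        return out

def trace_by_running(fn, arg_names):
    """Run fn on proxies; the side effect is a recorded operation graph."""
    graph = []
    fn(*[Proxy(name, graph) for name in arg_names])  # the function really runs
    return graph

def scale(x, y):
    return x * y * 2.0

print(trace_by_running(scale, ["x", "y"]))
# [('mul', 'x', 'y', 'tmp_0'), ('mul', 'tmp_0', 2.0, 'tmp_1')]
```

<p>Because the function genuinely executes, every argument must be an object the body can operate on, which is why PyTorch needed fake tensors in the first place.</p><p>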
Furthermore, this way of running a function in order to trace it would be perfectly in line with how their legacy serialization APIs worked (<code>torch.jit</code> and <code>torch.onnx</code>).</p><p>Since we needed the ability to compile arbitrary Python functions, of which only a tiny subset (or none) could be PyTorch, we would need a similar mechanism for having developers provide us with their inputs to use for tracing. But unlike PyTorch, we could not create a fake image, or fake string, or fake whatever. To us, this became a dead end.</p><p>The second challenge was that even when we created fake data as inputs to the TorchFX tracer, we realized that it could only record PyTorch operations. We would have to heavily modify and extend the tracer to support tracing through arbitrary functions across hundreds or thousands of Python libraries. As such, we settled on building a tracer that would instead capture a Python function by parsing its abstract syntax tree (AST). Take an example function:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MwPB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F107c78f0-f099-429a-8477-be2b3e6c5f92_2856x708.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MwPB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F107c78f0-f099-429a-8477-be2b3e6c5f92_2856x708.png 424w, https://substackcdn.com/image/fetch/$s_!MwPB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F107c78f0-f099-429a-8477-be2b3e6c5f92_2856x708.png 848w, 
https://substackcdn.com/image/fetch/$s_!MwPB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F107c78f0-f099-429a-8477-be2b3e6c5f92_2856x708.png 1272w, https://substackcdn.com/image/fetch/$s_!MwPB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F107c78f0-f099-429a-8477-be2b3e6c5f92_2856x708.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MwPB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F107c78f0-f099-429a-8477-be2b3e6c5f92_2856x708.png" width="1456" height="361" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/107c78f0-f099-429a-8477-be2b3e6c5f92_2856x708.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:361,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:106657,&quot;alt&quot;:&quot;Python function that computes the area of a shape.&quot;,&quot;title&quot;:&quot;Python function that computes the area of a shape.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.muna.ai/i/171980795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F107c78f0-f099-429a-8477-be2b3e6c5f92_2856x708.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Python function that computes the area of a shape." title="Python function that computes the area of a shape." 
srcset="https://substackcdn.com/image/fetch/$s_!MwPB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F107c78f0-f099-429a-8477-be2b3e6c5f92_2856x708.png 424w, https://substackcdn.com/image/fetch/$s_!MwPB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F107c78f0-f099-429a-8477-be2b3e6c5f92_2856x708.png 848w, https://substackcdn.com/image/fetch/$s_!MwPB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F107c78f0-f099-429a-8477-be2b3e6c5f92_2856x708.png 1272w, https://substackcdn.com/image/fetch/$s_!MwPB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F107c78f0-f099-429a-8477-be2b3e6c5f92_2856x708.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Simple function that computes the area of a shape.</figcaption></figure></div><p>Our tracer would first extract an AST like so:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xZMZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fede0c5ab-511d-4f36-80bb-7e2100589382_3064x3948.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xZMZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fede0c5ab-511d-4f36-80bb-7e2100589382_3064x3948.png 424w, 
https://substackcdn.com/image/fetch/$s_!xZMZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fede0c5ab-511d-4f36-80bb-7e2100589382_3064x3948.png 848w, https://substackcdn.com/image/fetch/$s_!xZMZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fede0c5ab-511d-4f36-80bb-7e2100589382_3064x3948.png 1272w, https://substackcdn.com/image/fetch/$s_!xZMZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fede0c5ab-511d-4f36-80bb-7e2100589382_3064x3948.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xZMZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fede0c5ab-511d-4f36-80bb-7e2100589382_3064x3948.png" width="1456" height="1876" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ede0c5ab-511d-4f36-80bb-7e2100589382_3064x3948.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1876,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:619243,&quot;alt&quot;:&quot;Visualized AST of the Python function that computes the area of a shape.&quot;,&quot;title&quot;:&quot;Visualized AST of the Python function that computes the area of a shape.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.muna.ai/i/171980795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fede0c5ab-511d-4f36-80bb-7e2100589382_3064x3948.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Visualized AST of the Python function that computes the area of a 
shape." title="Visualized AST of the Python function that computes the area of a shape." srcset="https://substackcdn.com/image/fetch/$s_!xZMZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fede0c5ab-511d-4f36-80bb-7e2100589382_3064x3948.png 424w, https://substackcdn.com/image/fetch/$s_!xZMZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fede0c5ab-511d-4f36-80bb-7e2100589382_3064x3948.png 848w, https://substackcdn.com/image/fetch/$s_!xZMZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fede0c5ab-511d-4f36-80bb-7e2100589382_3064x3948.png 1272w, https://substackcdn.com/image/fetch/$s_!xZMZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fede0c5ab-511d-4f36-80bb-7e2100589382_3064x3948.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Visualized AST of the <code>compute_area</code> function above.</figcaption></figure></div><p>It would then step through the AST, resolve all function calls (i.e., figure out which source library each call belongs to), and emit a proprietary IR format. Currently, our symbolic tracer supports static analysis (via AST parsing), partial evaluation of the original Python code, live value introspection (using a <a href="https://docs.muna.ai/predictors/sandbox">sandbox</a>), and much more. But somehow, it&#8217;s the least interesting part of our compiler pipeline.</p><div><hr></div><h2><strong>Lowering to C++ via Type Propagation</strong></h2><p>This is where things get really interesting. 
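</p><p>Before unpacking the problem, here is the core trick in toy form: given the concrete types of a function&#8217;s inputs, we can deduce the type of every intermediate value in its operation graph. The graph tuples and the rule table below are illustrative only, not our real IR or lowering rules.</p>

```python
# Type-propagation sketch: ops are (opcode, lhs, rhs, output) tuples, and a
# small rule table mirrors Python's numeric promotion for multiplication.
RULES = {
    ("mul", float, float): float,
    ("mul", float, int): float,
    ("mul", int, float): float,
    ("mul", int, int): int,
}

def propagate_types(graph, input_types):
    """Deduce the type of every value in the graph from the input types."""
    types = dict(input_types)
    for op, lhs, rhs, out in graph:
        lhs_t = types.get(lhs, type(lhs))  # literals carry their own type
        rhs_t = types.get(rhs, type(rhs))
        types[out] = RULES[(op, lhs_t, rhs_t)]
    return types

# tmp_1 = x * y, tmp_2 = tmp_1 * 2, with float inputs x and y.
graph = [("mul", "x", "y", "tmp_1"), ("mul", "tmp_1", 2, "tmp_2")]
print(propagate_types(graph, {"x": float, "y": float}))
```

<p>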
Python is a dynamic language, so variables can be of any type, and those types can change easily:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4tIy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d066b-a371-4be1-bc00-b644075ea652_2744x1068.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4tIy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d066b-a371-4be1-bc00-b644075ea652_2744x1068.png 424w, https://substackcdn.com/image/fetch/$s_!4tIy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d066b-a371-4be1-bc00-b644075ea652_2744x1068.png 848w, https://substackcdn.com/image/fetch/$s_!4tIy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d066b-a371-4be1-bc00-b644075ea652_2744x1068.png 1272w, https://substackcdn.com/image/fetch/$s_!4tIy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d066b-a371-4be1-bc00-b644075ea652_2744x1068.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4tIy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d066b-a371-4be1-bc00-b644075ea652_2744x1068.png" width="1456" height="567" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d87d066b-a371-4be1-bc00-b644075ea652_2744x1068.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:567,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:177131,&quot;alt&quot;:&quot;Example code showing invoking a Python function multiple times, each with different argument types.&quot;,&quot;title&quot;:&quot;Example code showing invoking a Python function multiple times, each with different argument types.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.muna.ai/i/171980795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d066b-a371-4be1-bc00-b644075ea652_2744x1068.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Example code showing invoking a Python function multiple times, each with different argument types." title="Example code showing invoking a Python function multiple times, each with different argument types." 
srcset="https://substackcdn.com/image/fetch/$s_!4tIy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d066b-a371-4be1-bc00-b644075ea652_2744x1068.png 424w, https://substackcdn.com/image/fetch/$s_!4tIy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d066b-a371-4be1-bc00-b644075ea652_2744x1068.png 848w, https://substackcdn.com/image/fetch/$s_!4tIy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d066b-a371-4be1-bc00-b644075ea652_2744x1068.png 1272w, https://substackcdn.com/image/fetch/$s_!4tIy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d066b-a371-4be1-bc00-b644075ea652_2744x1068.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Example demonstrating Python&#8217;s dynamic nature.</figcaption></figure></div><p>C++, on the other hand, is a statically typed language, where each variable has a single, fixed type that must be known when it is declared. While bridging both of these languages might seem like an intractable problem, there&#8217;s actually a key insight we can take advantage of:</p><p>When we invoke a Python function with some given inputs, we can uniquely determine the types of all intermediate variables within that function<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RmBV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe4c9d13-8598-4b2a-a3c3-69472c1eb051_2465x2198.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RmBV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe4c9d13-8598-4b2a-a3c3-69472c1eb051_2465x2198.png 424w, https://substackcdn.com/image/fetch/$s_!RmBV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe4c9d13-8598-4b2a-a3c3-69472c1eb051_2465x2198.png 848w, 
https://substackcdn.com/image/fetch/$s_!RmBV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe4c9d13-8598-4b2a-a3c3-69472c1eb051_2465x2198.png 1272w, https://substackcdn.com/image/fetch/$s_!RmBV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe4c9d13-8598-4b2a-a3c3-69472c1eb051_2465x2198.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RmBV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe4c9d13-8598-4b2a-a3c3-69472c1eb051_2465x2198.png" width="1456" height="1298" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/be4c9d13-8598-4b2a-a3c3-69472c1eb051_2465x2198.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1298,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:239055,&quot;alt&quot;:&quot;Graph showing how variable types flow through a Python function when invoked.&quot;,&quot;title&quot;:&quot;Graph showing how variable types flow through a Python function when invoked.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.muna.ai/i/171980795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe4c9d13-8598-4b2a-a3c3-69472c1eb051_2465x2198.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Graph showing how variable types flow through a Python function when invoked." title="Graph showing how variable types flow through a Python function when invoked." 
srcset="https://substackcdn.com/image/fetch/$s_!RmBV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe4c9d13-8598-4b2a-a3c3-69472c1eb051_2465x2198.png 424w, https://substackcdn.com/image/fetch/$s_!RmBV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe4c9d13-8598-4b2a-a3c3-69472c1eb051_2465x2198.png 848w, https://substackcdn.com/image/fetch/$s_!RmBV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe4c9d13-8598-4b2a-a3c3-69472c1eb051_2465x2198.png 1272w, https://substackcdn.com/image/fetch/$s_!RmBV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe4c9d13-8598-4b2a-a3c3-69472c1eb051_2465x2198.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">We can track the types of every variable when a Python function is invoked.</figcaption></figure></div><p>If we know that the inputs <code>x</code> and <code>y</code> are <code>float</code> instances, then we know that the resulting type of their multiplication (i.e., <code>tmp_1</code>) is uniquely determined by whatever the <code>operator.mul</code> function returns. But how do we define <code>operator.mul</code> and get its return type? That&#8217;s where C++<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> comes in.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9Dae!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16b47dc6-049d-4c37-9510-87643e0e4202_2624x528.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9Dae!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16b47dc6-049d-4c37-9510-87643e0e4202_2624x528.png 424w, https://substackcdn.com/image/fetch/$s_!9Dae!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16b47dc6-049d-4c37-9510-87643e0e4202_2624x528.png 848w, 
https://substackcdn.com/image/fetch/$s_!9Dae!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16b47dc6-049d-4c37-9510-87643e0e4202_2624x528.png 1272w, https://substackcdn.com/image/fetch/$s_!9Dae!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16b47dc6-049d-4c37-9510-87643e0e4202_2624x528.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9Dae!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16b47dc6-049d-4c37-9510-87643e0e4202_2624x528.png" width="1456" height="293" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/16b47dc6-049d-4c37-9510-87643e0e4202_2624x528.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:293,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:56235,&quot;alt&quot;:&quot;C++ code showing one possible implementation of Python's multiplication operator.&quot;,&quot;title&quot;:&quot;C++ code showing one possible implementation of Python's multiplication operator.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.muna.ai/i/171980795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16b47dc6-049d-4c37-9510-87643e0e4202_2624x528.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="C++ code showing one possible implementation of Python's multiplication operator." title="C++ code showing one possible implementation of Python's multiplication operator." 
srcset="https://substackcdn.com/image/fetch/$s_!9Dae!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16b47dc6-049d-4c37-9510-87643e0e4202_2624x528.png 424w, https://substackcdn.com/image/fetch/$s_!9Dae!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16b47dc6-049d-4c37-9510-87643e0e4202_2624x528.png 848w, https://substackcdn.com/image/fetch/$s_!9Dae!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16b47dc6-049d-4c37-9510-87643e0e4202_2624x528.png 1272w, https://substackcdn.com/image/fetch/$s_!9Dae!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16b47dc6-049d-4c37-9510-87643e0e4202_2624x528.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Example C++ implementation of Python&#8217;s multiplication operator.</figcaption></figure></div><p>From the above, we now know that <code>tmp_1</code> must be a <code>float</code>. We can repeat this process for the addition call (<code>tmp_1 + z</code>) to get the final result.</p><p>At this point, it&#8217;s worth taking a moment to reflect on what we have created thus far:</p><ol><li><p>We can take a Python function and generate an intermediate representation (IR) that fully captures what it does.</p></li><li><p>We can then use the parameter type information, together with a C++ implementation of a Python operator (e.g. 
<code>operator.mul</code>), to fully determine the type of the first intermediate variable in our Python function.</p></li><li><p>We can repeat (2) for all subsequent intermediate variables in our Python function, until we have propagated types throughout the entire function.</p></li></ol><div><hr></div><h2><strong>Seeding the Type Propagation Process</strong></h2><p>One point worth expanding upon is how we get the initial parameter type information to kickstart the type propagation process. In the example above, how do we know that each of <code>x</code>, <code>y</code>, and <code>z</code> is a <code>float</code> instance?</p><p>After prototyping with a few different approaches, we settled on <a href="https://peps.python.org/pep-0484/">PEP 484</a>, which added support for type annotations in the Python language. Python itself completely ignores these type annotations, as they are not used at runtime<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. 
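</p><p>Because the annotations live on the function object itself, standard-library introspection is all that is needed to recover them. As a minimal sketch (plain Python, not our compiler internals), <code>typing.get_type_hints</code> yields exactly the seed information the propagation pass needs:</p>

```python
import typing

# The compute_area example from earlier, with PEP 484 annotations.
def compute_area(x: float, y: float, z: float) -> float:
    tmp_1 = x * y      # tmp_1's type follows from operator.mul(float, float)
    return tmp_1 + z   # the result's type follows from operator.add(float, float)

# Recover the annotations: these seed the type-propagation pass.
hints = typing.get_type_hints(compute_area)
param_types = {name: t for name, t in hints.items() if name != "return"}
print(param_types)  # {'x': <class 'float'>, 'y': <class 'float'>, 'z': <class 'float'>}
```

<p>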
And while they solved the problem of seeding type propagation, they came with two major drawbacks: first, they conflict with our most important design goal, because using them requires developers to modify their Python code a little<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!svCg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1bf97ed-75d0-417f-889e-337c3b53f80f_2624x708.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!svCg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1bf97ed-75d0-417f-889e-337c3b53f80f_2624x708.png 424w, https://substackcdn.com/image/fetch/$s_!svCg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1bf97ed-75d0-417f-889e-337c3b53f80f_2624x708.png 848w, https://substackcdn.com/image/fetch/$s_!svCg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1bf97ed-75d0-417f-889e-337c3b53f80f_2624x708.png 1272w, https://substackcdn.com/image/fetch/$s_!svCg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1bf97ed-75d0-417f-889e-337c3b53f80f_2624x708.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!svCg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1bf97ed-75d0-417f-889e-337c3b53f80f_2624x708.png" width="1456" height="393" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d1bf97ed-75d0-417f-889e-337c3b53f80f_2624x708.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:393,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:112867,&quot;alt&quot;:&quot;Python code showing the same `compute_area` function from above, but with added type annotations.&quot;,&quot;title&quot;:&quot;Python code showing the same `compute_area` function from above, but with added type annotations.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.muna.ai/i/171980795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1bf97ed-75d0-417f-889e-337c3b53f80f_2624x708.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Python code showing the same `compute_area` function from above, but with added type annotations." title="Python code showing the same `compute_area` function from above, but with added type annotations." 
srcset="https://substackcdn.com/image/fetch/$s_!svCg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1bf97ed-75d0-417f-889e-337c3b53f80f_2624x708.png 424w, https://substackcdn.com/image/fetch/$s_!svCg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1bf97ed-75d0-417f-889e-337c3b53f80f_2624x708.png 848w, https://substackcdn.com/image/fetch/$s_!svCg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1bf97ed-75d0-417f-889e-337c3b53f80f_2624x708.png 1272w, https://substackcdn.com/image/fetch/$s_!svCg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1bf97ed-75d0-417f-889e-337c3b53f80f_2624x708.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Adding type annotations to our Python function.</figcaption></figure></div><p>The code doesn&#8217;t look too different, and some argue that type annotations make for better Python code (<a href="https://docs.muna.ai/oss/style/python#type-annotations">we mandate them at Muna</a>). The second problem was that in order to design a simple and modular interface for consuming the compiled functions, we would have to constrain the <a href="https://docs.muna.ai/predictors/requirements#function-signature">number of distinct input types</a> that developers could use<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>. Ultimately, we decided this was a reasonable compromise with nice ergonomics.</p><div><hr></div><h2><strong>Building a Library of C++ Operators</strong></h2><p>At this point, you might have realized a glaring issue in our design: we need to write C++ implementations for potentially tens or hundreds of thousands of Python functions across different libraries. Thankfully, this is a lot less complicated than you might think. 
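</p><p>To see why, here is a small reconstruction of the <code>cosecant</code> example discussed below: the compiler can trace through <code>reciprocal</code> because its definition is in view, while the imported <code>sin</code> has no Python source to trace through and must become a C++ leaf node:</p>

```python
from math import sin  # imported: no Python definition to trace -> C++ leaf node

def reciprocal(x: float) -> float:
    # Defined locally: the compiler can trace this call and build an IR graph for it.
    return 1.0 / x

def cosecant(x: float) -> float:
    return reciprocal(sin(x))
```

<p>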
Consider the function below:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vhSC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52de065b-b734-42fe-96f4-b3ba241efd4a_2624x888.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vhSC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52de065b-b734-42fe-96f4-b3ba241efd4a_2624x888.png 424w, https://substackcdn.com/image/fetch/$s_!vhSC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52de065b-b734-42fe-96f4-b3ba241efd4a_2624x888.png 848w, https://substackcdn.com/image/fetch/$s_!vhSC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52de065b-b734-42fe-96f4-b3ba241efd4a_2624x888.png 1272w, https://substackcdn.com/image/fetch/$s_!vhSC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52de065b-b734-42fe-96f4-b3ba241efd4a_2624x888.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vhSC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52de065b-b734-42fe-96f4-b3ba241efd4a_2624x888.png" width="1456" height="493" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/52de065b-b734-42fe-96f4-b3ba241efd4a_2624x888.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:493,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:120400,&quot;alt&quot;:&quot;Image showing a Python function that invokes two other functions, one of which is defined while the other is imported.&quot;,&quot;title&quot;:&quot;Image showing a Python function that invokes two other functions, one of which is defined while the other is imported.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.muna.ai/i/171980795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52de065b-b734-42fe-96f4-b3ba241efd4a_2624x888.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image showing a Python function that invokes two other functions, one of which is defined while the other is imported." title="Image showing a Python function that invokes two other functions, one of which is defined while the other is imported." 
srcset="https://substackcdn.com/image/fetch/$s_!vhSC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52de065b-b734-42fe-96f4-b3ba241efd4a_2624x888.png 424w, https://substackcdn.com/image/fetch/$s_!vhSC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52de065b-b734-42fe-96f4-b3ba241efd4a_2624x888.png 848w, https://substackcdn.com/image/fetch/$s_!vhSC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52de065b-b734-42fe-96f4-b3ba241efd4a_2624x888.png 1272w, https://substackcdn.com/image/fetch/$s_!vhSC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52de065b-b734-42fe-96f4-b3ba241efd4a_2624x888.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">We only need C++ to cover functions whose definitions are not available.</figcaption></figure></div><p>When compiling <code>cosecant</code>, we see that there are function calls to <code>sin</code> and <code>reciprocal</code>. Our compiler first checks whether it can trace through each function call. In the case of <code>sin</code>, we don&#8217;t have a function definition for it (only an <code>import</code>), so we cannot trace through it<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>. This forms a leaf node that we must implement manually in C++. We can trace through the call to <code>reciprocal</code>, so we do, producing an IR graph for it. This graph can then be lowered and reused at other call sites.</p><p>The key insight above is that most Python functions our compiler will encounter are composed of a smaller set of elementary functions. What accounts for the large variety of code in the wild is not the number of unique elementary functions that make it up; rather, it&#8217;s the different arrangements of these elementary functions.</p><p>Still, you could argue that there are potentially thousands of these elementary functions across different libraries that we would have to cover, and you would be 100% correct. Thankfully, we now have an amazing tool that makes this an easy problem to solve: AI-powered code generation.</p><p>Today&#8217;s LLMs are capable of writing verifiably-correct, high-performance code across a wide variety of programming languages. 
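</p><p>Verifying a generated operator is conceptually simple: compare it against the Python reference on a large set of inputs. A simplified sketch of that idea (hypothetical function names; not our actual test harness):</p>

```python
import math
import random

def reference_reciprocal(x: float) -> float:
    # Ground truth: the Python semantics the generated code must preserve.
    return 1.0 / x

def generated_reciprocal(x: float) -> float:
    # Stand-in for an AI-generated implementation under test.
    return x ** -1

def verify(candidate, reference, trials=1000, tol=1e-12) -> bool:
    rng = random.Random(0)  # deterministic inputs make failures reproducible
    for _ in range(trials):
        x = rng.uniform(0.1, 100.0)
        if not math.isclose(candidate(x), reference(x), rel_tol=tol):
            return False
    return True

print(verify(generated_reciprocal, reference_reciprocal))  # True
```

<p>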
As such, we&#8217;ve been building infrastructure to constrain the code they generate, test the code to ensure correctness, and handle ancillary logic for things like dependency management and conditional compilation. So far, we have used AI to generate implementations of hundreds of Python functions <a href="https://docs.muna.ai/predictors/requirements#supported-libraries">across popular libraries</a> like NumPy, OpenCV, and PyTorch.</p><div><hr></div><h2><strong>Performance Optimization via Exhaustive Search</strong></h2><p>The final topic worth discussing is performance optimization. Most popular approaches here involve rolling out hand-written code (e.g. assembly or PTX); using heterogeneous accelerators (e.g. GPU, NPU); doing heuristic-based algorithm selection at runtime (e.g. convolution algorithm search in <a href="https://artificial-intelligence.sites.arm.com/computelibrary/v52.4.0/conv2d_heuristic.xhtml">ArmCL</a> and <a href="https://docs.nvidia.com/deeplearning/cudnn/backend/latest/api/cudnn-cnn-library.html#cudnnfindconvolutionforwardalgorithm">cuDNN</a>); or some combination thereof.</p><p>From our past experience building extremely low-latency computer vision pipelines for embedded systems, we have learned a very bitter lesson<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>: effective performance optimization <strong>is always empirical</strong>. The latency of a given operation on given hardware depends on so many factors that the only way to know <em>for sure</em> is to simply test every single approach you have. The only reason engineering teams don&#8217;t do this is that it is impractical: you would have to rewrite your code tens or hundreds of times and then test each variant&#8230;but wait!</p><p>Earlier, we went over how we propagated types through a Python function with the help of a C++ operator. 
What I didn&#8217;t mention was that we don&#8217;t just use one C++ operator; we use as many as we can write (*ahem* generate). So instead of this:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!leiG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c96b5b-48e0-46ec-8aaa-4a7f9bd75371_1711x388.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!leiG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c96b5b-48e0-46ec-8aaa-4a7f9bd75371_1711x388.png 424w, https://substackcdn.com/image/fetch/$s_!leiG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c96b5b-48e0-46ec-8aaa-4a7f9bd75371_1711x388.png 848w, https://substackcdn.com/image/fetch/$s_!leiG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c96b5b-48e0-46ec-8aaa-4a7f9bd75371_1711x388.png 1272w, https://substackcdn.com/image/fetch/$s_!leiG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c96b5b-48e0-46ec-8aaa-4a7f9bd75371_1711x388.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!leiG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c96b5b-48e0-46ec-8aaa-4a7f9bd75371_1711x388.png" width="1456" height="330" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a1c96b5b-48e0-46ec-8aaa-4a7f9bd75371_1711x388.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:330,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:54214,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.muna.ai/i/171980795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c96b5b-48e0-46ec-8aaa-4a7f9bd75371_1711x388.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!leiG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c96b5b-48e0-46ec-8aaa-4a7f9bd75371_1711x388.png 424w, https://substackcdn.com/image/fetch/$s_!leiG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c96b5b-48e0-46ec-8aaa-4a7f9bd75371_1711x388.png 848w, https://substackcdn.com/image/fetch/$s_!leiG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c96b5b-48e0-46ec-8aaa-4a7f9bd75371_1711x388.png 1272w, https://substackcdn.com/image/fetch/$s_!leiG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c96b5b-48e0-46ec-8aaa-4a7f9bd75371_1711x388.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">We don&#8217;t just generate one C++ program from a Python function.</figcaption></figure></div><p>What really happens is this:</p><div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DPWW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7da62f0-4f60-40fc-bc41-b05fcbe074de_2475x2787.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DPWW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7da62f0-4f60-40fc-bc41-b05fcbe074de_2475x2787.png 424w, https://substackcdn.com/image/fetch/$s_!DPWW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7da62f0-4f60-40fc-bc41-b05fcbe074de_2475x2787.png 848w, https://substackcdn.com/image/fetch/$s_!DPWW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7da62f0-4f60-40fc-bc41-b05fcbe074de_2475x2787.png 1272w, https://substackcdn.com/image/fetch/$s_!DPWW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7da62f0-4f60-40fc-bc41-b05fcbe074de_2475x2787.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DPWW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7da62f0-4f60-40fc-bc41-b05fcbe074de_2475x2787.png" width="1456" height="1640" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b7da62f0-4f60-40fc-bc41-b05fcbe074de_2475x2787.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1640,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:471061,&quot;alt&quot;:&quot;Graph of many C++ programs 
we generate from a single Python function.&quot;,&quot;title&quot;:&quot;Graph of many C++ programs we generate from a single Python function.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.muna.ai/i/171980795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7da62f0-4f60-40fc-bc41-b05fcbe074de_2475x2787.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Graph of many C++ programs we generate from a single Python function." title="Graph of many C++ programs we generate from a single Python function." srcset="https://substackcdn.com/image/fetch/$s_!DPWW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7da62f0-4f60-40fc-bc41-b05fcbe074de_2475x2787.png 424w, https://substackcdn.com/image/fetch/$s_!DPWW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7da62f0-4f60-40fc-bc41-b05fcbe074de_2475x2787.png 848w, https://substackcdn.com/image/fetch/$s_!DPWW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7da62f0-4f60-40fc-bc41-b05fcbe074de_2475x2787.png 1272w, https://substackcdn.com/image/fetch/$s_!DPWW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7da62f0-4f60-40fc-bc41-b05fcbe074de_2475x2787.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">We generate as many C++ programs as we can from a single Python function.</figcaption></figure></div><p>Each path from <code>start</code> to <code>result</code> is a unique program, guaranteed to be correct with respect to the original Python function. But each C++ operator (a colored rectangle in the graph) could be powered by different algorithms, libraries, and even hardware accelerators. 
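</p><p>The selection step itself is a straightforward benchmark loop. A minimal sketch of the idea (hypothetical candidate implementations below; real candidates would be compiled operator variants):</p>

```python
import time

def benchmark(fn, args, repeats=50):
    # Best-of-N wall-clock timing to suppress scheduler noise.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

def select_fastest(candidates, args):
    # Empirical selection: measure every candidate, keep the fastest.
    return min(candidates, key=lambda fn: benchmark(fn, args))

# Two interchangeable implementations of the same operation.
def sum_builtin(xs):
    return sum(xs)

def sum_loop(xs):
    total = 0
    for x in xs:
        total += x
    return total

data = list(range(10_000))
fastest = select_fastest([sum_builtin, sum_loop], (data,))
```

<p>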
Let&#8217;s walk through a concrete example:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!PE_E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea5db84-da19-4929-9444-6f99ab5d4665_2832x796.png" width="1456" height="409" alt="Python function that resizes an image to 64x64." title="Python function that resizes an image to 64x64." loading="lazy"><figcaption class="image-caption">A simple Python function to resize an image.</figcaption></figure></div><p>The function above resizes an input image to <code>64x64</code> with bilinear resampling using the <code>torchvision</code> library. When compiling this function for Apple Silicon (macOS, iOS, or visionOS), we have a range of approaches and libraries to choose from, including:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!_gxd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64eb40c2-6dde-4e93-8d65-8a1d40042afa_4342x4042.png" width="1456" height="1355" alt="Image showing how we generate C++ programs that use different approaches to resize an image." title="Image showing how we generate C++ programs that use different approaches to resize an image." loading="lazy"><figcaption class="image-caption">Each implementation uses a different approach to resizing the image.</figcaption></figure></div><p>The above is just a small selection, as the possibilities for implementing a bilinear resize operation on Apple Silicon are numerous (e.g. using the GPU or the Neural Engine). The key is that we can generate as many of these as possible (thanks to LLM-powered codegen), then emit compiled programs that use each one&#8212;with no limits. So in the example above, the user&#8217;s Python function would be emitted as four unique programs for Apple Silicon alone. In our real-world testing, we have seen a single Python function emitted as almost 200 unique programs across 9 compile targets.</p><p>From here, we can easily test each compiled function to discover which one runs the fastest on given hardware. We gather fine-grained telemetry data containing latency information for each operation, and use it to build statistical models that predict which variant runs the fastest. There are two significant benefits to this design:</p><ol><li><p>We can optimize code purely empirically. We don&#8217;t make any assumptions about which code might perform best, and we don&#8217;t need a separate performance-tuning step after generating code. We simply ship every compiled binary we have, gather telemetry data, and use it to discover which one is the fastest.</p></li><li><p>We benefit from network effects.
Because the C++ operators are shared among thousands of compiled functions, and because we ship these compiled functions to hundreds of thousands of unique devices across all of our users, we have a wealth of data that we can use to optimize every piece of code we generate.</p></li></ol><p>For our users, this will feel like their compiled Python functions running faster over time, entirely on autopilot.</p><div><hr></div><h2><strong>Designing a User Interface for the Compiler</strong></h2><p>Now, we have to wrap everything we&#8217;ve covered above into a user interface. Our most important guiding principle was to design something with near-zero cognitive load. Specifically, we didn&#8217;t want developers to have to learn anything new to use the compiler. We decided to go with <a href="https://peps.python.org/pep-0318/">PEP 318</a> decorators:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!-CJO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea421690-0de0-4c1e-aaed-6d07ec93b38d_2512x976.png" width="1456" height="566" alt="Image showing how developers use our compiler by adding an `@compile` decorator to their function." loading="lazy"><figcaption class="image-caption">Developers simply have to <code>@compile</code> their Python function.</figcaption></figure></div><p>Developers could simply decorate their Python function with <code>@compile</code> to specify the compilation entrypoint. Then, they would compile the function<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a> and all its dependencies using the CLI:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!IFp_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa11e5651-c5e6-4628-8b3d-a5cc74bc34e7_2512x436.png" width="1456" height="253" alt="" loading="lazy"></figure></div><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!jqDd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d262d1a-bc43-449b-944e-ca52da798f26_2012x1080.gif" width="1456" height="781" alt="Animated image showing compilation of a Python function with the Muna command line interface." loading="lazy"><figcaption class="image-caption">Compiling a Python function with the Muna command line interface.</figcaption></figure></div><p>We fell in love with the decorator paradigm after seeing how strongly developers preferred expressing complex infrastructure as code<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a>. Furthermore, it was a familiar form factor within the Python ecosystem, evidenced by its use in Numba and PyTorch. With the decorator, our CLI could find the compilation entrypoint function and use it as a springboard to crawl through all other dependency code (both first-party code provided by the developer, and third-party packages installed via <code>pip</code> or <code>uv</code>).</p><p>The <code>@compile</code> decorator would also serve as the primary customization point for developers compiling their function. Beyond the required <code>tag</code> (which uniquely identifies the function on our platform) and <code>description</code>, developers could provide a sandbox description to recreate their local development environment (e.g. installing Python packages, uploading files), along with <code>metadata</code> to assist the compiler during codegen (e.g. 
<a href="https://docs.muna.ai/predictors/ai#inference-backends">running PyTorch AI inference</a> with ONNXRuntime, TensorRT, CoreML, IREE, QNN, and more).</p><p>Once compiled, anyone can run the compiled function anywhere<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!HbPW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d9d2422-c320-4dfd-9b56-732e1a334b6f_2756x1608.png" width="1456" height="850" alt="" loading="lazy"><figcaption class="image-caption">Invoking the compiled function with the Muna command line interface.</figcaption></figure></div><div><hr></div><h2><strong>Closing Thoughts</strong></h2><p>In all candor, we still have a high level of disbelief that any of this <em>actually</em> works. That said, the compiler still has partial or missing support for a number of standard Python features: exceptions, lambda expressions, recursive functions, and classes. The through-line connecting these missing features is our type propagation system. While type propagation works for simple functions with unitary parameter and return types, it requires additional consideration for composite types (e.g. unions) and higher-order types (e.g. classes, lambda expressions).</p><p>The other significant item we are still figuring out is the debugging experience.
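</p><p>To make the earlier point about type propagation concrete, here is the kind of function that currently needs extra machinery (a hypothetical example, not taken from the compiler itself): its return type depends on a runtime branch, so lowering it requires a sum type rather than a single unitary C++ type.</p>

```python
def parse_value(raw: str):
    """Return an int when the input is numeric, otherwise the raw string.

    Hypothetical example: the return type is the union of int and str,
    so a static lowering pass cannot assign this function a single
    unitary C++ return type. It would need a sum type (e.g. a
    std::variant of int64_t and std::string), and every caller would
    have to branch on which alternative is active.
    """
    if raw.isdigit():
        return int(raw)
    return raw
```

<p>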
The good news for us is that we guarantee that developers&#8217; Python code will work as expected once compiled, absolving them of any responsibility to debug the code at runtime. This is similar to how developers who use Docker or other containerization technologies simply expect everything to work&#8212;almost nobody debugs their Docker image layers. The bad news is that because we enable developers to run their Python code anywhere, we have to figure out how to write extremely safe code, and how to gather fine-grained, symbolicated trace data when some function raises an exception. This is further complicated by the fact that, to deliver the smallest and fastest compiled binaries possible, we compile generated code with full optimizations, inevitably stripping out valuable debug information.</p><p>It has not all been difficult though, especially because the evolving C++ standard has been a major boon for us. Muna would not exist without C++20, because our code generation relies extensively on <code>std::span</code>, concepts, and most importantly, coroutines. And we&#8217;re dying for broad C++23 support, because we use <code>std::generator</code> to <a href="https://docs.muna.ai/predictions/stream">support streaming</a>, <code>&lt;stdfloat&gt;</code> to support <code>float16_t</code> and <code>bfloat16_t</code>, and <code>&lt;stacktrace&gt;</code> to support Python exceptions.</p><p>On a final note, if you are currently deploying embedding models or object detection models in your organization, or if you find any of this work interesting, <a href="https://muna.ai/slack">we would love to chat with you</a>.
We&#8217;d love for more developers to use the compiler on problems and programs we haven&#8217;t yet run ourselves, and we love to meet developers who enjoy the mundane, low-level worlds of Python, C++, and everything in-between.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://muna.ai/slack&quot;,&quot;text&quot;:&quot;Come Chat with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://muna.ai/slack"><span>Come Chat with Us</span></a></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p><a href="https://youtu.be/IHHXYqdCV_M?t=21">Rumble Racing</a> and <a href="https://youtu.be/dz9hN_dfLz0">Sly Cooper</a>: Not the most well-known titles on the PS2, but games that carry incredible amounts of sentimental value from my upbringing.</p><p></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>One way to think about bridging Python and C++ during lowering is by analogy to implicit template instantiation in C++.
The Python function defines a template function, and the input types are used to instantiate a concrete function therefrom.</p><p></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Technically, our compiler doesn&#8217;t just compile to C++. We use C++ as the primary language for code generation, but we also emit Objective-C and Rust in some cases. Furthermore, we are actively exploring emitting Mojo.</p><p></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>The main exception here is data validation libraries like <a href="https://docs.pydantic.dev/latest/">Pydantic</a>, which use type hints to build schemas for validating and serializing data.</p><p></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Only the compiler entrypoint function (i.e. the function which is decorated with <code>@compile</code>) requires type annotations. All other functions can be duck typed as normal.</p><p></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Only the compiler entrypoint function (i.e. the function which is decorated with <code>@compile</code>) is subject to this constraint.
All other functions can accept arbitrary input types, and return arbitrary output types.</p><p></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Our <code>@compile</code> decorator supports providing a list of <code>trace_modules</code> which opt entire modules into tracing. Functions that are not provided as part of a developer&#8217;s original Python code must explicitly be opted into tracing.</p><p></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p><a href="https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf">The Bitter Lesson</a> by Rich Sutton.</p><p></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>By default, we currently compile for Android, iOS, Linux, macOS, WebAssembly, and Windows.</p><p></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>We took inspiration from projects like Pulumi and Modal.</p><p></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>We provide client libraries for <a href="https://github.com/muna-ai/muna-py">Python</a>, <a href="https://github.com/muna-ai/muna-js">JavaScript</a> (browser and Node.js), <a href="https://github.com/muna-ai/muna-swift">Swift</a> (iOS), <a 
href="https://central.sonatype.com/artifact/ai.muna/muna">Kotlin</a> (Android), and <a href="https://github.com/muna-ai/muna-unity">Unity Engine</a>. And our React Native client is coming soon.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[What Makes System Calls Expensive: A Linux Internals Deep Dive]]></title><description><![CDATA[An explanation of how Linux handles system calls on x86-64 and why they show up as expensive operations in performance profiles]]></description><link>https://blog.codingconfessions.com/p/what-makes-system-calls-expensive</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/what-makes-system-calls-expensive</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Tue, 16 Sep 2025 18:03:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!eEc8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa686ab6a-05dc-4e23-ab86-75fce9a66356_1536x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eEc8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa686ab6a-05dc-4e23-ab86-75fce9a66356_1536x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eEc8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa686ab6a-05dc-4e23-ab86-75fce9a66356_1536x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!eEc8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa686ab6a-05dc-4e23-ab86-75fce9a66356_1536x1024.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!eEc8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa686ab6a-05dc-4e23-ab86-75fce9a66356_1536x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!eEc8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa686ab6a-05dc-4e23-ab86-75fce9a66356_1536x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eEc8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa686ab6a-05dc-4e23-ab86-75fce9a66356_1536x1024.jpeg" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a686ab6a-05dc-4e23-ab86-75fce9a66356_1536x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:120562,&quot;alt&quot;:&quot;Cover: A Flamegraph highlighting performance overhead due to system calls&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/172480326?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa686ab6a-05dc-4e23-ab86-75fce9a66356_1536x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Cover: A Flamegraph highlighting performance overhead due to system calls" title="Cover: A Flamegraph highlighting performance overhead due to system calls" 
srcset="https://substackcdn.com/image/fetch/$s_!eEc8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa686ab6a-05dc-4e23-ab86-75fce9a66356_1536x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!eEc8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa686ab6a-05dc-4e23-ab86-75fce9a66356_1536x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!eEc8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa686ab6a-05dc-4e23-ab86-75fce9a66356_1536x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!eEc8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa686ab6a-05dc-4e23-ab86-75fce9a66356_1536x1024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Cover: A Flamegraph highlighting performance overhead due to system calls</figcaption></figure></div><p>System calls are how user programs talk to the operating system. They include opening files, reading the current time, creating processes, and more. They&#8217;re unavoidable, but they&#8217;re also not cheap.</p><p>If you&#8217;ve ever looked at a flame graph, you&#8217;ll notice system calls often show up as hot spots. Engineers spend a lot of effort cutting them down, and whole features such as io_uring for batching I/O or eBPF for running code inside the kernel exist just to reduce how often programs have to cross into kernel mode.</p><p>Why are they so costly? The obvious part is the small bit of kernel code that runs for each call. The bigger cost comes from what happens around it: every transition into the kernel makes the CPU drop its optimizations, flush pipelines, and reset predictor state, then rebuild them again on return. This disruption is what makes system calls much more expensive than they appear in the source code.</p><p>In this article, we&#8217;ll look at what really happens when you make a system call on Linux x86-64. 
We&#8217;ll follow the kernel entry and exit path, analyse the direct overheads, and then dig into the indirect microarchitectural side-effects that explain why minimizing system calls is such an important optimization.</p><div><hr></div><h3>CodeRabbit: Free AI Code Reviews in CLI (<em>Sponsored</em>)</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://coderabbit.link/fIVg8LI" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hJCC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png 424w, https://substackcdn.com/image/fetch/$s_!hJCC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png 848w, https://substackcdn.com/image/fetch/$s_!hJCC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png 1272w, https://substackcdn.com/image/fetch/$s_!hJCC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hJCC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png" width="1456" height="879" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:879,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;CodeRabbit CLI: A Code Review Agent to Review your AI Generated Code&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://coderabbit.link/fIVg8LI&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="CodeRabbit CLI: A Code Review Agent to Review your AI Generated Code" title="CodeRabbit CLI: A Code Review Agent to Review your AI Generated Code" srcset="https://substackcdn.com/image/fetch/$s_!hJCC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png 424w, https://substackcdn.com/image/fetch/$s_!hJCC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png 848w, https://substackcdn.com/image/fetch/$s_!hJCC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png 1272w, https://substackcdn.com/image/fetch/$s_!hJCC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">CodeRabbit CLI: A Code Review Agent to Review your AI Generated Code</figcaption></figure></div><p>As developers increasingly turn to CLI coding agents like Claude Code for rapid development, a critical gap emerges: who reviews the AI-generated code? CodeRabbit CLI fills this void by delivering senior-level code reviews directly in your terminal, creating a seamless workflow where code generation flows directly into automated validation. Review uncommitted changes, catch AI hallucinations, and get one-click fixes - all without leaving your command line. 
It's the quality gate that makes autonomous coding truly possible, ensuring every line of AI-generated code meets production standards before it ships.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://coderabbit.link/fIVg8LI&quot;,&quot;text&quot;:&quot;Get Started Today&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://coderabbit.link/fIVg8LI"><span>Get Started Today</span></a></p><div><hr></div><h2>Background on System Calls</h2><p>Let&#8217;s start with a quick overview of system calls. These are routines inside the kernel that provide specific services to user space. They live in the kernel because they need privileged access to registers, instructions, or hardware devices. For example, reading a file from disk requires talking to the disk controller, and creating a new process requires allocating hardware resources. Both are privileged operations, which is why they are system calls.</p><p>Invoking a system call requires a special mechanism to switch execution from user space to kernel space. On x86-64 this is done using the <code>syscall</code> instruction, where you place the syscall number in <code>rax</code> and the arguments in registers (<code>rdi</code>, <code>rsi</code>, <code>rdx</code>, <code>r10</code>, <code>r8</code>, <code>r9</code>), then invoke <code>syscall</code>:</p><pre><code># set up arguments for the read system call
movq $0, %rax    # syscall number (0 = read on x86-64)
movq $0, %rdi    # arg 1: file descriptor (0 = stdin)
movq $buf, %rsi  # arg 2: buffer to read into
movq $size, %rdx # arg 3: number of bytes to read
syscall          # we enter the kernel here
movq %rax, %rbx</code></pre><p>On encountering this instruction, the processor switches to kernel mode and jumps to the registered syscall entry path. The kernel completes the context switch (switching the page tables and stack) and then jumps to the specific syscall implementation.</p><p>When the syscall finishes, it places the return value in <code>rax</code> and returns. Returning requires another privilege mode switch, reversing everything done on entry: restoring the user page table, stack, and registers.</p><p>The following diagram illustrates the sequence of steps required to execute a system call (<code>read</code> in this case). </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vTBi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d138001-3c2c-4d7d-8797-9db03f0cab97_857x567.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vTBi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d138001-3c2c-4d7d-8797-9db03f0cab97_857x567.png 424w, https://substackcdn.com/image/fetch/$s_!vTBi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d138001-3c2c-4d7d-8797-9db03f0cab97_857x567.png 848w, https://substackcdn.com/image/fetch/$s_!vTBi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d138001-3c2c-4d7d-8797-9db03f0cab97_857x567.png 1272w, https://substackcdn.com/image/fetch/$s_!vTBi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d138001-3c2c-4d7d-8797-9db03f0cab97_857x567.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vTBi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d138001-3c2c-4d7d-8797-9db03f0cab97_857x567.png" width="857" height="567" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d138001-3c2c-4d7d-8797-9db03f0cab97_857x567.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/27883ed8-3363-4575-b71d-3aa1d3dfead6_857x567.png&quot;,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:567,&quot;width&quot;:857,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:72116,&quot;alt&quot;:&quot;Flow of a read system call: user space sets up arguments and invokes syscall, control transfers to the kernel entry handler, the kernel executes the system call (keys_read), and then returns control back to user space.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/172480326?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27883ed8-3363-4575-b71d-3aa1d3dfead6_857x567.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Flow of a read system call: user space sets up arguments and invokes syscall, control transfers to the kernel entry handler, the kernel executes the system call (keys_read), and then returns control back to user space." title="Flow of a read system call: user space sets up arguments and invokes syscall, control transfers to the kernel entry handler, the kernel executes the system call (keys_read), and then returns control back to user space." 
srcset="https://substackcdn.com/image/fetch/$s_!vTBi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d138001-3c2c-4d7d-8797-9db03f0cab97_857x567.png 424w, https://substackcdn.com/image/fetch/$s_!vTBi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d138001-3c2c-4d7d-8797-9db03f0cab97_857x567.png 848w, https://substackcdn.com/image/fetch/$s_!vTBi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d138001-3c2c-4d7d-8797-9db03f0cab97_857x567.png 1272w, https://substackcdn.com/image/fetch/$s_!vTBi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d138001-3c2c-4d7d-8797-9db03f0cab97_857x567.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Flow of a <code>read</code> system call: user space sets up arguments and invokes <code>syscall</code>, control transfers to the kernel entry handler, the kernel executes the system call (<code>ksys_read</code>), and then returns control back to user space.</figcaption></figure></div><p>In the figure:</p><ul><li><p>User space code sets up arguments for the <code>read</code> system call.</p></li><li><p>It invokes the system call using the <code>syscall</code> instruction.</p></li><li><p>The instruction switches to kernel mode and enters the syscall entry handler, where the kernel switches to its own page table and stack.</p></li><li><p>The kernel then jumps to the implementation of the <code>read</code> system call.</p></li><li><p>After returning, the kernel restores the user space page table and stack, then control resumes at the next user instruction.</p></li></ul><p>Now that we have this high-level overview, let&#8217;s look inside the Linux kernel&#8217;s syscall handler to understand each step in more detail.</p><h2>Inside the Linux Syscall Handler</h2><p>When a system call is invoked, the CPU jumps into the kernel&#8217;s designated system call handler. The following diagram shows the Linux kernel code for this handler for the x86-64 architecture from the file <a href="https://elixir.bootlin.com/linux/v6.12/source/arch/x86/entry/entry_64.S">entry_64.S</a>. In the diagram, you can see the set of steps the kernel needs to perform before it can actually execute the system call.
Let&#8217;s briefly discuss each of these.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YXRa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21546e-2b50-4079-95a1-fb9be0d6bd05_1180x740.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YXRa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21546e-2b50-4079-95a1-fb9be0d6bd05_1180x740.png 424w, https://substackcdn.com/image/fetch/$s_!YXRa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21546e-2b50-4079-95a1-fb9be0d6bd05_1180x740.png 848w, https://substackcdn.com/image/fetch/$s_!YXRa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21546e-2b50-4079-95a1-fb9be0d6bd05_1180x740.png 1272w, https://substackcdn.com/image/fetch/$s_!YXRa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21546e-2b50-4079-95a1-fb9be0d6bd05_1180x740.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YXRa!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21546e-2b50-4079-95a1-fb9be0d6bd05_1180x740.png" width="1200" height="752.542372881356" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3b21546e-2b50-4079-95a1-fb9be0d6bd05_1180x740.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c6a0cc24-9db2-4892-b31e-88ca13559cf0_1180x740.png&quot;,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:740,&quot;width&quot;:1180,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:180885,&quot;alt&quot;:&quot;Actual x86-64 syscall entry code from Linux kernel (entry_64.S), annotated to show the steps the kernel performs before invoking the system call.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/172480326?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6a0cc24-9db2-4892-b31e-88ca13559cf0_1180x740.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="Actual x86-64 syscall entry code from Linux kernel (entry_64.S), annotated to show the steps the kernel performs before invoking the system call." title="Actual x86-64 syscall entry code from Linux kernel (entry_64.S), annotated to show the steps the kernel performs before invoking the system call." 
srcset="https://substackcdn.com/image/fetch/$s_!YXRa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21546e-2b50-4079-95a1-fb9be0d6bd05_1180x740.png 424w, https://substackcdn.com/image/fetch/$s_!YXRa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21546e-2b50-4079-95a1-fb9be0d6bd05_1180x740.png 848w, https://substackcdn.com/image/fetch/$s_!YXRa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21546e-2b50-4079-95a1-fb9be0d6bd05_1180x740.png 1272w, https://substackcdn.com/image/fetch/$s_!YXRa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21546e-2b50-4079-95a1-fb9be0d6bd05_1180x740.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Actual x86-64 syscall entry code from Linux kernel (<code>entry_64.S</code>), annotated to show the steps the kernel performs before invoking the system call.</figcaption></figure></div><h3>Swapping the GS Register</h3><p>GS is a segment register in the x86 architecture. In user space it is primarily used for <a href="https://en.wikipedia.org/wiki/Thread-local_storage">thread-local storage</a> (TLS). In kernel space it is used for implementing per-cpu variables, such as a pointer to the currently executing task. So, the first thing that the kernel does is restore the kernel mode value of the GS register.</p><h3>Switching to Kernel Page Table and Kernel Stack</h3><p>The Linux kernel has its own page table with mappings for kernel memory pages. To be able to access its memory it must restore this page table. It does this by calling the <code>SWITCH_TO_KERNEL_CR3</code> macro. </p><blockquote><p><em>On x86, the CR3 control register is designated to store the address of the root of the page table. This is why the macro for switching page tables is called </em><code>SWITCH_TO_KERNEL_CR3.</code></p></blockquote><p>Separately, the kernel has its own fixed-size stack for executing kernel-side code. At this point the <code>rsp</code> register still points to the user space stack, so the kernel saves it in a scratch space and then restores its own stack pointer from a per-cpu variable.</p><p>When returning from the system call, the kernel restores the user page table and stack by reversing these operations. 
This code is not shown in the diagram but happens right after the &#8220;<code>call do_syscall_64</code>&#8221; step.</p><h3>Saving User Space Registers</h3><p>At this time, the CPU registers still contain the values they had while executing user space code. They would be overwritten when the kernel code executes, so to prevent that, the kernel saves their values on the kernel stack. After that, it sanitizes those registers for security. All of this can be seen in boxes 3 and 4 in the diagram.</p><h3>Mitigations Against Speculative Execution Attacks</h3><p>The next three steps in the code are:</p><ul><li><p>Enabling IBRS (<a href="https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/indirect-branch-restricted-speculation.html">indirect branch restricted speculation</a>)</p></li><li><p>Untraining the return stack buffer</p></li><li><p>Clearing the branch history buffer</p></li></ul><p>These mitigate speculative execution attacks such as <a href="https://en.wikipedia.org/wiki/Spectre_(security_vulnerability)">Spectre</a> (v1 and v2) and <a href="https://en.wikipedia.org/wiki/Retbleed">Retbleed</a>. Speculative execution is an optimization in modern processors where they predict the outcome of branches in the code and speculatively execute instructions along the predicted path. When the predictions are accurate, this significantly improves the performance of the code. </p><p>However, vulnerabilities have been found where a malicious user program may train the branch predictor in ways that cause the CPU to speculatively execute along attacker&#8209;chosen paths inside the kernel. While these speculative paths do not change the logical flow of kernel execution, they can leak information through microarchitectural side&#8209;channels such as the cache. </p><p>These mitigations prevent user&#8209;controlled branch predictor state from influencing speculative execution in the kernel. 
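</p><p>If you are curious which of these mitigations are active on your machine, the kernel reports their status through files under <code>/sys/devices/system/cpu/vulnerabilities/</code>. The short sketch below reads a couple of them; treat it as illustrative, since the exact set of files varies by kernel version and CPU:</p><pre><code>#include &lt;stdio.h&gt;

/* Reads one kernel-reported mitigation status file into buf.
 * Returns 1 on success, 0 if the file is absent on this kernel/CPU. */
static int read_status(const char *path, char *buf, int len) {
  FILE *f = fopen(path, "r");
  if (!f) return 0;
  int ok = fgets(buf, len, f) != NULL;
  fclose(f);
  return ok;
}

int main(void) {
  const char *files[] = {
    "/sys/devices/system/cpu/vulnerabilities/spectre_v2",
    "/sys/devices/system/cpu/vulnerabilities/retbleed",
  };
  char line[256];
  for (int i = 0; i &lt; 2; i++)
    if (read_status(files[i], line, sizeof line))
      printf("%s: %s", files[i], line);
  return 0;
}
</code></pre><p>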
But, these also come at a great performance cost. We will revisit these in detail later, when discussing the impact of system calls on branch prediction. </p><h3>Executing the System Call and Returning Back to User Space</h3><p>After all of this setup, the kernel finally calls the function <code>do_syscall_64</code>. This is where the actual system call gets invoked. We will not look inside of this function because our focus is on performance impact rather than a walkthrough of kernel code.</p><p>Once the system call is done, the <code>do_syscall_64</code> function returns. The kernel then restores the user space state, including registers, page table, and stack, and returns control back to user space. The following diagram shows the code after the <code>do_syscall_64</code> call to highlight this part.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yEHa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea94fab-ccf9-41d2-ad43-8e7b10f9cbc0_1051x960.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yEHa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea94fab-ccf9-41d2-ad43-8e7b10f9cbc0_1051x960.png 424w, https://substackcdn.com/image/fetch/$s_!yEHa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea94fab-ccf9-41d2-ad43-8e7b10f9cbc0_1051x960.png 848w, https://substackcdn.com/image/fetch/$s_!yEHa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea94fab-ccf9-41d2-ad43-8e7b10f9cbc0_1051x960.png 1272w, 
https://substackcdn.com/image/fetch/$s_!yEHa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea94fab-ccf9-41d2-ad43-8e7b10f9cbc0_1051x960.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yEHa!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea94fab-ccf9-41d2-ad43-8e7b10f9cbc0_1051x960.png" width="1200" height="1096.0989533777356" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dea94fab-ccf9-41d2-ad43-8e7b10f9cbc0_1051x960.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/467434e2-c5c8-4796-ac36-0b196637cc65_1051x960.png&quot;,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:960,&quot;width&quot;:1051,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:181388,&quot;alt&quot;:&quot;Actual x86-64 syscall exit path code from Linux kernel (entry_64.S), showing how the kernel restores user registers, page tables, and state before returning control to user space.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/172480326?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F467434e2-c5c8-4796-ac36-0b196637cc65_1051x960.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="Actual x86-64 syscall exit path code from Linux kernel (entry_64.S), showing how the kernel restores user registers, page tables, and state before returning control to user space." 
title="Actual x86-64 syscall exit path code from Linux kernel (entry_64.S), showing how the kernel restores user registers, page tables, and state before returning control to user space." srcset="https://substackcdn.com/image/fetch/$s_!yEHa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea94fab-ccf9-41d2-ad43-8e7b10f9cbc0_1051x960.png 424w, https://substackcdn.com/image/fetch/$s_!yEHa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea94fab-ccf9-41d2-ad43-8e7b10f9cbc0_1051x960.png 848w, https://substackcdn.com/image/fetch/$s_!yEHa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea94fab-ccf9-41d2-ad43-8e7b10f9cbc0_1051x960.png 1272w, https://substackcdn.com/image/fetch/$s_!yEHa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea94fab-ccf9-41d2-ad43-8e7b10f9cbc0_1051x960.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Actual x86-64 syscall exit path code from Linux kernel (<code>entry_64.S</code>), showing how the kernel restores user registers, page tables, and state before returning control to user space.</figcaption></figure></div><p>Now that we have seen all the code the kernel executes to enter and exit a system call, we are ready to discuss the overheads introduced. There are two categories:</p><ul><li><p>Direct overhead from the code executed on entry and return.</p></li><li><p>Indirect overhead from microarchitectural side-effects (e.g. clearing the branch history buffer and return stack buffer).</p></li></ul><p>The major focus of this article is on discussing the indirect overhead induced by system calls. But before we go any further, let&#8217;s do a quick benchmark to measure the impact of the direct overheads.</p><div><hr></div><h2>Direct Overhead of System Calls</h2><p>Direct overhead is largely fixed across all system calls, since each system call must perform the same entry and exit steps. We can do a rough measurement of this overhead with a simple benchmark by comparing the number of cycles taken to execute the&nbsp;<a href="https://man7.org/linux/man-pages/man3/clock_gettime.3.html">clock_gettime</a>&nbsp;system call in the kernel versus executing it in user space.</p><p>The&nbsp;<code>clock_gettime</code>&nbsp;system call reads a system clock, such as the realtime clock (seconds since the Unix epoch) or the monotonic clock (seconds since kernel boot). It is very frequently used in software. For example, Java&#8217;s&nbsp;<code>System.currentTimeMillis()</code>&nbsp;and Python&#8217;s&nbsp;<code>time.time()</code>&nbsp;and&nbsp;<code>time.perf_counter()</code>&nbsp;use it under the hood.</p><p>Because system calls are expensive, Linux provides an optimization called&nbsp;<a href="https://en.wikipedia.org/wiki/VDSO">vDSO</a>&nbsp;(virtual dynamic shared object). 
This is a user-space shortcut for selected system calls where the kernel maps the system call's code into each process&#8217;s address space so that&nbsp;it can be executed like a normal function call, avoiding kernel entry.</p><p>So, we can create a benchmark that measures the time taken to execute <code>clock_gettime</code> in user space using vDSO and compare it against the time taken inside the kernel using the&nbsp;<a href="https://man7.org/linux/man-pages/man2/syscall.2.html">syscall</a>&nbsp;interface. The following code shows the benchmarking program. </p><pre><code>#define _GNU_SOURCE
#include &lt;sys/syscall.h&gt;
#include &lt;unistd.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;
#include &lt;time.h&gt;
#include &lt;x86intrin.h&gt;

int main() {
  const int ITERS = 100000;
  uint32_t cpuid;
  struct timespec ts;
  
  // Warm up both syscall and libc versions
  for (int i = 0; i &lt; 10000; i++) {
    syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &amp;ts);
    clock_gettime(CLOCK_MONOTONIC, &amp;ts);
  }

  // Test 1: Direct syscall interface
  _mm_lfence();
  uint64_t start1 = __rdtsc();
  long sink1 = 0;
  for (int i = 0; i &lt; ITERS; i++) {
    long ret = syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &amp;ts);
    sink1 += ret + ts.tv_sec + ts.tv_nsec; // use the results to prevent optimization
  }
  uint64_t end1 = __rdtscp(&amp;cpuid);
  _mm_lfence();

  // Test 2: libc clock_gettime
  _mm_lfence();
  uint64_t start2 = __rdtsc();
  long sink2 = 0;
  for (int i = 0; i &lt; ITERS; i++) {
    int ret = clock_gettime(CLOCK_MONOTONIC, &amp;ts);
    sink2 += ret + ts.tv_sec + ts.tv_nsec; // use the results to prevent optimization
  }
  uint64_t end2 = __rdtscp(&amp;cpuid);
  _mm_lfence();

  // Prevent dead-code removal
  if (sink1 == 42 || sink2 == 42) fprintf(stderr, "x\n");

  double cycles_per_syscall = (double)(end1 - start1) / ITERS;
  double cycles_per_libc = (double)(end2 - start2) / ITERS;
  
  printf("Direct syscall cycles per call ~ %.1f\n", cycles_per_syscall);
  printf("Libc wrapper cycles per call ~ %.1f\n", cycles_per_libc);
  printf("Difference ~ %.1f cycles (%.1f%% %s)\n", 
         cycles_per_libc - cycles_per_syscall,
         100.0 * (cycles_per_libc - cycles_per_syscall) / cycles_per_syscall,
         cycles_per_libc &gt; cycles_per_syscall ? "slower" : "faster");
  
  return 0;
}
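
// Build and run, assuming the source file is saved as
// clock_gettime_comparison.c (the name used in the output below):
//   cc -O2 -o clock_gettime_comparison clock_gettime_comparison.c
//   ./clock_gettime_comparison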
</code></pre><blockquote><p><strong>A note on rdtsc</strong>: Normally, you would use <code>clock_gettime()</code> to measure timings. But here we are benchmarking <code>clock_gettime()</code> itself, so we need something more precise. <code>rdtsc</code> is an x86 instruction that reads the value of a 64&#8209;bit timestamp counter (TSC) in the CPU. This counter ticks at a fixed frequency (e.g. 2.3 GHz on my machine). By measuring its value before and after, we can know how many cycles an operation took.</p></blockquote><p>The program produces the following output on my laptop:</p><pre><code>&#10140; ./clock_gettime_comparison 
Direct syscall cycles per call ~ 1428.8
Libc wrapper cycles per call ~ 157.0
Difference ~ -1271.9 cycles (-89.0% faster)</code></pre><p>The vDSO version is an order of magnitude faster, showing how costly the syscall entry/exit path is compared to a plain function call. </p><blockquote><p><em>We should take this estimate with a grain of salt because in the benchmark we are measuring inside a loop, and the performance of the loop itself can suffer from the indirect side&#8209;effects of entering and exiting the kernel, which is our next topic.</em></p></blockquote><p>While this benchmark isolates direct overhead, real&#8209;world performance also suffers from indirect costs due to CPU microarchitectural effects. Let&#8217;s explore those next.</p><h2>Indirect Overhead of System Calls</h2><p>System calls also incur indirect costs because the kernel&#8217;s entry path disturbs the CPU&#8217;s microarchitectural state. The loss of this state can cause a transient degradation in the performance of the user space code after the system call returns.</p><p>At the microarchitecture level, the CPU implements several optimizations such as instruction pipelining, superscalar execution, and branch prediction. These are designed to improve the instruction throughput of the program, i.e., how many instructions the CPU can execute each cycle. A higher throughput means faster program execution.</p><p>It can take a few cycles for the CPU to get to a steady state where these optimizations start to pay off, but making system calls can lead to the loss of this state and a drop in the performance of the program.</p><p>We will cover the indirect costs of system calls by discussing the different components of the microarchitecture that are impacted, starting with the instruction pipeline, followed by the branch predictor buffers.</p><h3>Effect on the Instruction Pipeline</h3><p>We didn&#8217;t see any code in the Linux kernel which touches the instruction pipeline; rather, this is done by the CPU itself. 
Before switching to kernel mode, the CPU drains the instruction pipeline to ensure that the user space code does not interfere when the kernel code executes. This impacts the performance of the user space code when the system call returns. To understand how, we need to revisit the basics of instruction pipelining.</p><p>CPUs have multiple execution resources, such as registers, execution units, and load and store buffers. To use all of these effectively, the CPU must execute multiple program instructions in parallel; this is made possible through instruction pipelining and superscalar architecture.</p><p>Instruction pipelining breaks down the execution of an instruction into several stages, like the assembly pipeline in a factory. An instruction moves from one stage to the next in each CPU cycle, enabling the CPU to start executing one new instruction each cycle. </p><p>For example, the following diagram shows a 5-stage pipeline. You can see that it takes five cycles for the pipeline to fill completely and for the first instruction to retire. After this point, the pipeline is in a steady state, and it can provide a throughput of one instruction per cycle. This is a very simplistic example; modern x86 processors have much deeper pipelines, e.g. 20-30 stages. 
</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gGGF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cbbeb2e-ed04-4f44-bb88-683713a2a740_607x133.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gGGF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cbbeb2e-ed04-4f44-bb88-683713a2a740_607x133.png 424w, https://substackcdn.com/image/fetch/$s_!gGGF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cbbeb2e-ed04-4f44-bb88-683713a2a740_607x133.png 848w, https://substackcdn.com/image/fetch/$s_!gGGF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cbbeb2e-ed04-4f44-bb88-683713a2a740_607x133.png 1272w, https://substackcdn.com/image/fetch/$s_!gGGF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cbbeb2e-ed04-4f44-bb88-683713a2a740_607x133.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gGGF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cbbeb2e-ed04-4f44-bb88-683713a2a740_607x133.png" width="607" height="133" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7cbbeb2e-ed04-4f44-bb88-683713a2a740_607x133.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:133,&quot;width&quot;:607,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Example of a simple 5-stage instruction pipeline 
(Fetch, Decode, Memory Read, ALU, Memory Write), showing how multiple instructions overlap in execution across cycles.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Example of a simple 5-stage instruction pipeline (Fetch, Decode, Memory Read, ALU, Memory Write), showing how multiple instructions overlap in execution across cycles." title="Example of a simple 5-stage instruction pipeline (Fetch, Decode, Memory Read, ALU, Memory Write), showing how multiple instructions overlap in execution across cycles." srcset="https://substackcdn.com/image/fetch/$s_!gGGF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cbbeb2e-ed04-4f44-bb88-683713a2a740_607x133.png 424w, https://substackcdn.com/image/fetch/$s_!gGGF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cbbeb2e-ed04-4f44-bb88-683713a2a740_607x133.png 848w, https://substackcdn.com/image/fetch/$s_!gGGF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cbbeb2e-ed04-4f44-bb88-683713a2a740_607x133.png 1272w, https://substackcdn.com/image/fetch/$s_!gGGF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cbbeb2e-ed04-4f44-bb88-683713a2a740_607x133.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Example of a simple 5-stage instruction pipeline (Fetch, Decode, Memory Read, ALU, Memory Write), showing how multiple instructions overlap in execution across cycles.</figcaption></figure></div><p>Modern processors are also superscalar. 
They have multiple such pipelines to issue and execute multiple new instructions each cycle. For example, a 4-wide processor can start executing up to 4 new instructions each cycle and retire up to 4 instructions each cycle. If such a CPU has a pipeline depth of 20, then it can have up to 80 instructions in flight in a steady state.</p><p>This means that the processor is normally busy executing dozens of user-space instructions in parallel. But when a system call occurs, the CPU must first ensure all pending user instructions finish before it can jump into the kernel.</p><p>So, when the system call returns to user space, you can imagine that the instruction pipeline is almost empty because the CPU did not allow the instructions following <code>syscall</code> to enter the pipeline. At this point the pipeline has to start almost from scratch, and it can take a while until the pipeline reaches a steady throughput again.</p><p>Contrast this with the scenario where no system call occurs: the CPU remains in its steady state, pipelines stay full, and instruction throughput stays high. In other words, a single system call can derail the momentum of dozens of in&#8209;flight instructions.</p><blockquote><p>On x86-64, the <a href="https://www.felixcloutier.com/x86/syscall">syscall instruction</a> is used to execute a system call. 
The Intel manual has this note about it:</p><p><em>&#8220;<strong>Instruction ordering:</strong> Instructions following a SYSCALL may be fetched from memory before earlier instructions complete execution, but they will not execute (even speculatively) until all instructions prior to the SYSCALL have completed execution (the later instructions may execute before data stored by the earlier instructions have become globally visible).&#8221;</em></p><p>This confirms that the CPU drains the pipeline before transferring control to the kernel.</p></blockquote><h3>Effect on Branch Prediction</h3><p>The next major indirect impact system calls have on user space performance is through the clearing of the branch predictor buffers. These correspond to the three mitigations we saw the kernel apply in the entry code above:</p><ul><li><p>Clearing the branch history buffer</p></li><li><p>Untraining the return stack buffer</p></li><li><p>Enabling/disabling IBRS</p></li></ul><p>The first two of these have a profound indirect impact on user code performance. The enabling/disabling of IBRS does not impact user space performance; rather, it only adds direct overhead to syscall execution. However, I will discuss it here because logically it goes with the topic of branch prediction. In this section, we will first review branch prediction and then talk about each of these.</p><h3>Understanding Branch Prediction</h3><p>Instruction pipelining and superscalar execution enable CPUs to execute multiple instructions in parallel, and they execute these instructions out-of-order.</p><p>When the CPU comes across a branching instruction, such as an if condition, it may not know the outcome of the condition because the instructions that compute it may still be executing. 
If the CPU waits for those instructions to finish to know the branch outcome, the pipeline can stall for a long time, which means poor performance.</p><p>To avoid this, CPUs come with a feature called the branch predictor, which predicts the target address of these branches based on past branching patterns. This enables the CPU to speculatively execute instructions from the predicted address and stay busy. If the prediction turns out to be correct, then the CPU saves a lot of cycles and instruction throughput remains high.</p><p>However, when the prediction is wrong, the CPU has to discard the results of these speculatively executed instructions, flush the instruction pipeline, and fetch the instructions from the right address. This can cost 20-30 cycles on modern CPUs (depending on the depth of the pipeline).</p><h3>Clearing the Branch History Buffer</h3><p>We saw in the kernel code that it invokes the macro <code>CLEAR_BRANCH_HISTORY</code>, which clears the branch history buffer (BHB).</p><p>The BHB is a buffer in the branch predictor that learns branching history patterns at a global level. This helps the branch predictor predict the outcomes of deeply nested and complex branching patterns more accurately. You can think of it as remembering the last few intersections you passed to better predict where you&#8217;ll turn next.</p><p>But it can take a while for the BHB to collect enough history for the branch predictor to generate accurate predictions. So, if the kernel clears the BHB whenever you execute a system call, you lose all that state. As a result, your user space code may experience an increased rate of branch mispredictions after returning from the system call. This can significantly degrade the performance of user space applications.</p><blockquote><p><strong>Note on recent CPUs:</strong> This clearing of the BHB was added to the kernel as a mitigation against speculative execution attacks, such as Spectre V2. 
In recent years, CPU vendors have introduced hardware mitigations which obviate the need for the kernel to clear the BHB. For example, the Intel advisory says that if your CPU comes with the "<a href="https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/speculative-execution-side-channel-mitigations.html">enhanced IBRS</a>" feature (we discuss IBRS below), then there is no need to clear the BHB. So, not all CPUs suffer degraded performance due to this.</p><p>If you want to check whether your kernel clears the BHB, you can check the <a href="https://man7.org/linux/man-pages/man1/lscpu.1.html">lscpu</a> output. If you see &#8220;<code>BHI SW loop</code>&#8221; in the vulnerability section, it means that the kernel clears the BHB during system calls.</p><p>Also, if you believe that you will never execute untrusted code, you can manually disable the mitigation through a boot-time flag. </p></blockquote><h3>Untraining the Return Stack Buffer</h3><p>Next in line is the untraining of the return stack buffer (RSB). The RSB is another buffer in the branch predictor that is used to predict the return address of function calls.</p><p>But why does it need to predict the return address? It again comes down to out-of-order execution. The CPU may want to execute the return instruction even though other instructions of the function may still be executing. At this point, the CPU does not know the return address. The return address is stored on the process&#8217;s stack memory, but accessing memory is slow. So, the CPU uses the RSB to predict the return address.</p><p>On every function call, the CPU pushes the return address into the RSB. While executing the return instruction, the CPU pops this buffer and jumps to that address. Because this buffer is right in the CPU, it is very fast to access.</p><p>However, this also led to vulnerabilities such as <a href="https://en.wikipedia.org/wiki/Retbleed">Retbleed</a>. 
In this attack, carefully chosen user&#8209;space code could influence how the CPU predicted kernel return addresses, so that the CPU speculatively executed instructions at the wrong place inside the kernel. While this speculative execution did not change the actual kernel logic, it could leak information through side&#8209;channels. To prevent this, the kernel untrains the RSB on kernel entry.</p><p>Untraining the RSB impacts the performance of user space code when the system call returns, because the RSB no longer holds its state. Without a trained RSB, the CPU falls back to a slower indirect branch predictor, which has a higher chance of making a mistake.</p><blockquote><p><strong>Note on CPUs Impacted</strong>: The kernel does not clear the RSB for all CPU models. The vulnerabilities that require clearing the RSB (Retbleed and <a href="https://docs.kernel.org/admin-guide/hw-vuln/srso.html">SRSO</a>) have only been known to impact AMD CPUs. Also, if your CPU has hardware mitigations, such as enhanced IBRS, then the kernel skips this step (the <code>UNTRAIN_RET</code> macro becomes a no-op on such devices).</p><p>Again, the kernel allows you to disable the mitigation, but do this only when you are sure that you will never run untrusted code.</p></blockquote><h3>IBRS Entry and Exit</h3><p>Finally, let&#8217;s talk about indirect branch restricted speculation (IBRS). We saw that the kernel executes <code>IBRS_ENTER</code> on entering the syscall and <code>IBRS_EXIT</code> while returning. So, what is IBRS and what is its impact on performance?</p><p>IBRS is a hardware feature that restricts the indirect branch predictor when executing in kernel mode. Effectively, it prevents user space training of the indirect branch predictor from having any effect on indirect branch prediction inside the kernel.</p><p>Indirect branches are those branches in code where the target address is not part of the instruction but is known only at runtime. 
A common example is calling through a function pointer in C (e.g., <code>(*fp)()</code>), where the actual target depends on which function the pointer holds at that moment. Another example is a virtual function call in C++ or a jump table generated for a large switch statement. In all these cases, the CPU can use the indirect branch predictor to guess the likely target address based on past branching history.</p><p>When Spectre and related vulnerabilities were found, one of the attack vectors involved tricking the CPU into mispredicting indirect branch targets inside the kernel. By influencing the branch predictor state from user space, attackers could cause the CPU to speculatively execute instructions at unintended locations in the kernel. This could leak sensitive kernel data through side-channels such as the cache.</p><p>The mitigation for this attack is to restrict the indirect branch predictor when executing in kernel mode via the IBRS mechanism. Enabling and disabling IBRS itself doesn&#8217;t have any impact on the performance of user space code, but executing extra instructions to do this during each system call adds overhead.</p><p>However, recent CPUs have a feature called enhanced IBRS, which automatically enables IBRS when switching to kernel mode. On such devices, the <code>IBRS_ENTER</code> and <code>IBRS_EXIT</code> macros in the kernel become no-ops.</p><div><hr></div><p>Together, these mitigations explain why the indirect cost of system calls can vary significantly across CPU generations and configurations. In practice, this means a single system call can not only drain the pipeline but also leave the branch predictor partially blind, forcing the CPU to relearn patterns and slowing down your code until it recovers. The important point is that the true cost of a system call is not just the handful of instructions executed in the kernel, but also the disruption it causes to the CPU&#8217;s optimizations. 
This is why system calls are far more expensive than they look on the surface, and why minimizing them can be such a powerful optimization strategy. However, CPU vendors are gradually adding hardware mitigations that make these software-based mitigations obsolete and reduce the performance overhead.</p><div><hr></div><h2>Practical Ways to Reduce System Calls</h2><p>So what can you do as a developer? A few practical ideas:</p><ul><li><p><strong>Use vDSO</strong>: For calls like <code>clock_gettime</code>, prefer the vDSO path to avoid kernel entry.</p></li><li><p><strong>Cache cheap values</strong>: Some values obtained through system calls rarely change during a program&#8217;s lifetime. If you can safely cache them once and reuse them, you can avoid repeated system calls.</p></li><li><p><strong>Optimize I/O System Calls</strong>: There are various strategies and patterns that you can use to optimize I/O-related system calls. For example:</p><ul><li><p>Prefer buffered I/O instead of raw read/write system calls</p></li><li><p>Use scatter/gather operations like <code>readv</code>/<code>writev</code> to batch multiple buffers</p></li><li><p>If your system allows, use <code>mmap</code> instead of repeated read/write calls.</p></li></ul></li><li><p><strong>Batch operations</strong>: Interfaces like <a href="https://man7.org/linux/man-pages/man7/io_uring.7.html">io_uring</a> let you submit many I/O requests to a shared queue in user space, which the kernel can then process in batches. This reduces the number of times your program needs to cross into the kernel.</p></li><li><p><strong>Push work into the kernel</strong>: With <a href="https://ebpf.io/">eBPF</a> it is increasingly possible to move parts of application logic into the kernel itself. Beyond traditional use cases like packet filtering, newer frameworks let you offload tasks such as policy enforcement, monitoring, and even parts of data processing. 
In these cases, instead of making repeated system calls, the user program loads small programs into the kernel that run directly when events occur, avoiding crossings altogether.</p></li></ul><p>None of these tricks are magic, but they all follow the same principle: fewer crossings means less disruption. Every time you avoid a system call, you&#8217;re saving not just a function call into the kernel, but also the hidden costs of the CPU recovering its state.</p><div><hr></div><h2>Wrapping Up</h2><p>We&#8217;ve gone through a lot of detail for what looks like just a small stretch of kernel code. The point is simple: the cost of a system call goes beyond the small number of instructions that execute in the kernel. It disrupts the CPU&#8217;s rhythm by draining pipelines, resetting predictors, and forcing everything to start fresh. That&#8217;s why they show up as hot spots in profiles and why people try so hard to avoid them.</p><p>The strategies we looked at earlier (vDSO, caching, optimizing I/O, batching with io_uring, and pushing work into the kernel) are all ways to cut down on this disruption. They won&#8217;t remove the cost of system calls entirely, but they can make the difference between code that spends most of its time waiting on the kernel and code that keeps the CPU running at full speed.</p><p>System calls are the interface to the kernel and the hardware. They are necessary, but they come at a cost. 
Understanding and managing that cost is a key part of writing faster software.</p>]]></content:encoded></item><item><title><![CDATA[How to Leverage the CPU’s Micro-Op Cache for Faster Loops]]></title><description><![CDATA[Measuring, analyzing, and optimizing loops using Linux perf, Top-Down Microarchitectural Analysis, and the CPU&#8217;s micro-op cache]]></description><link>https://blog.codingconfessions.com/p/how-to-leverage-the-cpus-micro-op</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/how-to-leverage-the-cpus-micro-op</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Fri, 15 Aug 2025 05:37:44 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/171026971/ae2b0064d356aecff075b296b8fcbac4.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Performance engineering can be deeply mysterious. Sometimes adding a line of code can make your program execute 2&#215; faster. These behaviors are impossible to explain unless you understand the processor microarchitecture and compiler optimization tricks.</p><p>In this video, I show how adding a single line of code to a slow-running program makes it run 2&#215; faster. You&#8217;ll see how this one change helped the compiler arrange instructions in memory so the CPU could fetch them from its <em>micro-op cache</em> instead of decoding them every time, a huge win for hot loops.</p><p>On Intel processors, this micro-op cache is known as the <strong>Decoded Stream Buffer (DSB)</strong>. 
It&#8217;s designed specifically to accelerate hot paths in your code by caching pre-decoded instructions, so the CPU can skip the expensive fetch/decode stages entirely. Understanding when and how the DSB kicks in is key to unlocking this kind of speedup.</p><p>If you&#8217;re curious about controlling the hardware and squeezing out every last ounce of performance, you should watch the video.</p><p>Along the way, we&#8217;ll cover:</p><ul><li><p>Measuring performance with <strong>Linux perf</strong></p></li><li><p>Using <strong>Top-Down Microarchitectural Analysis (TMA)</strong> to pinpoint hardware bottlenecks</p></li><li><p>Understanding what the DSB is and when it&#8217;s used</p></li><li><p>Forcing the compiler to take advantage of it with <strong>code alignment</strong> and <strong>profile-guided optimization</strong></p></li></ul><p>The result is a 2&#215; faster loop and a set of techniques that you can use for debugging and optimizing your own loops.</p><div><hr></div><h2><strong>What&#8217;s Next</strong></h2><p>In this video, I showed how one condition affects whether the processor can use the DSB, and fixing it cut the bottleneck roughly in half. But if you run a top-down analysis again, you&#8217;ll still see some DSB stalls. That&#8217;s because there are other conditions that also influence DSB usage. In the next video, I&#8217;ll dive into one of those remaining conditions and show how to eliminate more of the bottleneck. 
In the meantime, why don&#8217;t you experiment and see if you can identify and fix it yourself?</p>
      <p>
          <a href="https://blog.codingconfessions.com/p/how-to-leverage-the-cpus-micro-op">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Big O vs Hardware: Better Complexity ≠ Better Performance]]></title><description><![CDATA[Why Your O(log n) Algorithm Might Lose to O(n)]]></description><link>https://blog.codingconfessions.com/p/big-o-vs-hardware</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/big-o-vs-hardware</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Sun, 03 Aug 2025 18:37:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6Zs4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ed3f569-5b2a-4362-ab67-722e4b26fe8d_1536x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6Zs4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ed3f569-5b2a-4362-ab67-722e4b26fe8d_1536x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6Zs4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ed3f569-5b2a-4362-ab67-722e4b26fe8d_1536x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!6Zs4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ed3f569-5b2a-4362-ab67-722e4b26fe8d_1536x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6Zs4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ed3f569-5b2a-4362-ab67-722e4b26fe8d_1536x1024.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!6Zs4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ed3f569-5b2a-4362-ab67-722e4b26fe8d_1536x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6Zs4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ed3f569-5b2a-4362-ab67-722e4b26fe8d_1536x1024.jpeg" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ed3f569-5b2a-4362-ab67-722e4b26fe8d_1536x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:91083,&quot;alt&quot;:&quot;Cover image: Big O vs Hardware&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/169307156?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ed3f569-5b2a-4362-ab67-722e4b26fe8d_1536x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Cover image: Big O vs Hardware" title="Cover image: Big O vs Hardware" srcset="https://substackcdn.com/image/fetch/$s_!6Zs4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ed3f569-5b2a-4362-ab67-722e4b26fe8d_1536x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!6Zs4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ed3f569-5b2a-4362-ab67-722e4b26fe8d_1536x1024.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!6Zs4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ed3f569-5b2a-4362-ab67-722e4b26fe8d_1536x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6Zs4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ed3f569-5b2a-4362-ab67-722e4b26fe8d_1536x1024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Big O vs Hardware</figcaption></figure></div><p>In algorithm design, we often rely on time complexity to 
compare solutions. It tells us how the work done by an algorithm grows with the size of its input. But real-world performance also depends on how well the code runs on hardware.</p><p>In a previous article, we explored the <a href="https://blog.codingconfessions.com/p/one-law-to-rule-all-code-optimizations">Iron Law of performance</a>, which states that a program&#8217;s performance on the hardware depends on two factors:</p><ul><li><p><strong>Instruction count:</strong> fewer instructions usually mean faster execution.</p></li><li><p><strong>Instructions per cycle (IPC):</strong> the more instructions a CPU can retire per cycle, the better.</p></li></ul><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Performance} \\propto \\frac{IPC}{\\text{Instruction count} \\times {\\text{Clock Cycle Time}}}&quot;,&quot;id&quot;:&quot;SLXDPJTSNQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>By definition, algorithms with better time complexity tend to do less work. But they don&#8217;t always perform better if that work is harder for the CPU. A lower instruction count doesn&#8217;t help if each instruction is expensive or slows down IPC.</p><p>In this article, we&#8217;ll see a concrete example of this tradeoff. We&#8217;ll compare three algorithms for computing the greatest common divisor (GCD), study their time complexities, benchmark their real-world performance, and use the Iron Law to understand what&#8217;s really going on. 
As we will see, a better time complexity is not a guarantee of better performance; a hardware-friendly implementation also matters.</p><div><hr></div><h2>Euclid&#8217;s Subtraction-based Algorithm for Computing GCD</h2><p>The first algorithm we will study is Euclid's subtraction-based algorithm for computing the GCD of two integers.</p><p>The GCD of two integers a and b is defined as the greatest integer that divides both a and b. For example, the GCD of 12 and 8 is 4. Euclid's subtraction-based algorithm for computing this is shown in the following code block.</p><pre><code>long gcd(long a, long b)
{
    while (a != b) {
        if (a &gt; b) {
            a -= b;
        } else {
            b -= a;
        }
    }
    return a;
}</code></pre><p>The algorithm keeps removing the smaller number from the larger one until both of them become equal. Let&#8217;s trace an example with a = 84, b = 18.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Rt9A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde92ecfd-2bf3-4af4-9667-86a15bf1c9a8_734x308.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Rt9A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde92ecfd-2bf3-4af4-9667-86a15bf1c9a8_734x308.png 424w, https://substackcdn.com/image/fetch/$s_!Rt9A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde92ecfd-2bf3-4af4-9667-86a15bf1c9a8_734x308.png 848w, https://substackcdn.com/image/fetch/$s_!Rt9A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde92ecfd-2bf3-4af4-9667-86a15bf1c9a8_734x308.png 1272w, https://substackcdn.com/image/fetch/$s_!Rt9A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde92ecfd-2bf3-4af4-9667-86a15bf1c9a8_734x308.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Rt9A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde92ecfd-2bf3-4af4-9667-86a15bf1c9a8_734x308.png" width="734" height="308" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de92ecfd-2bf3-4af4-9667-86a15bf1c9a8_734x308.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:308,&quot;width&quot;:734,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:31711,&quot;alt&quot;:&quot;A trace of the steps taken by the subtraction-based GCD algorithm for a=84, b=18&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/169307156?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde92ecfd-2bf3-4af4-9667-86a15bf1c9a8_734x308.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A trace of the steps taken by the subtraction-based GCD algorithm for a=84, b=18" title="A trace of the steps taken by the subtraction-based GCD algorithm for a=84, b=18" srcset="https://substackcdn.com/image/fetch/$s_!Rt9A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde92ecfd-2bf3-4af4-9667-86a15bf1c9a8_734x308.png 424w, https://substackcdn.com/image/fetch/$s_!Rt9A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde92ecfd-2bf3-4af4-9667-86a15bf1c9a8_734x308.png 848w, https://substackcdn.com/image/fetch/$s_!Rt9A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde92ecfd-2bf3-4af4-9667-86a15bf1c9a8_734x308.png 1272w, https://substackcdn.com/image/fetch/$s_!Rt9A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde92ecfd-2bf3-4af4-9667-86a15bf1c9a8_734x308.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">A trace of the steps taken by the subtraction-based GCD algorithm for a=84, b=18</figcaption></figure></div><p>As you can see in the above table, the algorithm converges in 7 steps.</p><p>Now, let&#8217;s think about the worst case time complexity of this algorithm. At each step the algorithm converges towards the GCD value by a step size that is equal to the difference between a and b. The smallest step possible is 1 when either <code>a=1 </code>or <code>b=1</code>. 
In that case, the algorithm will take as many steps as <code>max(a, b)</code>, giving us a worst case time complexity of <code>O(max(a, b))</code>.</p><p>In simpler terms, the number of steps grows linearly with the size of the inputs. When a and b are close to each other, the algorithm converges quickly, but when they are far apart, it takes a large number of steps.</p><p>This subtraction-based approach works, but there's a version that converges faster using division instead of repeated subtraction. Let&#8217;s take a look at that.</p><h2>The Modulo-based Euclidean Algorithm for GCD</h2><p>A more efficient variation of the Euclidean algorithm replaces repeated subtraction with division, reducing the number of steps required. The following snippet shows the code.</p><pre><code>long gcd(long a, long b)
{
    while (b != 0) {
        long t = b;
        b = a % b;
        a = t;
    }
    return a;
}</code></pre><p>We can trace the same input to see how this version behaves.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K6bT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f7cff31-288b-4c9d-bc3a-df723afe1c46_734x212.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K6bT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f7cff31-288b-4c9d-bc3a-df723afe1c46_734x212.png 424w, https://substackcdn.com/image/fetch/$s_!K6bT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f7cff31-288b-4c9d-bc3a-df723afe1c46_734x212.png 848w, https://substackcdn.com/image/fetch/$s_!K6bT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f7cff31-288b-4c9d-bc3a-df723afe1c46_734x212.png 1272w, https://substackcdn.com/image/fetch/$s_!K6bT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f7cff31-288b-4c9d-bc3a-df723afe1c46_734x212.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K6bT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f7cff31-288b-4c9d-bc3a-df723afe1c46_734x212.png" width="734" height="212" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6f7cff31-288b-4c9d-bc3a-df723afe1c46_734x212.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:212,&quot;width&quot;:734,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:20074,&quot;alt&quot;:&quot;A trace of the steps taken by the module-based GCD algorithm for a=84, b=18&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/169307156?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f7cff31-288b-4c9d-bc3a-df723afe1c46_734x212.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A trace of the steps taken by the module-based GCD algorithm for a=84, b=18" title="A trace of the steps taken by the module-based GCD algorithm for a=84, b=18" srcset="https://substackcdn.com/image/fetch/$s_!K6bT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f7cff31-288b-4c9d-bc3a-df723afe1c46_734x212.png 424w, https://substackcdn.com/image/fetch/$s_!K6bT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f7cff31-288b-4c9d-bc3a-df723afe1c46_734x212.png 848w, https://substackcdn.com/image/fetch/$s_!K6bT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f7cff31-288b-4c9d-bc3a-df723afe1c46_734x212.png 1272w, https://substackcdn.com/image/fetch/$s_!K6bT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f7cff31-288b-4c9d-bc3a-df723afe1c46_734x212.png 1456w" sizes="100vw" 
loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">A trace of the steps taken by the modulo-based GCD algorithm for a=84, b=18</figcaption></figure></div><p>In this case, the algorithm converges in just three steps, compared to the seven taken by the subtraction-based algorithm. It is easy to see that at each step the algorithm converges towards the solution by performing a division between a and b, which leads to a worst case time complexity of <code>O(log(max(a, b)))</code>.</p><p>These time complexities give us a theoretical bound on how these algorithms scale, but actual performance can only be measured by running them on real hardware. Let&#8217;s run both algorithms on large inputs to see how their time complexity translates to actual performance.</p><h2>Benchmark #1: Huge Inputs</h2><p>As an example, let&#8217;s compare the performance of the two algorithms on the input <code>a=1000000000</code> and <code>b=9223372036854775503</code>. The following figure shows the timing and other high-level performance metrics using the <a href="https://perfwiki.github.io/main/">Linux perf</a> tool.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Dnmq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ab87db-5788-48c5-b59d-24e3f0f7a9e6_945x520.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Dnmq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ab87db-5788-48c5-b59d-24e3f0f7a9e6_945x520.png 424w, 
https://substackcdn.com/image/fetch/$s_!Dnmq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ab87db-5788-48c5-b59d-24e3f0f7a9e6_945x520.png 848w, https://substackcdn.com/image/fetch/$s_!Dnmq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ab87db-5788-48c5-b59d-24e3f0f7a9e6_945x520.png 1272w, https://substackcdn.com/image/fetch/$s_!Dnmq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ab87db-5788-48c5-b59d-24e3f0f7a9e6_945x520.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Dnmq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ab87db-5788-48c5-b59d-24e3f0f7a9e6_945x520.png" width="945" height="520" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62ab87db-5788-48c5-b59d-24e3f0f7a9e6_945x520.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:520,&quot;width&quot;:945,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:85908,&quot;alt&quot;:&quot;perf stat output for the subtraction-based GCD algorithm for the input: a=1000000000 and b=9223372036854775503&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/169307156?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ab87db-5788-48c5-b59d-24e3f0f7a9e6_945x520.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="perf stat output for the subtraction-based GCD algorithm for the input: a=1000000000 and b=9223372036854775503" 
title="perf stat output for the subtraction-based GCD algorithm for the input: a=1000000000 and b=9223372036854775503" srcset="https://substackcdn.com/image/fetch/$s_!Dnmq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ab87db-5788-48c5-b59d-24e3f0f7a9e6_945x520.png 424w, https://substackcdn.com/image/fetch/$s_!Dnmq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ab87db-5788-48c5-b59d-24e3f0f7a9e6_945x520.png 848w, https://substackcdn.com/image/fetch/$s_!Dnmq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ab87db-5788-48c5-b59d-24e3f0f7a9e6_945x520.png 1272w, https://substackcdn.com/image/fetch/$s_!Dnmq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ab87db-5788-48c5-b59d-24e3f0f7a9e6_945x520.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">perf stat output for the subtraction-based GCD algorithm for the input: <code>a=1000000000</code> and <code>b=9223372036854775503</code></figcaption></figure></div><p>The subtraction-based algorithm took 633,437,507 steps and ran in 2,230.71 milliseconds, consuming 9.28 billion CPU cycles. It executed 55.35 billion instructions at a rate of 5.96 instructions per cycle.</p><p>Now, let&#8217;s run the modulo-based algorithm and see how that performs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Xjjy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1906da83-3b72-4825-bb5f-8ea0e7cf1023_945x545.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Xjjy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1906da83-3b72-4825-bb5f-8ea0e7cf1023_945x545.png 424w, https://substackcdn.com/image/fetch/$s_!Xjjy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1906da83-3b72-4825-bb5f-8ea0e7cf1023_945x545.png 848w, 
https://substackcdn.com/image/fetch/$s_!Xjjy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1906da83-3b72-4825-bb5f-8ea0e7cf1023_945x545.png 1272w, https://substackcdn.com/image/fetch/$s_!Xjjy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1906da83-3b72-4825-bb5f-8ea0e7cf1023_945x545.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Xjjy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1906da83-3b72-4825-bb5f-8ea0e7cf1023_945x545.png" width="945" height="545" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1906da83-3b72-4825-bb5f-8ea0e7cf1023_945x545.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:545,&quot;width&quot;:945,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:81501,&quot;alt&quot;:&quot;perf stat output for the modulo-based GCD algorithm for the input: a=1000000000 and b=9223372036854775503&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/169307156?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1906da83-3b72-4825-bb5f-8ea0e7cf1023_945x545.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="perf stat output for the modulo-based GCD algorithm for the input: a=1000000000 and b=9223372036854775503" title="perf stat output for the modulo-based GCD algorithm for the input: a=1000000000 and b=9223372036854775503" 
srcset="https://substackcdn.com/image/fetch/$s_!Xjjy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1906da83-3b72-4825-bb5f-8ea0e7cf1023_945x545.png 424w, https://substackcdn.com/image/fetch/$s_!Xjjy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1906da83-3b72-4825-bb5f-8ea0e7cf1023_945x545.png 848w, https://substackcdn.com/image/fetch/$s_!Xjjy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1906da83-3b72-4825-bb5f-8ea0e7cf1023_945x545.png 1272w, https://substackcdn.com/image/fetch/$s_!Xjjy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1906da83-3b72-4825-bb5f-8ea0e7cf1023_945x545.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">perf stat output for the modulo-based GCD algorithm for the input: a=1000000000 and b=9223372036854775503</figcaption></figure></div><p>As expected, the modulo-based algorithm is dramatically faster, converging in just 22 steps. The CPU executes this in just 0.28 milliseconds, roughly 8,000x faster than the subtraction-based algorithm. It executed only about a million instructions over about a million cycles.</p><p>This is algorithmic efficiency in action. However, it is not the whole story. The performance of these algorithms also depends on the efficiency of the underlying hardware-level operations. Let&#8217;s talk about that for a moment.</p><h2>Cost of Integer Add vs Integer Division in Hardware</h2><p>Time complexity tells us how the CPU workload scales with input size. For these two algorithms, that work consists of subtraction and modulo operations. Inside the CPU, subtraction is handled by the same execution units that perform addition, while modulo translates into integer division, which is handled by a separate execution unit.</p><p>So, the performance of these algorithms also depends on the efficiency of these fundamental operations. When talking about the efficiency of CPU instructions like these, we usually care about two aspects:</p><ul><li><p><strong>Latency</strong>: The number of cycles the CPU takes to execute the instruction. The lower the latency, the better.</p></li><li><p><strong>Throughput</strong>: The number of such instructions the CPU can execute each cycle. A processor may be able to execute more than one instruction of a certain type per cycle. Some instructions are pipelined. 
Even if one takes multiple cycles to finish, the CPU can begin executing another of the same type in the meantime. In addition, the processor may have multiple execution ports that can execute the operation in parallel.</p></li></ul><p>On Intel Skylake processors, an add instruction has a latency of 1 cycle and a throughput of 4 operations per cycle because the hardware has four execution ports capable of performing integer addition. Compilers often take advantage of this high throughput by unrolling loops that consist mainly of addition operations.</p><p>On the other hand, integer division is very expensive. On Intel Skylake, it has a latency of 42-95 cycles. Unlike integer addition, there is only one execution port capable of performing integer division, so the CPU can start only one integer division every 24-90 cycles (as per <a href="https://www.agner.org/optimize/">Agner Fog&#8217;s optimization manual</a>).</p><p>This contrast in the performance of these operations brings the Iron Law into the picture. The subtraction-based algorithm will always execute a higher number of instructions, but it will also have a better IPC. On the other hand, the modulo-based algorithm will execute fewer instructions, but it will have a much lower IPC. Performance depends on the tradeoff between instruction count and IPC. Let&#8217;s do another benchmark to see this play out.</p><h2>Benchmark #2: Small Inputs</h2><p>In this next benchmark, we use inputs that are close in value to observe how IPC can dominate when instruction counts are similar. 
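</p><p>For reference, here are minimal C sketches of the two algorithms as described above (reconstructed from the description; the post&#8217;s exact benchmarked code may differ):</p><pre><code>// Subtraction-based GCD: one subtraction per step.
long gcd_sub(long a, long b)
{
    while (b != 0) {
        if (a > b)
            a = a - b;
        else
            b = b - a;
    }
    return a;
}

// Modulo-based (Euclidean) GCD: one division per step.
long gcd_mod(long a, long b)
{
    while (b != 0) {
        long r = a % b;
        a = b;
        b = r;
    }
    return a;
}</code></pre><p>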
The following table summarizes the performance of these two algorithms for the input where <code>a=130000</code>, <code>b=13</code>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6xJz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5265d1d7-7631-4805-8c2a-5a40088bb7ab_852x260.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6xJz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5265d1d7-7631-4805-8c2a-5a40088bb7ab_852x260.png 424w, https://substackcdn.com/image/fetch/$s_!6xJz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5265d1d7-7631-4805-8c2a-5a40088bb7ab_852x260.png 848w, https://substackcdn.com/image/fetch/$s_!6xJz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5265d1d7-7631-4805-8c2a-5a40088bb7ab_852x260.png 1272w, https://substackcdn.com/image/fetch/$s_!6xJz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5265d1d7-7631-4805-8c2a-5a40088bb7ab_852x260.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6xJz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5265d1d7-7631-4805-8c2a-5a40088bb7ab_852x260.png" width="852" height="260" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5265d1d7-7631-4805-8c2a-5a40088bb7ab_852x260.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:260,&quot;width&quot;:852,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:33708,&quot;alt&quot;:&quot;A side-by-side comparison of the subtraction and modulo based GCD algorithms for the input: a=130000, b=13. The numbers were obtained using perf stat&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/169307156?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5265d1d7-7631-4805-8c2a-5a40088bb7ab_852x260.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A side-by-side comparison of the subtraction and modulo based GCD algorithms for the input: a=130000, b=13. The numbers were obtained using perf stat" title="A side-by-side comparison of the subtraction and modulo based GCD algorithms for the input: a=130000, b=13. 
The numbers were obtained using perf stat" srcset="https://substackcdn.com/image/fetch/$s_!6xJz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5265d1d7-7631-4805-8c2a-5a40088bb7ab_852x260.png 424w, https://substackcdn.com/image/fetch/$s_!6xJz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5265d1d7-7631-4805-8c2a-5a40088bb7ab_852x260.png 848w, https://substackcdn.com/image/fetch/$s_!6xJz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5265d1d7-7631-4805-8c2a-5a40088bb7ab_852x260.png 1272w, https://substackcdn.com/image/fetch/$s_!6xJz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5265d1d7-7631-4805-8c2a-5a40088bb7ab_852x260.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">A side-by-side comparison of the subtraction and modulo based GCD algorithms for the input: a=130000, b=13. The numbers were obtained using perf stat</figcaption></figure></div><p>The subtraction-based algorithm executes 9,999 steps, while the modulo-based algorithm converges in a single step. Despite the vastly higher step count, the subtraction-based algorithm finishes about 1 millisecond faster.</p><p>If we apply the Iron Law lens, we can see that the subtraction-based algorithm executed a slightly higher number of instructions, but it also had a slightly better IPC, which tipped the performance in its favor.</p><p>This isn&#8217;t to say algorithmic complexity doesn&#8217;t matter; at large scales, it absolutely does. But many workloads never hit those scales, and the implementation needs to take that into account. For example, many sorting routines use quicksort for large arrays, but switch to insertion sort for smaller sizes (e.g. fewer than 5 elements) because it&#8217;s simpler and faster in that regime. A similar strategy can be employed here: when the two values are close, the subtraction-based algorithm can be preferred, while for values that are far apart, the modulo-based algorithm can be used.</p><p>An alternative to switching between algorithms is to use an implementation that&#8217;s inherently hardware-friendly. In 1967, Stein designed such an algorithm, which takes advantage of the binary representation of integers and leverages bit shift operations that are very fast at the hardware level. 
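</p><p>As a quick illustration (a generic snippet with a name of our choosing, not from the original post), dividing by a power of two compiles down to a single right shift:</p><pre><code>// Returns the odd part of 48 using only shifts; no division instruction needed.
unsigned long odd_part_demo(void)
{
    unsigned long x = 48;
    x >>= 1;   // x = 24: same result as x / 2, but a 1-cycle shift
    x >>= 3;   // x = 3:  same result as dividing 24 by 8
    return x;
}</code></pre><p>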
Let&#8217;s first understand how it works before comparing performance.</p><h2>Stein&#8217;s Binary Algorithm for GCD</h2><p>Stein designed this algorithm based on the following observations about GCD computation:</p><ol><li><p><code>gcd(a, b) = gcd(b, a)</code></p></li><li><p><code>gcd(0, b) = b</code>, and <code>gcd(a, 0) = a</code></p></li><li><p><code>gcd(2a, 2b) = 2 * gcd(a, b)</code>, i.e., if a and b are both even, we can compute the GCD of their halves and then multiply the result by 2.</p></li><li><p><code>gcd(a, 2b) = gcd(a, b)</code>, i.e., if a is odd, then 2 is not a common divisor and can be dropped from the even operand.</p></li><li><p><code>gcd(a, b) = gcd(a, b - a)</code> when a and b are odd and <code>a &lt; b</code> (and since both are odd, <code>b - a</code> is even, so observation 4 applies next).</p></li></ol><p>These observations lead to a recursive algorithm that reduces the inputs until <code>b == 0</code>. However, unlike the modulo-based algorithm, this algorithm can be highly optimized for real hardware. For example, most implementations use the following optimization tricks.</p><ul><li><p>Instead of recursion, iteration is preferred.</p></li><li><p>Every step of the loop has to reduce a and b to odd values, which requires repeated division by 2. Division by 2 can be done with a right bit shift, which is much faster than an actual division.</p></li><li><p>The algorithm needs to check whether the numbers are odd. Dividing by 2 and checking the remainder is inefficient because division is slow and introduces branches. A more efficient trick is to ensure that the least significant bit (LSB) of these numbers is always 1. This can be done by counting the number of trailing zero bits in the number and right shifting by that many bits. 
Most processors have a dedicated instruction to count the number of trailing zero bits, so this is extremely cheap to do.</p></li></ul><p>The following code block shows a C implementation of this algorithm (it uses the GCC builtin <code>__builtin_ctzl</code> to count the number of trailing zero bits).</p><pre><code>long gcd(long a, long b)
{
    // base conditions
    if (a == 0)
        return b;
    if (b == 0)
        return a;

    // gcd(2^i * a, 2^j * b) = 2^k * gcd(a, b)
    // where k = min(i, j)
    int k = __builtin_ctzl(a | b);

    // make a odd
    a &gt;&gt;= __builtin_ctzl(a);
    while (b != 0) {
        // make b odd
        b &gt;&gt;= __builtin_ctzl(b);
        
        // ensure a &lt;= b
        if (a &gt; b) {
            long temp = a;
            a = b;
            b = temp;
        }
        // gcd(a, b) = gcd(a, (b - a)/2)
        // the division by 2 happens in the next iteration
        b = b - a;
    }
    // multiply 2^k back into the final gcd value
    return a &lt;&lt; k;
}
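
/* Worked example (a=84, b=18), added for illustration and matching
 * the trace shown earlier:
 *   k = ctz(84 | 18) = ctz(86) = 1         -- common factor of 2 set aside
 *   a = 84 &gt;&gt; ctz(84) = 84 &gt;&gt; 2 = 21       -- a made odd
 *   iter 1: b = 18 &gt;&gt; 1 = 9;  swap -&gt; a = 9, b = 21; b = 21 - 9 = 12
 *   iter 2: b = 12 &gt;&gt; 2 = 3;  swap -&gt; a = 3, b = 9;  b = 9 - 3 = 6
 *   iter 3: b = 6 &gt;&gt; 1 = 3;   no swap;               b = 3 - 3 = 0
 *   result: a &lt;&lt; k = 3 * 2 = 6 = gcd(84, 18)
 */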
</code></pre><p>At each step of the iteration, the algorithm converges by dividing b by 2, so the worst case time complexity is <code>O(log_2(max(a, b)))</code>, which is the same as that of the modulo-based algorithm. But this algorithm has the advantage of using efficient hardware instructions (shifts, subtractions, and trailing-zero counts) instead of division. This means the algorithm executes fewer instructions while maintaining a higher IPC, which is a win over the modulo-based algorithm.</p><p>With the theory and hardware considerations in place, let&#8217;s now compare all three algorithms side by side.</p><h2>Benchmark #3: Comparing All Three Algorithms</h2><p>First, let&#8217;s revisit the large-input benchmark: <code>a=1,000,000,000</code> and <code>b=9,223,372,036,854,775,503</code>. The following table summarizes the key findings.</p><p>We see that the binary algorithm performs just as well as the modulo-based algorithm even though it takes a few extra steps to converge.</p><p>We can also analyse this using the Iron Law.</p><ul><li><p>The subtraction algorithm executes 92 billion instructions, which is vastly higher than the ~1 million executed by the other two.</p></li><li><p>The IPC of the subtraction algorithm is very high, but the enormous instruction count dominates, so it still takes a significant amount of time to finish.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7w09!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb41dee7-6bbd-4143-90cd-fbf906dbccf9_1238x260.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7w09!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb41dee7-6bbd-4143-90cd-fbf906dbccf9_1238x260.png 424w, 
https://substackcdn.com/image/fetch/$s_!7w09!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb41dee7-6bbd-4143-90cd-fbf906dbccf9_1238x260.png 848w, https://substackcdn.com/image/fetch/$s_!7w09!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb41dee7-6bbd-4143-90cd-fbf906dbccf9_1238x260.png 1272w, https://substackcdn.com/image/fetch/$s_!7w09!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb41dee7-6bbd-4143-90cd-fbf906dbccf9_1238x260.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7w09!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb41dee7-6bbd-4143-90cd-fbf906dbccf9_1238x260.png" width="1200" height="252.01938610662359" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fb41dee7-6bbd-4143-90cd-fbf906dbccf9_1238x260.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:260,&quot;width&quot;:1238,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:42348,&quot;alt&quot;:&quot;A side-by-side comparison of the performance of the three GCD algorithms for the input: a=1,000,000,000 and b=9,223,372,036,854,775,503&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/169307156?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb41dee7-6bbd-4143-90cd-fbf906dbccf9_1238x260.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="A side-by-side comparison of the 
performance of the three GCD algorithms for the input: a=1,000,000,000 and b=9,223,372,036,854,775,503" title="A side-by-side comparison of the performance of the three GCD algorithms for the input: a=1,000,000,000 and b=9,223,372,036,854,775,503" srcset="https://substackcdn.com/image/fetch/$s_!7w09!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb41dee7-6bbd-4143-90cd-fbf906dbccf9_1238x260.png 424w, https://substackcdn.com/image/fetch/$s_!7w09!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb41dee7-6bbd-4143-90cd-fbf906dbccf9_1238x260.png 848w, https://substackcdn.com/image/fetch/$s_!7w09!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb41dee7-6bbd-4143-90cd-fbf906dbccf9_1238x260.png 1272w, https://substackcdn.com/image/fetch/$s_!7w09!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb41dee7-6bbd-4143-90cd-fbf906dbccf9_1238x260.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">A side-by-side comparison of the performance of the three GCD algorithms for the input: <code>a=1,000,000,000</code> and <code>b=9,223,372,036,854,775,503</code></figcaption></figure></div><p>Next, let&#8217;s compare the performance for the second input, where <code>a=130000</code> and <code>b=13</code>. The following table shows the numbers. In this case, we see that the binary algorithm matches the performance of the subtraction-based algorithm. This happens because it not only executes fewer instructions but also uses more efficient ones, leading to a higher IPC. It strikes the ideal balance from the Iron Law&#8217;s perspective. 
</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Qvd4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad6b633-00ef-473a-b64a-6622c8346475_1109x260.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Qvd4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad6b633-00ef-473a-b64a-6622c8346475_1109x260.png 424w, https://substackcdn.com/image/fetch/$s_!Qvd4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad6b633-00ef-473a-b64a-6622c8346475_1109x260.png 848w, https://substackcdn.com/image/fetch/$s_!Qvd4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad6b633-00ef-473a-b64a-6622c8346475_1109x260.png 1272w, https://substackcdn.com/image/fetch/$s_!Qvd4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad6b633-00ef-473a-b64a-6622c8346475_1109x260.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Qvd4!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad6b633-00ef-473a-b64a-6622c8346475_1109x260.png" width="1200" height="281.33453561767357" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ad6b633-00ef-473a-b64a-6622c8346475_1109x260.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:260,&quot;width&quot;:1109,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:36680,&quot;alt&quot;:&quot;A side-by-side 
comparison of the performance of the three GCD algorithms for the input: a=130000 and b=13&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/169307156?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad6b633-00ef-473a-b64a-6622c8346475_1109x260.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="A side-by-side comparison of the performance of the three GCD algorithms for the input: a=130000 and b=13" title="A side-by-side comparison of the performance of the three GCD algorithms for the input: a=130000 and b=13" srcset="https://substackcdn.com/image/fetch/$s_!Qvd4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad6b633-00ef-473a-b64a-6622c8346475_1109x260.png 424w, https://substackcdn.com/image/fetch/$s_!Qvd4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad6b633-00ef-473a-b64a-6622c8346475_1109x260.png 848w, https://substackcdn.com/image/fetch/$s_!Qvd4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad6b633-00ef-473a-b64a-6622c8346475_1109x260.png 1272w, https://substackcdn.com/image/fetch/$s_!Qvd4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad6b633-00ef-473a-b64a-6622c8346475_1109x260.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">A side-by-side comparison of the performance of the three GCD algorithms for the input: a=130000 and b=13</figcaption></figure></div><p>But comparing the performance on just two inputs is not 
enough. The following table shows the performance of these algorithms on a benchmark where they compute the GCD for all unique combination of values in the range [1, 100000).</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Osol!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc8b34c-74ea-419e-a2aa-5ef042e3ca32_1379x284.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Osol!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc8b34c-74ea-419e-a2aa-5ef042e3ca32_1379x284.png 424w, https://substackcdn.com/image/fetch/$s_!Osol!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc8b34c-74ea-419e-a2aa-5ef042e3ca32_1379x284.png 848w, https://substackcdn.com/image/fetch/$s_!Osol!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc8b34c-74ea-419e-a2aa-5ef042e3ca32_1379x284.png 1272w, https://substackcdn.com/image/fetch/$s_!Osol!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc8b34c-74ea-419e-a2aa-5ef042e3ca32_1379x284.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Osol!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc8b34c-74ea-419e-a2aa-5ef042e3ca32_1379x284.png" width="1200" height="247.13560551124002" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7bc8b34c-74ea-419e-a2aa-5ef042e3ca32_1379x284.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:284,&quot;width&quot;:1379,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:65059,&quot;alt&quot;:&quot;Results of a larger benchmark that ran the three algorithms to compute the GCD of all combination of integers in the range [1, 100000) at increments of 5.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/169307156?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc8b34c-74ea-419e-a2aa-5ef042e3ca32_1379x284.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="Results of a larger benchmark that ran the three algorithms to compute the GCD of all combination of integers in the range [1, 100000) at increments of 5." title="Results of a larger benchmark that ran the three algorithms to compute the GCD of all combination of integers in the range [1, 100000) at increments of 5." 
srcset="https://substackcdn.com/image/fetch/$s_!Osol!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc8b34c-74ea-419e-a2aa-5ef042e3ca32_1379x284.png 424w, https://substackcdn.com/image/fetch/$s_!Osol!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc8b34c-74ea-419e-a2aa-5ef042e3ca32_1379x284.png 848w, https://substackcdn.com/image/fetch/$s_!Osol!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc8b34c-74ea-419e-a2aa-5ef042e3ca32_1379x284.png 1272w, https://substackcdn.com/image/fetch/$s_!Osol!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc8b34c-74ea-419e-a2aa-5ef042e3ca32_1379x284.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Results of a larger benchmark that ran the three algorithms to compute the GCD of all combination of integers in the range [1, 100000) at increments of 5.</figcaption></figure></div><p>Let&#8217;s analyse this for a moment:</p><ul><li><p>The binary algorithm is the fastest, while the modulo-based algorithm is the slowest despite having the same time complexity as the binary algorithm. This highlights that a superior time complexity is not the only thing that gets you performance; the hardware-specific implementation is also crucial.</p></li><li><p>The modulo-based algorithm executed <strong>13 billion instructions</strong> (the lowest) while the subtraction-based algorithm executed <strong>97 billion instructions</strong> (the highest). Yet, the subtraction-based algorithm finished 1.5 seconds earlier. 
This highlights how efficient integer addition and subtraction are in processors compared to division.</p></li><li><p>The subtraction-based algorithm had an <strong>IPC of 1.16</strong> (the highest), while the modulo-based algorithm had an <strong>IPC of 0.15</strong> (the lowest).</p></li><li><p>The binary algorithm doesn&#8217;t win on instruction count or IPC, but it wins on overall execution time because, from the Iron Law&#8217;s point of view, it strikes the right balance between the two factors. In fact, if you benchmark these algorithms on a wide range of values, the binary algorithm delivers consistent performance, while the performance of the other two algorithms can vary depending on the inputs.</p></li></ul><div><hr></div><h2>Conclusion</h2><p>Algorithmic time complexity is important, but real-world performance also depends on how well an algorithm maps to the underlying hardware. Doing less work doesn&#8217;t help if that work is inefficient or poorly suited to the CPU. The fastest implementations are those that align with the strengths of the hardware, such as low-latency instructions, high IPC, and predictable execution.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.codingconfessions.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"><em>Did you find this article insightful? For more like this, consider upgrading to a paid subscription. 
You'll get early access to upcoming articles, exclusive content, and discounted access to some of my books and courses.</em></p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.codingconfessions.com/p/big-o-vs-hardware?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.codingconfessions.com/p/big-o-vs-hardware?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><h2>Code</h2><p>All the code used in the analysis behind this article can be found in the GitHub repo linked below.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://github.com/abhinav-upadhyay/gcd_perf&quot;,&quot;text&quot;:&quot;Check out the code&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://github.com/abhinav-upadhyay/gcd_perf"><span>Check out the code</span></a></p><div><hr></div>]]></content:encoded></item><item><title><![CDATA[x86 Assembly Exercise #1: Toy kill Program (Solution)]]></title><description><![CDATA[A step-by-step walkthrough of the toy kill program using raw Linux syscalls.]]></description><link>https://blog.codingconfessions.com/p/x86-assembly-exercise-1</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/x86-assembly-exercise-1</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Sat, 19 Jul 2025 18:54:05 GMT</pubDate><enclosure 
url="https://api.substack.com/feed/podcast/168730047/c2c04db236265eeab258cebd5c439a25.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>This is a short video in our series on x86-64 assembly. If you have not been following the series, you can start with the <a href="https://blog.codingconfessions.com/p/a-programmers-guide-to-x86-64-assembly">series overview</a>.</p><p>In this video post, we discuss the solution to the homework exercise I gave at the end of the post on <a href="https://blog.codingconfessions.com/p/making-system-calls-in-x86-64-assembly">system calls</a>. The objective of the exercise was as follows:</p><ul><li><p>Write a toy implementation of the kill command with a few simplifications:</p><ul><li><p>Hard-code the process ID to the PID of any running process on your system</p></li><li><p>Hard-code the signal number to 9 (for SIGKILL)</p></li><li><p>Exit the program with the return value of the kill system call</p></li></ul></li></ul><p>If you haven&#8217;t attempted this exercise yet, I highly recommend trying it on your own first and then coming back to this video to verify your solution. </p><p>My aim with such exercises is to give you a taste of systems programming along with teaching assembly. With this combined knowledge of assembly and how things work under the hood, you will be well placed to tackle serious projects in higher-level languages such as C, Rust, Go, Java, etc. 
</p><p>As always, feel free to comment here or reach out to me on email if you have any questions.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.codingconfessions.com/p/x86-assembly-exercise-1?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.codingconfessions.com/p/x86-assembly-exercise-1?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p>
      <p>
          <a href="https://blog.codingconfessions.com/p/x86-assembly-exercise-1">
              Read more
          </a>
      </p>
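<p>If you want to sanity-check what a correct solution should do before (or after) watching the video, here is a rough Python sketch of the same flow. This is not the assembly solution: for safety it forks a throwaway child to use as the target rather than hard-coding the PID of a real running process, then delivers signal 9 to it, mirroring the kill call the exercise is built around.</p>

```python
import os
import signal
import time

# Fork a disposable child process to serve as the kill target,
# instead of hard-coding the PID of a real process.
pid = os.fork()
if pid == 0:
    time.sleep(60)      # child: idle until it is killed
    os._exit(1)

# The heart of the exercise: kill(pid, 9). os.kill raises OSError
# on failure and returns None on success.
os.kill(pid, signal.SIGKILL)

# Reap the child and confirm it was terminated by signal 9.
_, status = os.waitpid(pid, 0)
print(os.WTERMSIG(status))   # 9
```

<p>In the assembly version, the kill syscall&#8217;s return value comes back in <code>rax</code> (0 on success, a negative errno on failure), and the exercise asks you to pass that value on to the exit syscall.</p>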
   ]]></content:encoded></item><item><title><![CDATA[Understanding Registers and Data Movement in x86-64 Assembly]]></title><description><![CDATA[A hands-on guide to general-purpose registers and data movement in x86-64]]></description><link>https://blog.codingconfessions.com/p/x86-registers</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/x86-registers</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Wed, 16 Jul 2025 12:19:08 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!1LJk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37357ac-3781-419c-b151-023edbe3ad1f_1536x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="pullquote"><p><strong>&#8220;In the beginning, there was a word. Then came the doubleword, and finally the quadword.&#8221;</strong></p></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1LJk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37357ac-3781-419c-b151-023edbe3ad1f_1536x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1LJk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37357ac-3781-419c-b151-023edbe3ad1f_1536x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!1LJk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37357ac-3781-419c-b151-023edbe3ad1f_1536x1024.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!1LJk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37357ac-3781-419c-b151-023edbe3ad1f_1536x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!1LJk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37357ac-3781-419c-b151-023edbe3ad1f_1536x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1LJk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37357ac-3781-419c-b151-023edbe3ad1f_1536x1024.jpeg" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e37357ac-3781-419c-b151-023edbe3ad1f_1536x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:309300,&quot;alt&quot;:&quot;Registers in x86-64&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/161886060?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37357ac-3781-419c-b151-023edbe3ad1f_1536x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Registers in x86-64" title="Registers in x86-64" srcset="https://substackcdn.com/image/fetch/$s_!1LJk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37357ac-3781-419c-b151-023edbe3ad1f_1536x1024.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!1LJk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37357ac-3781-419c-b151-023edbe3ad1f_1536x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!1LJk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37357ac-3781-419c-b151-023edbe3ad1f_1536x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!1LJk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37357ac-3781-419c-b151-023edbe3ad1f_1536x1024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Registers in x86-64</figcaption></figure></div><p><em>This article is part of our series on x86-64 assembly. So far we have learned to write simple programs that can move some data around and invoke system calls. For the complete list of articles published so far in this series, check out the <a href="https://blog.codingconfessions.com/p/a-programmers-guide-to-x86-64-assembly">series overview</a>.</em></p><ol><li><p><strong><a href="https://blog.codingconfessions.com/p/seeing-the-matrix">Understanding Computer Organization from First Principles</a></strong><br><em>Bits, memory, and the logic behind modern computing. A gentle dive into the foundations.</em></p></li><li><p><strong><a href="https://blog.codingconfessions.com/p/binary-arithmetic-and-bitwise-operations">Binary Arithmetic and Bitwise Operations for Systems Programming</a></strong><br><em>Signed numbers, two's complement, masking tricks, and bit-level manipulations that matter.</em></p></li><li><p><strong><a href="https://blog.codingconfessions.com/p/the-system-level-foundation-of-assembly">The System-Level Foundation of Assembly</a></strong><br><em>How your code goes from </em><code>main()</code><em> to a running process, and where assembly fits in.</em></p></li><li><p><strong><a href="https://blog.codingconfessions.com/p/building-and-breaking-your-first">Building (and Breaking) Your First X86 Assembly Program</a></strong><br><em>A minimal working program from scratch, with no runtime or C library. 
Learn by breaking it apart.</em></p></li><li><p><strong><a href="https://blog.codingconfessions.com/p/debugging-x86-64-assembly-with-gdb">Debugging X86-64 Assembly with GDB</a></strong><br><em>Hands-on debugging walkthroughs for inspecting registers, memory, and control flow.</em></p></li><li><p><strong><a href="https://blog.codingconfessions.com/p/making-system-calls-in-x86-64-assembly">Making System Calls in x86-64 Assembly</a></strong><br><em>How to interact with the operating system directly using syscalls without a C runtime.</em></p></li></ol><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.codingconfessions.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">This complete series is exclusive to paid subscribers. You can upgrade today to unlock it.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>I&#8217;m also publishing this in the form of an ebook (PDF). If you don&#8217;t wish to upgrade to a subscription, you can purchase the PDF using the following link. 
If you are a paid subscriber, you can get it at a discount (monthly subs: 20% and annual subs: 50%); please email me for the discounted link.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://codingconfessions.gumroad.com/l/ychdk&quot;,&quot;text&quot;:&quot;Purchase Ebook&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://codingconfessions.gumroad.com/l/ychdk"><span>Purchase Ebook</span></a></p><div><hr></div><h2>Introduction</h2><p>Now that we've written and debugged a few x86-64 assembly programs, it's time to take a closer look at one of the most fundamental pieces of the architecture: the general-purpose registers.</p><p>Rather than throwing a table of names and sizes at you, we'll build up a mental model of how these registers evolved, starting from the 8086 and leading up to modern 64-bit hardware. That historical context makes it much easier to understand the naming conventions and relationships, so you're not constantly wondering where things like <code>sil</code> or <code>r8d</code> came from.</p><p>The article also includes hands-on exercises to help you understand how values move between registers of different sizes, and to develop an intuition for how partial registers behave. Along the way, we&#8217;ll also cover some of the edge cases and architectural quirks. These often overwhelm beginners, but I&#8217;ve tried to present them in the right context, so they&#8217;re easier to understand and less likely to trip you up.</p><div><hr></div><h2>Registers in the 16-bit Era</h2><p>The x86 architecture formally began life with the 8086 processor, which was a 16-bit machine. 
This meant that it had 16-bit wide registers, and its instructions could operate on values up to 16 bits in size.</p><p>The general-purpose registers were named after the first four letters of the alphabet: <code>ax</code>, <code>bx</code>, <code>cx</code>, and <code>dx</code>.</p><h3>8-bit Register Halves</h3><p>While these registers could work with 16-bit values, there was also a need to handle 8-bit data. Using bitwise masks to access just the higher or lower 8 bits would have been cumbersome and inefficient, requiring extra instructions. To solve this, the 8086 architecture introduced alternate names to refer directly to the upper and lower 8-bit halves of the 16-bit registers. </p><p>The naming was logical: replace the "<code>x</code>" in the 16-bit register name with "<code>h</code>" for the high byte or "<code>l</code>" for the low byte. For example, <code>ah</code> refers to the high 8 bits of <code>ax</code>, and <code>al</code> refers to the low 8 bits.</p><p>The following diagram shows the full set of general-purpose registers in the 8086, including how the 8-bit halves map onto the 16-bit registers:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DN7L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e058f0-518e-4f71-9db0-5c5d13a6d935_430x571.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DN7L!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e058f0-518e-4f71-9db0-5c5d13a6d935_430x571.png 424w, 
https://substackcdn.com/image/fetch/$s_!DN7L!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e058f0-518e-4f71-9db0-5c5d13a6d935_430x571.png 848w, https://substackcdn.com/image/fetch/$s_!DN7L!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e058f0-518e-4f71-9db0-5c5d13a6d935_430x571.png 1272w, https://substackcdn.com/image/fetch/$s_!DN7L!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e058f0-518e-4f71-9db0-5c5d13a6d935_430x571.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DN7L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e058f0-518e-4f71-9db0-5c5d13a6d935_430x571.png" width="430" height="571" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/50e058f0-518e-4f71-9db0-5c5d13a6d935_430x571.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:571,&quot;width&quot;:430,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:18788,&quot;alt&quot;:&quot;The breakdown of 16-bit registers and their 8-bit halves in the 8086 processor&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/161886060?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e058f0-518e-4f71-9db0-5c5d13a6d935_430x571.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The breakdown of 16-bit registers and their 8-bit halves in the 8086 processor" title="The breakdown of 16-bit registers and their 8-bit halves 
in the 8086 processor" srcset="https://substackcdn.com/image/fetch/$s_!DN7L!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e058f0-518e-4f71-9db0-5c5d13a6d935_430x571.png 424w, https://substackcdn.com/image/fetch/$s_!DN7L!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e058f0-518e-4f71-9db0-5c5d13a6d935_430x571.png 848w, https://substackcdn.com/image/fetch/$s_!DN7L!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e058f0-518e-4f71-9db0-5c5d13a6d935_430x571.png 1272w, https://substackcdn.com/image/fetch/$s_!DN7L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e058f0-518e-4f71-9db0-5c5d13a6d935_430x571.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The breakdown of 16-bit registers and their 8-bit halves in the 8086 processor</figcaption></figure></div><h3>Word Size and Instruction Suffixes</h3><p>If you remember, when we wrote our first x86-64 assembly program, we wrote the following instruction:</p><pre><code><code>movq $32, %rdi</code></code></pre><p>Here, <code>mov</code> is the instruction, and the <code>q</code> suffix stands for "<em>quadword</em>", which in x86-64 means 64 bits.</p><p>x86 uses suffixes to indicate operand sizes: 8-bit, 16-bit, 32-bit, and 64-bit. These suffixes evolved along with the architecture, and we'll explore them as we move from 16-bit to 64-bit.</p><p>If a quadword is 64 bits, you might reason that a word must be 16 bits, and you&#8217;d be right. The 8086 was a 16-bit processor, and as a result its word size was also 16 bits. In computer architecture, the word size is the number of bits of data that the processor can handle in a single operation. So, the assembly instructions for the 8086 used the suffix &#8220;<code>w</code>&#8221; for 16-bit values.</p><h3>Hands-on Exercise: Working with 16-bit Registers</h3><p>Here&#8217;s an example that writes two 16-bit values into <code>ax</code> and <code>bx</code>, computes their difference, and exits.</p><pre><code><code>.text

.globl _start
_start:
    # write two 16-bit values into ax and bx
    movw $100, %ax
    movw $58, %bx

    # compute the difference: ax = ax - bx
    subw %bx, %ax

    # exit with status code: 0    
    movq $60, %rax
    # xoring rdi with itself zeroes it
    xorq %rdi, %rdi
    syscall</code></code></pre><p>Try running this inside <code>gdb</code>, and observe the values of the registers <code>ax</code> and <code>bx</code> after each instruction. You can use the following commands to do this:</p><pre><code>p (short) $ax 
p (short) $bx</code></pre><blockquote><p><strong>Note About the </strong><code>xor</code><strong> Instruction</strong>: In the above program, <code>xorq %rdi, %rdi</code> zeroes out the <code>rdi</code> register. This is a common and efficient trick: XOR-ing a register with itself always results in zero.</p></blockquote><h3>Hands-on Exercise: Working with 8-bit Registers</h3><p>Let&#8217;s run a small program that helps you visualize how the <code>ah</code> and <code>al</code> 8-bit halves relate to the full 16-bit <code>ax</code> register.</p><pre><code><code>.text
.globl _start

_start:
    # write a 16-bit value 0x1234 into ax
    movw $0x1234, %ax

    # copy the high 8 bits of ax into bl
    movb %ah, %bl

    # copy the low 8 bits of ax into ch
    movb %al, %ch

    # exit
    movq $60, %rax
    xorq %rdi, %rdi
    syscall
</code></code></pre><p>Try this in GDB, and inspect the values of <code>%ax</code>, <code>%bl</code>, and <code>%ch</code> after each instruction. You should see:</p><ul><li><p><code>%ax</code> contains <code>0x1234</code></p></li><li><p><code>%ah</code> (upper byte of <code>ax</code>) is <code>0x12</code> &#8594; copied to <code>%bl</code></p></li><li><p><code>%al</code> (lower byte of <code>ax</code>) is <code>0x34</code> &#8594; copied to <code>%ch</code></p></li></ul><p>You can use the following commands to inspect the values of these registers:</p><pre><code>p (short) $ax
p (char) $bl
p (char) $ch</code></pre><div><hr></div><h2>Evolution to x86-32 Architecture</h2>
      <p>
          <a href="https://blog.codingconfessions.com/p/x86-registers">
              Read more
          </a>
      </p>
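<p>As a closing aside, you can double-check the expected values from the 8-bit register exercise above with nothing but shifts and masks. This is plain Python arithmetic, not assembly, but it performs exactly the split that the <code>%ah</code>/<code>%al</code> aliases give you for free in hardware:</p>

```python
# Split a 16-bit value into the halves that %ah and %al name directly.
ax = 0x1234
ah = (ax >> 8) & 0xFF    # high byte, copied to %bl in the exercise
al = ax & 0xFF           # low byte, copied to %ch in the exercise
print(hex(ah), hex(al))      # 0x12 0x34
assert (ah << 8) | al == ax  # the two halves recombine into ax
```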
   ]]></content:encoded></item><item><title><![CDATA[A Programmer’s Guide to x86-64 Assembly (Series Overview)]]></title><description><![CDATA[Welcome to my ongoing series on x86-64 assembly programming, designed for programmers who want to peel back the abstraction and understand how code really runs at the machine level.]]></description><link>https://blog.codingconfessions.com/p/a-programmers-guide-to-x86-64-assembly</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/a-programmers-guide-to-x86-64-assembly</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Wed, 16 Jul 2025 05:14:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!pFGm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a94b5af-aec2-4b17-b011-5c128c67be8d_1536x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pFGm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a94b5af-aec2-4b17-b011-5c128c67be8d_1536x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pFGm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a94b5af-aec2-4b17-b011-5c128c67be8d_1536x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!pFGm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a94b5af-aec2-4b17-b011-5c128c67be8d_1536x1024.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!pFGm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a94b5af-aec2-4b17-b011-5c128c67be8d_1536x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!pFGm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a94b5af-aec2-4b17-b011-5c128c67be8d_1536x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pFGm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a94b5af-aec2-4b17-b011-5c128c67be8d_1536x1024.jpeg" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a94b5af-aec2-4b17-b011-5c128c67be8d_1536x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:347141,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/168445561?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a94b5af-aec2-4b17-b011-5c128c67be8d_1536x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pFGm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a94b5af-aec2-4b17-b011-5c128c67be8d_1536x1024.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!pFGm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a94b5af-aec2-4b17-b011-5c128c67be8d_1536x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!pFGm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a94b5af-aec2-4b17-b011-5c128c67be8d_1536x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!pFGm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a94b5af-aec2-4b17-b011-5c128c67be8d_1536x1024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p></p><p>Welcome to my ongoing series on <strong>x86-64 assembly programming</strong>, designed for programmers who want to peel back the abstraction and understand how code really runs at the machine level.</p><p>Why should a software engineer care about assembly? Because understanding what's happening at the lowest level helps you write better code at every level. It sharpens your intuition about performance bottlenecks, compiler behavior, memory usage, and even security. Whether you're debugging a weird bug, chasing a perf regression, or just curious how high-level constructs boil down to machine instructions, assembly is the Rosetta Stone.</p><p>We start from first principles, covering bits, memory, and CPU instructions, and gradually build up the skills to read and write real-world assembly programs. Whether you're interested in systems programming, performance tuning, or just curious about what your compiler is really doing under the hood, this series is for you.</p><div><hr></div><h2>Published Posts</h2><ol><li><p><strong><a href="https://blog.codingconfessions.com/p/seeing-the-matrix">Understanding Computer Organization from First Principles</a></strong><br><em>Bits, memory, and the logic behind modern computing. 
A gentle dive into the foundations.</em></p></li><li><p><strong><a href="https://blog.codingconfessions.com/p/binary-arithmetic-and-bitwise-operations">Binary Arithmetic and Bitwise Operations for Systems Programming</a></strong><br><em>Signed numbers, two's complement, masking tricks, and bit-level manipulations that matter.</em></p></li><li><p><strong><a href="https://blog.codingconfessions.com/p/the-system-level-foundation-of-assembly">The System-Level Foundation of Assembly</a></strong><br><em>How your code goes from </em><code>main()</code><em> to a running process, and where assembly fits in.</em></p></li><li><p><strong><a href="https://blog.codingconfessions.com/p/building-and-breaking-your-first">Building (and Breaking) Your First X86 Assembly Program</a></strong><br><em>A minimal working program from scratch, with no runtime or C library. Learn by breaking it apart.</em></p></li><li><p><strong><a href="https://blog.codingconfessions.com/p/debugging-x86-64-assembly-with-gdb">Debugging X86-64 Assembly with GDB</a></strong><br><em>Hands-on debugging walkthroughs for inspecting registers, memory, and control flow.</em></p></li><li><p><strong><a href="https://blog.codingconfessions.com/p/making-system-calls-in-x86-64-assembly">Making System Calls in x86-64 Assembly</a></strong><br><em>How to interact with the operating system directly using syscalls without a C runtime.</em></p></li><li><p><strong><a href="https://blog.codingconfessions.com/p/x86-registers">Understanding Registers and Data Movement in x86-64 Assembly</a></strong></p><p><em>Systematic coverage of the general-purpose registers in x86-64 architecture and how to move data between them.</em></p></li><li><p><strong><a href="https://blog.codingconfessions.com/p/x86-addressing-modes-part-1-immediate">x86 Addressing Modes, Part 1 &#8212; Immediate and Direct Access</a></strong></p><p><em>Learn about static data allocation, and accessing memory using immediate and direct access modes. 
Setting up the foundation for the more advanced addressing modes in the upcoming articles. You will master these two addressing modes by implementing interesting exercises. For immediate addressing mode, you write your own implementation of the cat utility in x86 assembly and for direct memory addressing, you write a benchmarking program.</em></p></li></ol><p></p><div><hr></div><h2>Upcoming Topics</h2><p>Here&#8217;s a peek at what&#8217;s planned for future posts (subject to change based on feedback and curiosity):</p><ul><li><p>Registers, stack, and calling conventions</p></li><li><p>Memory addressing and pointer arithmetic</p></li><li><p>Writing loops and conditionals in pure assembly</p></li><li><p>Implementing functions and recursion</p></li><li><p>A deeper dive into Linux syscalls (file I/O, process management, etc.)</p></li><li><p>Mini-project: writing a simple command-line utility</p></li><li><p>Capstone: building a minimal web server in assembly</p></li></ul><div><hr></div><p>You can subscribe to get new posts as they drop. I&#8217;m writing this series with care, making sure each part builds up your intuition as well as your skillset. 
Feel free to share, comment, or ask questions.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.codingconfessions.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.codingconfessions.com/subscribe?"><span>Subscribe now</span></a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.codingconfessions.com/p/a-programmers-guide-to-x86-64-assembly?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.codingconfessions.com/p/a-programmers-guide-to-x86-64-assembly?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p>]]></content:encoded></item><item><title><![CDATA[Why This Python Performance Trick Doesn’t Matter Anymore]]></title><description><![CDATA[A deep dive into Python&#8217;s name resolution, bytecode, and how CPython 3.11 quietly made a popular optimization irrelevant.]]></description><link>https://blog.codingconfessions.com/p/old-python-performance-trick</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/old-python-performance-trick</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Sat, 28 Jun 2025 11:35:12 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5f4257f6-cdc9-41f7-a7ba-f5af35428aef_1536x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The trick to performance optimization is mechanical sympathy: writing code that makes it easier for the hardware to execute it efficiently. 
In the past, CPU microarchitectures evolved so quickly that an optimization might become obsolete in just a few years because the hardware had simply become better at running the same code.</p><p>The same idea applies when writing code in interpreted languages like Python. Sometimes you need to use tricks that help the language&#8217;s virtual machine (VM) run your code faster. But just like hardware improves, the Python VM and compiler also keep evolving. As a result, optimizations that once made a difference may no longer matter.</p><p>One such optimization trick in Python is to create a local alias for a function you&#8217;re calling repeatedly inside a hot loop. Here&#8217;s what that looks like:</p><pre><code><code># Benchmark 1: Calling built-in len directly
def test_builtin_global(lst: list):
    for _ in range(1_000_000):
        len(lst)

# Benchmark 2: Aliasing built-in len to a local variable
def test_builtin_local(lst: list):
    l = len
    for _ in range(1_000_000):
        l(lst)
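
# --- Illustrative addition (not part of the original snippet) ---
# A minimal way one could time the two versions above; the name
# `measure`, its parameters, and the best-of-N approach are
# assumptions made for this sketch, not the article's actual setup.
import time

def measure(fn, *args, repeats=3):
    # Best-of-N wall-clock time for a single call to fn(*args).
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

# e.g. compare measure(test_builtin_global, list(range(10)))
#      with    measure(test_builtin_local, list(range(10)))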
</code></code></pre><p>This trick works because of how Python resolves variable names. Creating a local alias replaces a global lookup with a local one, which is much faster in CPython. But is it still worth doing?</p><p>I benchmarked this code across recent Python releases, and the results suggest that the answer is: not really. So what changed?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yze-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b012e7-97b3-4ec2-83ef-5d0f036d41d4_1135x322.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yze-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b012e7-97b3-4ec2-83ef-5d0f036d41d4_1135x322.png 424w, https://substackcdn.com/image/fetch/$s_!yze-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b012e7-97b3-4ec2-83ef-5d0f036d41d4_1135x322.png 848w, https://substackcdn.com/image/fetch/$s_!yze-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b012e7-97b3-4ec2-83ef-5d0f036d41d4_1135x322.png 1272w, https://substackcdn.com/image/fetch/$s_!yze-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b012e7-97b3-4ec2-83ef-5d0f036d41d4_1135x322.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yze-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b012e7-97b3-4ec2-83ef-5d0f036d41d4_1135x322.png" width="1135" height="322" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49b012e7-97b3-4ec2-83ef-5d0f036d41d4_1135x322.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:322,&quot;width&quot;:1135,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:52258,&quot;alt&quot;:&quot;Performance of global vs local object access across recent CPython releases&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/166575181?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b012e7-97b3-4ec2-83ef-5d0f036d41d4_1135x322.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Performance of global vs local object access across recent CPython releases" title="Performance of global vs local object access across recent CPython releases" srcset="https://substackcdn.com/image/fetch/$s_!yze-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b012e7-97b3-4ec2-83ef-5d0f036d41d4_1135x322.png 424w, https://substackcdn.com/image/fetch/$s_!yze-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b012e7-97b3-4ec2-83ef-5d0f036d41d4_1135x322.png 848w, https://substackcdn.com/image/fetch/$s_!yze-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b012e7-97b3-4ec2-83ef-5d0f036d41d4_1135x322.png 1272w, https://substackcdn.com/image/fetch/$s_!yze-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b012e7-97b3-4ec2-83ef-5d0f036d41d4_1135x322.png 1456w" sizes="100vw" 
fetchpriority="high"></picture></div></a><figcaption class="image-caption">Performance of global vs local object access across recent CPython releases</figcaption></figure></div><p>To answer that, we&#8217;ll need to dig into how Python resolves names during execution, and how that behavior has evolved in recent versions. 
In particular, we&#8217;ll explore:</p><ul><li><p>Why this trick worked in earlier versions of Python</p></li><li><p>What changed in recent CPython releases to make it mostly obsolete</p></li><li><p>Whether there are still edge cases where it helps</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.codingconfessions.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"><em>Writing this article took me several days and nights. You can support my work by becoming a paid subscriber. As a paid subscriber you get early access to all articles, exclusive articles and discounted access to courses/books.</em></p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div></li></ul><div><hr></div><h2><a href="https://coderabbit.link/abhinav">Cut Code Review Time &amp; Bugs in Half (Sponsored)</a></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://coderabbit.link/abhinav" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DpxV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png 424w, https://substackcdn.com/image/fetch/$s_!DpxV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png 
848w, https://substackcdn.com/image/fetch/$s_!DpxV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png 1272w, https://substackcdn.com/image/fetch/$s_!DpxV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DpxV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://coderabbit.link/abhinav&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DpxV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png 424w, https://substackcdn.com/image/fetch/$s_!DpxV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png 848w, 
https://substackcdn.com/image/fetch/$s_!DpxV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png 1272w, https://substackcdn.com/image/fetch/$s_!DpxV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Get Started with Code Rabbit Today to Simplify your Code Reviews</figcaption></figure></div><p>Code reviews are 
critical but time-consuming. CodeRabbit acts as your AI co-pilot, providing instant code review comments and an assessment of the potential impact of every pull request.</p><p>Beyond just flagging issues, CodeRabbit provides one-click fix suggestions and lets you define custom code quality rules using AST Grep patterns, catching subtle issues that traditional static analysis tools might miss.</p><p>CodeRabbit has so far reviewed more than 10 million PRs, is installed on 1 million repositories, and is used by 70 thousand open-source projects. CodeRabbit is free for all open-source repos.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://coderabbit.link/abhinav&quot;,&quot;text&quot;:&quot;Get Started Today&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://coderabbit.link/abhinav"><span>Get Started Today</span></a></p><div><hr></div><h2>How Python Resolves Local and Global Names</h2><p>To understand why this trick made a difference in performance, we need to look at how the Python interpreter resolves variable names; specifically, how it loads locally vs globally scoped objects.</p><p>Python uses a stack-based virtual machine. This means it evaluates expressions by pushing operands onto a stack and performing operations by popping those operands off. For example, to evaluate <code>a + b</code>, the interpreter pushes <code>a</code> and <code>b</code> onto the stack, pops them off, performs the addition, and then pushes the result back on.</p><p>Function calls work the same way. 
For a call like <code>len(lst)</code>, the interpreter pushes both the function object <code>len</code> and its argument <code>lst</code> onto the stack, then pops and uses them to execute the function.</p><p>But from where does the interpreter find and load objects like <code>len</code> or <code>lst</code>?</p><p>The interpreter checks three different places when resolving names:</p><ul><li><p><strong>Locals</strong>: A table of locally scoped variables, including function arguments. In CPython, this is implemented as an array (shared with the VM stack). The compiler emits the <code>LOAD_FAST</code> instruction with a precomputed index to retrieve values from this table, which makes local lookups very fast.</p></li><li><p><strong>Globals</strong>: A dictionary of global variables, including imported modules and functions. Accessing this requires a hash lookup using the variable&#8217;s name, which is slower than a local array access.</p></li><li><p><strong>Builtins</strong>: Functions like <code>len</code>, <code>min</code>, and <code>max</code>. These live in a separate dictionary and are checked last if the name isn&#8217;t found in globals.</p></li></ul><p>With that understanding of how name resolution works in CPython, let&#8217;s now compare the disassembly of the two versions of our benchmark function.</p><p><em>For a more comprehensive coverage of the CPython virtual machine, check out my article on its internals:</em></p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;bddbac26-ccae-4a4a-b672-e0784602edc0&quot;,&quot;caption&quot;:&quot;For every bytecode compiled language, the most interesting part of its implementation is its virtual machine (also referred to as the bytecode interpreter) where the bytecode execution takes place. Because this is such a crucial part of the language machinery, its implementation has to be highly performant. 
Even if you are not a compiler engineer, learning about such internal implementation can give you new performance tricks and insights that you may be able to use in other places of your job. And, if you are a compiler engineer then you should always look around how other languages are implemented to pickup implementation details that you may not be aware of.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The Design &amp; Implementation of the CPython Virtual Machine&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2024-08-31T14:35:14.115Z&quot;,&quot;cover_image&quot;:&quot;https://images.unsplash.com/photo-1504639725590-34d0984388bd?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxOXx8dmlydHVhbCUyMG1hY2hpbmV8ZW58MHx8fHwxNzI1MDI0MzE1fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/cpython-vm-internals&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:143567425,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:45,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code 
Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!lstI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><h2>Dissecting Unoptimized Python Bytecode</h2><p>Let&#8217;s take a look at what&#8217;s actually happening under the hood. We can use Python&#8217;s built-in <code>dis</code> module to view the bytecode generated by our functions. Below is the disassembly of the slower version, the one that calls <code>len</code> directly:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9aqh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7f42b7-8f5e-4762-8681-b7f6045483ac_1234x470.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9aqh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7f42b7-8f5e-4762-8681-b7f6045483ac_1234x470.png 424w, https://substackcdn.com/image/fetch/$s_!9aqh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7f42b7-8f5e-4762-8681-b7f6045483ac_1234x470.png 848w, https://substackcdn.com/image/fetch/$s_!9aqh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7f42b7-8f5e-4762-8681-b7f6045483ac_1234x470.png 1272w, 
https://substackcdn.com/image/fetch/$s_!9aqh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7f42b7-8f5e-4762-8681-b7f6045483ac_1234x470.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9aqh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7f42b7-8f5e-4762-8681-b7f6045483ac_1234x470.png" width="1234" height="470" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4e7f42b7-8f5e-4762-8681-b7f6045483ac_1234x470.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:470,&quot;width&quot;:1234,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The bytecode disassembly for the slow (unoptimized) version&quot;,&quot;title&quot;:&quot;The bytecode disassembly for the slow (unoptimized) version&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The bytecode disassembly for the slow (unoptimized) version" title="The bytecode disassembly for the slow (unoptimized) version" srcset="https://substackcdn.com/image/fetch/$s_!9aqh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7f42b7-8f5e-4762-8681-b7f6045483ac_1234x470.png 424w, https://substackcdn.com/image/fetch/$s_!9aqh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7f42b7-8f5e-4762-8681-b7f6045483ac_1234x470.png 848w, 
https://substackcdn.com/image/fetch/$s_!9aqh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7f42b7-8f5e-4762-8681-b7f6045483ac_1234x470.png 1272w, https://substackcdn.com/image/fetch/$s_!9aqh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7f42b7-8f5e-4762-8681-b7f6045483ac_1234x470.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The bytecode disassembly for the slow (unoptimized) version</figcaption></figure></div><p>Let&#8217;s break down 
what&#8217;s happening in those highlighted instructions:</p><ul><li><p><strong>LOAD_GLOBAL</strong>: This instruction loads the name <code>len</code> from the global scope onto the stack. In the disassembly, you&#8217;ll see something like <code>LOAD_GLOBAL 3 (NULL + len)</code>. That <code>3</code> is the argument (<code>oparg</code>) passed to the instruction. Since CPython 3.11, its lowest bit is a flag telling the interpreter to push a <code>NULL</code> onto the stack before the loaded value (that&#8217;s the <code>NULL + len</code> in the output), and the remaining bits form an index into the <code>co_names</code> array, which is a tuple of all names used in the function for global or builtin lookups. So, <code>co_names[3 &gt;&gt; 1]</code> gives <code>'len'</code>. The interpreter retrieves the string <code>'len'</code>, hashes it, and performs a dictionary lookup in <code>globals()</code>, falling back to <code>builtins</code> if needed. This multi-step lookup makes <code>LOAD_GLOBAL</code> more expensive than other name resolution instructions. (We will look at how <code>LOAD_GLOBAL</code> is implemented in CPython right after this.)</p></li><li><p><strong>LOAD_FAST</strong>: After loading the function to be called, the interpreter next needs to push all the arguments. In this case, <code>len</code> takes only one argument: the list object. This is done using the <code>LOAD_FAST</code> instruction. It loads the <code>lst</code> object using a direct index into the array of local variables, so there&#8217;s no hashing or dictionary lookup involved. It&#8217;s just a simple array access, which makes it very fast.</p></li><li><p><strong>CALL</strong>: Next, the interpreter needs to perform the function call. This is done using the <code>CALL</code> instruction. The number after <code>CALL</code> tells the interpreter how many arguments are being passed. So, <code>CALL 1</code> means one argument is being supplied. To execute the call, the interpreter pops that many arguments from the stack, followed by the function object itself. 
It then calls the function with those arguments and pushes the return value back onto the stack.</p></li></ul><p>One of the costlier steps here is <code>LOAD_GLOBAL</code>, both in terms of what it does and how it&#8217;s implemented. We&#8217;ve already seen that it involves looking up a name from the <code>co_names</code> array, hashing it, and checking up to two dictionaries, <code>globals</code> and <code>builtins</code>, before it can push the result onto the stack. All of that makes it noticeably slower than a simple local access.</p><p>To understand just how much work it does behind the scenes, let&#8217;s now take a look at its actual implementation in CPython.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!asx2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e0d804-26f1-4fc7-aa97-006a64abfe83_1258x1140.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!asx2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e0d804-26f1-4fc7-aa97-006a64abfe83_1258x1140.png 424w, https://substackcdn.com/image/fetch/$s_!asx2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e0d804-26f1-4fc7-aa97-006a64abfe83_1258x1140.png 848w, https://substackcdn.com/image/fetch/$s_!asx2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e0d804-26f1-4fc7-aa97-006a64abfe83_1258x1140.png 1272w, 
https://substackcdn.com/image/fetch/$s_!asx2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e0d804-26f1-4fc7-aa97-006a64abfe83_1258x1140.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!asx2!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e0d804-26f1-4fc7-aa97-006a64abfe83_1258x1140.png" width="1200" height="1087.4403815580285" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/65e0d804-26f1-4fc7-aa97-006a64abfe83_1258x1140.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1140,&quot;width&quot;:1258,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The implementation of the LOAD_GLOBAL instruction in CPython from the file generated_cases.c.h&quot;,&quot;title&quot;:&quot;The implementation of the LOAD_GLOBAL instruction in CPython from the file generated_cases.c.h&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="The implementation of the LOAD_GLOBAL instruction in CPython from the file generated_cases.c.h" title="The implementation of the LOAD_GLOBAL instruction in CPython from the file generated_cases.c.h" srcset="https://substackcdn.com/image/fetch/$s_!asx2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e0d804-26f1-4fc7-aa97-006a64abfe83_1258x1140.png 424w, 
https://substackcdn.com/image/fetch/$s_!asx2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e0d804-26f1-4fc7-aa97-006a64abfe83_1258x1140.png 848w, https://substackcdn.com/image/fetch/$s_!asx2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e0d804-26f1-4fc7-aa97-006a64abfe83_1258x1140.png 1272w, https://substackcdn.com/image/fetch/$s_!asx2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e0d804-26f1-4fc7-aa97-006a64abfe83_1258x1140.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The implementation of the LOAD_GLOBAL instruction in CPython from the file generated_cases.c.h</figcaption></figure></div><p>The code is taken from the file <a href="https://github.com/python/cpython/blob/main/Python/generated_cases.c.h">generated_cases.c.h</a>, which contains all the opcode implementations. Let&#8217;s focus on the highlighted parts that I have numbered.</p><ol><li><p>The first highlighted block deals with instruction specialization. As we will see later, the default way of looking up globals is slow because it does not know which global symbol we are trying to load and from where. This information is only available to the interpreter at runtime. Instruction specialization caches this dynamic information and creates a specialized instruction, making future executions of the same code faster. We will circle back to this in a later section. Note that this optimization was introduced in CPython 3.11.</p></li><li><p>The second highlighted block is where the actual global lookup happens. It&#8217;s broken into two parts, which I&#8217;ve marked with arrows labeled 3 and 4.</p></li><li><p>First, the interpreter needs to figure out which name it&#8217;s supposed to look up. The <code>LOAD_GLOBAL</code> instruction receives an argument (<code>oparg</code>), which encodes an index into the <code>co_names</code> tuple. This is where all global and builtin names used in the function are stored. The interpreter calls the <code>GETITEM</code> macro to fetch the actual name (a string object) using this index.</p></li><li><p>Once the name is retrieved, the interpreter calls <code>_PyEval_LoadGlobalStackRef</code>. This function looks for the name in the <code>globals</code> dictionary first. 
If it&#8217;s not found there, it falls back to the <code>builtins</code> dictionary.</p></li></ol><p>Let&#8217;s zoom into this part and see the code for doing this globals and builtins lookup. <code>_PyEval_LoadGlobalStackRef</code> simply delegates to a function called <code>_PyDict_LoadGlobalStackRef</code>, defined in <code>dictobject.c</code>, so let&#8217;s directly look at its implementation (shown in the figure below).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!u2if!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79190af0-268b-4461-88cb-9f2202e7a656_1452x748.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!u2if!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79190af0-268b-4461-88cb-9f2202e7a656_1452x748.png 424w, https://substackcdn.com/image/fetch/$s_!u2if!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79190af0-268b-4461-88cb-9f2202e7a656_1452x748.png 848w, https://substackcdn.com/image/fetch/$s_!u2if!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79190af0-268b-4461-88cb-9f2202e7a656_1452x748.png 1272w, https://substackcdn.com/image/fetch/$s_!u2if!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79190af0-268b-4461-88cb-9f2202e7a656_1452x748.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!u2if!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79190af0-268b-4461-88cb-9f2202e7a656_1452x748.png" width="1452" height="748" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/79190af0-268b-4461-88cb-9f2202e7a656_1452x748.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f0cf2036-5cf5-4b61-ab42-7fcbbea47544_1452x748.png&quot;,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:748,&quot;width&quot;:1452,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The function where the actual global lookup is performed&quot;,&quot;title&quot;:&quot;The function where the actual global lookup is performed&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The function where the actual global lookup is performed" title="The function where the actual global lookup is performed" srcset="https://substackcdn.com/image/fetch/$s_!u2if!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79190af0-268b-4461-88cb-9f2202e7a656_1452x748.png 424w, https://substackcdn.com/image/fetch/$s_!u2if!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79190af0-268b-4461-88cb-9f2202e7a656_1452x748.png 848w, https://substackcdn.com/image/fetch/$s_!u2if!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79190af0-268b-4461-88cb-9f2202e7a656_1452x748.png 1272w, 
https://substackcdn.com/image/fetch/$s_!u2if!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79190af0-268b-4461-88cb-9f2202e7a656_1452x748.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The function where the actual global lookup is performed</figcaption></figure></div><p>Here&#8217;s what is happening in this code:</p><ol><li><p>First, the function computes the hash of the name which is being looked up. 
This hash determines the index into the dictionary&#8217;s internal hash table.</p></li><li><p>Next, the function checks the globals dictionary.</p></li><li><p>If the name isn&#8217;t found in <code>globals</code>, the function falls back to checking the <code>builtins</code> dictionary.</p></li></ol><p>From this entire discussion of global lookups in CPython, a few things are worth highlighting:</p><ul><li><p>The lookup requires a hash computation. This means that when you are repeatedly calling a function in a loop, the runtime is computing the hash each time. That said, string hashes are cached, so the overhead isn&#8217;t as bad as it might seem.</p></li><li><p>Another thing to note is that builtins are checked last. So even if you&#8217;re calling a builtin function, the runtime still checks globals first and only then builtins. In a hot loop, these small costs add up.</p></li></ul><p>Next, we&#8217;ll dissect the disassembly of the code with the optimization in place.</p><div><hr></div><h2>Dissecting Optimized Python Bytecode</h2><p>Let&#8217;s see how using a local alias actually changes the bytecode, and why it makes the optimized version faster. 
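</p><p>For reference, the pattern being compared can be sketched as below. This is a minimal reconstruction with illustrative names, not the article&#8217;s exact benchmark code.</p>

```python
# A minimal reconstruction of the two variants under discussion.
# Function and variable names here are illustrative.

def total_len_slow(lst, n):
    total = 0
    for _ in range(n):
        total += len(lst)   # `len` is resolved via LOAD_GLOBAL on every iteration
    return total

def total_len_fast(lst, n):
    l = len                 # one global lookup, bound to a local variable
    total = 0
    for _ in range(n):
        total += l(lst)     # `l` is resolved via a fast local access
    return total
```

<p>Disassembling these two functions with the <code>dis</code> module produces output of the kind shown in the figures.</p><p>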
The following figure shows the bytecode disassembly for this version:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!R0jY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b27ed92-30ec-44a3-9b5b-55c49d178bf0_1427x570.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!R0jY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b27ed92-30ec-44a3-9b5b-55c49d178bf0_1427x570.png 424w, https://substackcdn.com/image/fetch/$s_!R0jY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b27ed92-30ec-44a3-9b5b-55c49d178bf0_1427x570.png 848w, https://substackcdn.com/image/fetch/$s_!R0jY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b27ed92-30ec-44a3-9b5b-55c49d178bf0_1427x570.png 1272w, https://substackcdn.com/image/fetch/$s_!R0jY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b27ed92-30ec-44a3-9b5b-55c49d178bf0_1427x570.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!R0jY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b27ed92-30ec-44a3-9b5b-55c49d178bf0_1427x570.png" width="1427" height="570" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b27ed92-30ec-44a3-9b5b-55c49d178bf0_1427x570.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:570,&quot;width&quot;:1427,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The bytecode disassembly for the optimized Python code&quot;,&quot;title&quot;:&quot;The bytecode disassembly for the optimized Python code&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The bytecode disassembly for the optimized Python code" title="The bytecode disassembly for the optimized Python code" srcset="https://substackcdn.com/image/fetch/$s_!R0jY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b27ed92-30ec-44a3-9b5b-55c49d178bf0_1427x570.png 424w, https://substackcdn.com/image/fetch/$s_!R0jY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b27ed92-30ec-44a3-9b5b-55c49d178bf0_1427x570.png 848w, https://substackcdn.com/image/fetch/$s_!R0jY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b27ed92-30ec-44a3-9b5b-55c49d178bf0_1427x570.png 1272w, https://substackcdn.com/image/fetch/$s_!R0jY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b27ed92-30ec-44a3-9b5b-55c49d178bf0_1427x570.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The bytecode disassembly for the optimized version where we create a local alias for the len function</figcaption></figure></div><p>Let&#8217;s focus on the highlighted instructions that are responsible for the call to <code>l</code>, which is the alias we created for <code>len</code>. The key difference between the unoptimized and this version is that this one uses the <code>LOAD_FAST</code> instruction instead of <code>LOAD_GLOBAL</code> to load the function object onto the stack. 
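</p><p>You can confirm this difference directly with the <code>dis</code> module. The snippet below is a sketch; exact opcode names vary slightly across CPython versions (for example, newer releases may fuse consecutive fast loads into <code>LOAD_FAST_LOAD_FAST</code>).</p>

```python
import dis

def use_global(lst):
    return len(lst)     # `len` is loaded with LOAD_GLOBAL

def use_alias(lst):
    l = len             # the alias lives in a local variable slot
    return l(lst)       # `l` is loaded with a LOAD_FAST-family instruction

global_ops = [i.opname for i in dis.get_instructions(use_global)]
alias_ops = [i.opname for i in dis.get_instructions(use_alias)]

print("LOAD_GLOBAL" in global_ops)                          # True
print(any(op.startswith("LOAD_FAST") for op in alias_ops))  # True
```

<p>In the alias version, the call target is resolved through the local-variable array, exactly as described above.</p><p>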
So, let&#8217;s look at how <code>LOAD_FAST</code> is implemented in CPython (shown in the figure below).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4Vyq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7625657c-14f3-4ca5-bc5a-5e10741d1c41_1330x443.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4Vyq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7625657c-14f3-4ca5-bc5a-5e10741d1c41_1330x443.png 424w, https://substackcdn.com/image/fetch/$s_!4Vyq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7625657c-14f3-4ca5-bc5a-5e10741d1c41_1330x443.png 848w, https://substackcdn.com/image/fetch/$s_!4Vyq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7625657c-14f3-4ca5-bc5a-5e10741d1c41_1330x443.png 1272w, https://substackcdn.com/image/fetch/$s_!4Vyq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7625657c-14f3-4ca5-bc5a-5e10741d1c41_1330x443.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4Vyq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7625657c-14f3-4ca5-bc5a-5e10741d1c41_1330x443.png" width="1330" height="443" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7625657c-14f3-4ca5-bc5a-5e10741d1c41_1330x443.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d7104d2f-cb4c-40b7-b46b-852c05275bde_1330x443.png&quot;,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:443,&quot;width&quot;:1330,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The implementation of the LOAD_FAST instruction in CPython&quot;,&quot;title&quot;:&quot;The implementation of the LOAD_FAST instruction in CPython&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The implementation of the LOAD_FAST instruction in CPython" title="The implementation of the LOAD_FAST instruction in CPython" srcset="https://substackcdn.com/image/fetch/$s_!4Vyq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7625657c-14f3-4ca5-bc5a-5e10741d1c41_1330x443.png 424w, https://substackcdn.com/image/fetch/$s_!4Vyq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7625657c-14f3-4ca5-bc5a-5e10741d1c41_1330x443.png 848w, https://substackcdn.com/image/fetch/$s_!4Vyq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7625657c-14f3-4ca5-bc5a-5e10741d1c41_1330x443.png 1272w, https://substackcdn.com/image/fetch/$s_!4Vyq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7625657c-14f3-4ca5-bc5a-5e10741d1c41_1330x443.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div 
class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The implementation of the LOAD_FAST instruction in CPython</figcaption></figure></div><p>You can see how short and tight this implementation is. It performs a simple array lookup using an index passed to it as an argument. Unlike <code>LOAD_GLOBAL</code>, which involves multiple function calls and dictionary lookups, <code>LOAD_FAST</code> doesn&#8217;t call anything. It&#8217;s just a direct memory access, which makes it extremely fast.</p><p>By now, you should have a clear understanding of why this optimization trick works. 
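</p><p>If you want to measure the effect yourself, a quick <code>timeit</code> sketch like the one below works. On CPython 3.11 and newer, expect the two timings to be close, for reasons covered in the next section.</p>

```python
import timeit

setup = "lst = list(range(100))"
stmt_global = "for _ in range(1000): len(lst)"          # global lookup on each call
stmt_alias = "l = len\nfor _ in range(1000): l(lst)"    # one lookup, then local access

t_global = timeit.timeit(stmt_global, setup=setup, number=200)
t_alias = timeit.timeit(stmt_alias, setup=setup, number=200)
print(f"global len: {t_global:.4f}s, local alias: {t_alias:.4f}s")
```

<p>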
By creating a local variable for the <code>len</code> builtin, we turned an expensive global lookup into a fast local lookup, which is what makes the performance difference.</p><p>But as we saw in the benchmark results, starting with CPython 3.11, this optimization no longer makes a meaningful difference in performance. So, what changed? Let&#8217;s see that next.</p><div><hr></div><h2>Inside CPython's Instruction Specialization</h2><p>CPython 3.11 introduced a major optimization called the <a href="https://docs.python.org/3/whatsnew/3.11.html#whatsnew311-pep659">specializing adaptive interpreter</a>. It addresses one of the core performance challenges in dynamically typed languages. In such languages, bytecode instructions are type-agnostic, meaning they don&#8217;t know what types of objects they will operate on. For example, CPython has a generic instruction called <code>BINARY_OP</code>, which is used for all binary operations like <code>+</code>, <code>-</code>, <code>*</code>, and <code>/</code>. It works with all object types, including ints, strings, lists, and so on. Therefore, the interpreter has to first check object types at runtime and then dispatch to the appropriate function accordingly.</p><p>So how does instruction specialization work? When a bytecode instruction is executed for the first time, the interpreter captures some of the runtime information about it, such as the type of the objects, the specific operation being performed, etc. Using that information, it replaces the slow generic instruction with a faster specialized instruction.</p><p>Thereafter, whenever the same line of Python code executes again, the interpreter executes the specialized instruction. Inside the specialized instructions, the interpreter always checks that the conditions for specialization still hold true. 
If the conditions have changed, e.g., the types are no longer the same, then the interpreter deoptimizes and falls back to the slower instruction.</p><p>The <code>LOAD_GLOBAL</code> instruction is also a generic instruction. In this case, the interpreter has to do a lot of additional work, such as looking up the name of the symbol, computing the hash, and finally performing lookups in the globals and builtins dictionaries. But once the interpreter sees that you&#8217;re accessing a specific builtin, it specializes <code>LOAD_GLOBAL</code> into <code>LOAD_GLOBAL_BUILTIN</code>.</p><p>The <code>LOAD_GLOBAL_BUILTIN</code> instruction is optimized to check the builtins dictionary directly, i.e., it skips checking the globals dictionary. It also caches the index of the specific builtin we are trying to look up, which avoids the hash computation. The result is that it behaves almost like a <code>LOAD_FAST</code>, performing a fast array lookup instead of a costly dictionary access. The following figure shows its implementation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7UMH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17a2d2c7-b946-427c-af11-b2e71f6cc322_1784x1620.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7UMH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17a2d2c7-b946-427c-af11-b2e71f6cc322_1784x1620.png 424w, https://substackcdn.com/image/fetch/$s_!7UMH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17a2d2c7-b946-427c-af11-b2e71f6cc322_1784x1620.png 848w, 
https://substackcdn.com/image/fetch/$s_!7UMH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17a2d2c7-b946-427c-af11-b2e71f6cc322_1784x1620.png 1272w, https://substackcdn.com/image/fetch/$s_!7UMH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17a2d2c7-b946-427c-af11-b2e71f6cc322_1784x1620.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7UMH!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17a2d2c7-b946-427c-af11-b2e71f6cc322_1784x1620.png" width="1200" height="1089.5604395604396" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/17a2d2c7-b946-427c-af11-b2e71f6cc322_1784x1620.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1322,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The implementation of the LOAD_GLOBAL_BUILTIN instruction in CPython. It is a specialized version of the LOAD_GLOBAL instruction to directly lookup the builtins, skipping the globals check.&quot;,&quot;title&quot;:&quot;The implementation of the LOAD_GLOBAL_BUILTIN instruction in CPython. It is a specialized version of the LOAD_GLOBAL instruction to directly lookup the builtins, skipping the globals check.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="The implementation of the LOAD_GLOBAL_BUILTIN instruction in CPython. It is a specialized version of the LOAD_GLOBAL instruction to directly lookup the builtins, skipping the globals check." 
title="The implementation of the LOAD_GLOBAL_BUILTIN instruction in CPython. It is a specialized version of the LOAD_GLOBAL instruction to directly lookup the builtins, skipping the globals check." srcset="https://substackcdn.com/image/fetch/$s_!7UMH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17a2d2c7-b946-427c-af11-b2e71f6cc322_1784x1620.png 424w, https://substackcdn.com/image/fetch/$s_!7UMH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17a2d2c7-b946-427c-af11-b2e71f6cc322_1784x1620.png 848w, https://substackcdn.com/image/fetch/$s_!7UMH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17a2d2c7-b946-427c-af11-b2e71f6cc322_1784x1620.png 1272w, https://substackcdn.com/image/fetch/$s_!7UMH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17a2d2c7-b946-427c-af11-b2e71f6cc322_1784x1620.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The implementation of the <code>LOAD_GLOBAL_BUILTIN</code> instruction in CPython.</figcaption></figure></div><p>Let&#8217;s break down the highlighted parts:</p><ol><li><p>First, the instruction verifies that the conditions under which <code>LOAD_GLOBAL</code> was specialized into this version still hold. If they don&#8217;t, it falls back to the generic <code>LOAD_GLOBAL</code> implementation.</p></li><li><p>Next, it reads the cached index value, derived from the hash the interpreter computed the last time it executed <code>LOAD_GLOBAL</code>. This means the instruction is specialized for looking up only the <code>len</code> function.</p></li><li><p>Then comes the lookup in the builtins dictionary, which first requires access to the dictionary&#8217;s keys object.</p></li><li><p>From the keys object, it gets the entries of the internal hash table and indexes into them using the cached value. If it finds an entry, that is the object we were trying to load.</p></li></ol><p>As you can see, an expensive hash table lookup has turned into an array lookup at a known index, which is almost the same amount of work as the <code>LOAD_FAST</code> instruction. This is why, in newer CPython releases, we no longer need to hand-optimize by creating a local variable for a global function or object. 
It automatically gets optimized.</p><p>But is this optimization of creating a local alias really obsolete? Maybe not. Let me show you another benchmark.</p><div><hr></div><h2>Benchmarking Imported Functions vs Aliases</h2><p>Let&#8217;s now look at a similar benchmark, this time involving a function from an imported module rather than a builtin. Here&#8217;s what the code looks like:</p><pre><code><code>import timeit
import math

# Benchmark 1: Calling math.sin directly
def benchmark_math_qualified():
    for i in range(1000000):
        math.sin(i)
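# Aside (not part of the benchmark): you can inspect the bytecode a qualified
# call compiles to. This is an illustrative sketch; exact instruction names
# vary across CPython versions (e.g., 3.11 shows LOAD_METHOD where newer
# releases show LOAD_ATTR).

```python
# Disassemble a qualified call and capture the output for inspection.
import dis
import io
import math

def qualified_call():
    math.sin(10)

buf = io.StringIO()
dis.dis(qualified_call, file=buf)
# The disassembly contains a LOAD_GLOBAL for `math` before the attribute
# lookup for `sin` -- the two-level lookup discussed below.
assert "LOAD_GLOBAL" in buf.getvalue()
```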

# Benchmark 2: Aliasing math.sin to a local variable
def benchmark_math_alias():
    mysin = math.sin
    for i in range(1000000):
        mysin(i)
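# A possible timing driver -- an assumption, since the article shows the
# results table but not the harness that produced it.

```python
import timeit

def run_benchmark(fn, repeat=5):
    # Each benchmark already loops a million times internally, so use
    # number=1 and take the best of several repeats to reduce scheduler noise.
    return min(timeit.repeat(fn, number=1, repeat=repeat))

# Example: run_benchmark(lambda: sum(range(1000)))
```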



# Benchmark 3: Calling sin imported via `from math import sin`
from math import sin
def benchmark_from_import():
    for i in range(1000000):
        sin(i)</code></code></pre><p>There are three benchmarks:</p><ol><li><p><strong>benchmark_math_qualified</strong>: calls <code>math.sin</code> directly</p></li><li><p><strong>benchmark_math_alias</strong>: creates a local alias <code>mysin</code> for <code>math.sin</code></p></li><li><p><strong>benchmark_from_import</strong>: uses <code>sin</code> imported via <code>from math import sin</code></p></li></ol><p>And the following table shows the results across the recent CPython releases.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-IJ6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12dae4-55f7-43c7-aecb-4fe631fcae09_1529x322.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-IJ6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12dae4-55f7-43c7-aecb-4fe631fcae09_1529x322.png 424w, https://substackcdn.com/image/fetch/$s_!-IJ6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12dae4-55f7-43c7-aecb-4fe631fcae09_1529x322.png 848w, https://substackcdn.com/image/fetch/$s_!-IJ6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12dae4-55f7-43c7-aecb-4fe631fcae09_1529x322.png 1272w, https://substackcdn.com/image/fetch/$s_!-IJ6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12dae4-55f7-43c7-aecb-4fe631fcae09_1529x322.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!-IJ6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12dae4-55f7-43c7-aecb-4fe631fcae09_1529x322.png" width="1456" height="307" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7f12dae4-55f7-43c7-aecb-4fe631fcae09_1529x322.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:307,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The benchmark results for accessing a function with fully qualified name, a locally aliased name and directly importing it from the module.&quot;,&quot;title&quot;:&quot;The benchmark results for accessing a function with fully qualified name, a locally aliased name and directly importing it from the module.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The benchmark results for accessing a function with fully qualified name, a locally aliased name and directly importing it from the module." title="The benchmark results for accessing a function with fully qualified name, a locally aliased name and directly importing it from the module." 
srcset="https://substackcdn.com/image/fetch/$s_!-IJ6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12dae4-55f7-43c7-aecb-4fe631fcae09_1529x322.png 424w, https://substackcdn.com/image/fetch/$s_!-IJ6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12dae4-55f7-43c7-aecb-4fe631fcae09_1529x322.png 848w, https://substackcdn.com/image/fetch/$s_!-IJ6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12dae4-55f7-43c7-aecb-4fe631fcae09_1529x322.png 1272w, https://substackcdn.com/image/fetch/$s_!-IJ6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12dae4-55f7-43c7-aecb-4fe631fcae09_1529x322.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The benchmark results for accessing a function with fully qualified name, a locally aliased name and directly importing it from the module.</figcaption></figure></div><p>In this case, we see that calling <code>math.sin</code> (fully qualified name) is slowest across the releases and creating an alias is fastest. While calling &#8220;<code>math.sin</code>&#8221; directly has gotten faster in recent Python versions, it still lags behind the alternatives in performance.</p><p>The performance gap here comes from how the function object is resolved when using a fully qualified name like <code>math.sin</code>. It turns into a two-level lookup. 
For example, the following figure shows the disassembly for calling <code>math.sin(10)</code>.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XmWQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7dd5bc2-373d-4787-8975-8c0cdd8d9660_1208x252.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XmWQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7dd5bc2-373d-4787-8975-8c0cdd8d9660_1208x252.png 424w, https://substackcdn.com/image/fetch/$s_!XmWQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7dd5bc2-373d-4787-8975-8c0cdd8d9660_1208x252.png 848w, https://substackcdn.com/image/fetch/$s_!XmWQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7dd5bc2-373d-4787-8975-8c0cdd8d9660_1208x252.png 1272w, https://substackcdn.com/image/fetch/$s_!XmWQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7dd5bc2-373d-4787-8975-8c0cdd8d9660_1208x252.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XmWQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7dd5bc2-373d-4787-8975-8c0cdd8d9660_1208x252.png" width="1208" height="252" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a7dd5bc2-373d-4787-8975-8c0cdd8d9660_1208x252.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:252,&quot;width&quot;:1208,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:46738,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/166575181?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7dd5bc2-373d-4787-8975-8c0cdd8d9660_1208x252.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XmWQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7dd5bc2-373d-4787-8975-8c0cdd8d9660_1208x252.png 424w, https://substackcdn.com/image/fetch/$s_!XmWQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7dd5bc2-373d-4787-8975-8c0cdd8d9660_1208x252.png 848w, https://substackcdn.com/image/fetch/$s_!XmWQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7dd5bc2-373d-4787-8975-8c0cdd8d9660_1208x252.png 1272w, https://substackcdn.com/image/fetch/$s_!XmWQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7dd5bc2-373d-4787-8975-8c0cdd8d9660_1208x252.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The bytecode disassembly for math.sin(10)</figcaption></figure></div><p>Notice that now the interpreter has to execute two instructions to load the function object 
on the stack: <code>LOAD_GLOBAL</code> followed by <code>LOAD_ATTR</code>. <code>LOAD_GLOBAL</code> loads the <code>math</code> module object onto the stack from the global scope. Then, <code>LOAD_ATTR</code> looks up the <code>sin</code> function in the <code>math</code> module and pushes the function object onto the stack.</p><p>Naturally, this requires more work, and the cost grows as the number of lookup levels increases. For example, <code>foo.bar.baz()</code> requires three levels of lookups.</p><p>In recent Python releases, the performance of fully qualified calls has also improved thanks to instruction specialization. However, the interpreter still has to execute multiple instructions, whereas with a local alias it executes a single <code>LOAD_FAST</code> instruction.</p><p>Whether it&#8217;s worth trading the readability of a fully qualified name, such as <code>math.sin</code>, for a small speedup by aliasing it to <code>mysin</code> depends on your goals. If that part of the code is performance-sensitive, and profiling shows this line is a bottleneck, then it&#8217;s worth considering. Otherwise, readability might matter more.</p><div><hr></div><h2>Wrapping Up</h2><p>Aliasing global functions to local variables used to be a meaningful optimization. In earlier versions of Python, global lookups involved more overhead, and avoiding them made a measurable difference. With recent improvements in CPython, especially instruction specialization, that gap has narrowed for many cases.</p><p>Even so, not all lookups are equal. Accessing functions through a module or a deep attribute chain can still carry overhead. Creating a local alias or using <code>from module import name</code> continues to be effective in those situations.</p><p>The larger point is that optimizations don&#8217;t last forever. They depend on the details of the language runtime, which keeps evolving. 
What worked in the past might no longer matter today. If you want performance, it helps to understand how things actually work. That context makes it easier to know which tricks are worth keeping, and which ones you can leave behind in favor of cleaner, simpler code.</p>]]></content:encoded></item><item><title><![CDATA[Making System Calls in x86-64 Assembly]]></title><description><![CDATA[Watch now | Privilege levels, syscall conventions, and how assembly code talks to the Linux kernel]]></description><link>https://blog.codingconfessions.com/p/making-system-calls-in-x86-64-assembly</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/making-system-calls-in-x86-64-assembly</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Mon, 16 Jun 2025 17:44:06 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/166083966/dbfe5ec31f037b047510fa4df0a90f14.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<h2><strong>Introduction</strong></h2><p>In the previous article, we learned to<a href="https://blog.codingconfessions.com/p/debugging-x86-64-assembly-with-gdb"> use gdb</a> and used it to debug our crashing program. Eventually, we discovered that after executing the last instruction, the CPU didn&#8217;t know the program had ended. It continued reading and executing past the end of the .text section in memory, causing a crash. So, we need some way to make our process exit or stop the execution before that happens. 
How can we do that?</p><p>When we ran our program using the shell command <code>./false</code>, it was the shell that invoked the <a href="https://man7.org/linux/man-pages/man2/fork.2.html">fork</a> and <a href="https://man7.org/linux/man-pages/man2/execve.2.html">execve</a> system calls. These created a new process, loaded our program into memory, and scheduled it for execution on the CPU. Similarly, to terminate our program gracefully, we need to invoke another system call that tells the kernel our process is done.</p><p>This system call to exit a process is called &#8220;<code>exit</code>&#8221;. When we write code in high-level languages, the runtime automatically invokes it after the main function returns. However, when writing freestanding assembly, we need to do it ourselves. For that, we need to learn how to call syscalls from assembly.</p><p>In this part, we will:</p><ul><li><p>Understand what system calls are</p></li><li><p>Learn how to invoke them in assembly</p></li><li><p>Fix our crashing program step-by-step</p></li><li><p>Write a second assembly program using getpid</p></li><li><p>Hands-on exercise: a limited version of the kill command </p></li></ul><div><hr></div><blockquote><p><strong>Recap</strong>: <em>If you haven&#8217;t seen the previous articles in the series, here&#8217;s what you have missed:</em></p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;58f82a9d-ca8e-4f2a-b0cf-054daf674c92&quot;,&quot;caption&quot;:&quot;&#8220;Do not try to bend the spoon. That's impossible. Instead, only try to realize the truth... 
there is no spoon.&#8221; &#8212; The Matrix&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Understanding Computer Organization from First Principles&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-04-05T17:54:52.832Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9c6b6f3-e65a-46be-ada9-68a166fbfcf8_1024x1536.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/seeing-the-matrix&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:160249113,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:105,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div 
class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;81e8a5a6-342b-439a-9366-8ebcf8273649&quot;,&quot;caption&quot;:&quot;We wrapped up the X86-64 assembly course last week, and I&#8217;ll be sharing notes from the sessions here as a series of articles. While the live sessions covered much more ground, I think you&#8217;ll find these write-ups valuable in their own right. I&#8217;ll be publishing them gradually over the next few weeks.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The System-Level Foundation of Assembly&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-05-05T08:36:27.146Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/77cbf196-9752-4ec4-a24e-0ad8e8124cb3_627x551.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/the-system-level-foundation-of-assembly&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:162823255,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:24,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code 
Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;bfe1386d-57ea-464f-9467-a2903fba9652&quot;,&quot;caption&quot;:&quot;In our previous article, we explored how computers work from transistors up to program execution. We saw how digital circuits built from logic gates perform calculations using binary data, and how the ALU executes operations on this binary representation.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Binary Arithmetic and Bitwise Operations for Systems Programming&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. 
I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-04-12T05:16:14.645Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/binary-arithmetic-and-bitwise-operations&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:161089202,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:27,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;54c2febb-0eaf-425d-a04a-0ef5ee1beae1&quot;,&quot;caption&quot;:&quot;Introduction&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Building (and Breaking) Your First X86 Assembly Program&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav 
Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-05-16T14:33:57.354Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc44d6555-a755-40ae-b180-dfccbddcaad2_1024x1024.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/building-and-breaking-your-first&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:160056784,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:8,&quot;comment_count&quot;:3,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;faf6c8b7-7250-4fc8-ba12-b932c6c37ff6&quot;,&quot;caption&quot;:&quot;We ended the last article with a minimal x86-64 assembly program that assembled and ran, but then crashed with a segmentation fault. 
Before we move on to fix that properly, this is a good opportunity to step back and understand how to debug such issues.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Debugging X86-64 Assembly with GDB&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-05-26T18:13:24.190Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/714d7235-7026-4e6c-a309-354ebada3991_1024x1024.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/debugging-x86-64-assembly-with-gdb&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:164500692,&quot;type&quot;:&quot;podcast&quot;,&quot;reaction_count&quot;:30,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code 
Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div></blockquote><div><hr></div><p><em><strong>This article is part of a paid subscriber series.</strong><br>If you&#8217;re enjoying the content, please consider upgrading to a paid plan to unlock the rest of this series. Paid subscribers also get  discounted access to courses and books, and the rest of the archive. </em></p><p><em>Alternatively, you can purchase an ebook version of this series. (If you're already a paid subscriber, email me for a discounted link.)</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://codingconfessions.gumroad.com/l/ychdk&quot;,&quot;text&quot;:&quot;I Want the PDF&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://codingconfessions.gumroad.com/l/ychdk"><span>I Want the PDF</span></a></p><div><hr></div><h2><strong>Understanding System Calls</strong></h2><p>Before we learn how to invoke system calls, let&#8217;s first understand why they exist.</p><p>Modern operating systems serve two roles: they manage the execution of programs on the CPU and provide safe, unified access to hardware resources like files, memory, and networks. But, application code cannot directly access these hardware features. Why not?</p><p>There are three main reasons:</p><ul><li><p><strong>Hardware abstraction</strong>: Devices vary widely in design and interface. The OS hides this complexity by exposing a uniform way to access them. 
Whether you're reading from an SSD or a magnetic disk, you use the same system call (read), and the OS handles the details.</p></li><li><p><strong>Portability</strong>: Most modern OSes follow the<a href="https://en.wikipedia.org/wiki/POSIX"> POSIX</a> standard, which defines a consistent set of system calls. If your application uses only POSIX-compliant syscalls, it can compile and run on any compliant OS with minimal changes.</p></li><li><p><strong>Security</strong>: If user programs could directly access memory or I/O devices, they could corrupt system state or access other processes&#8217; data. System calls act as a controlled gateway; only kernel code (running in a higher privilege level) is allowed to interact directly with hardware.</p></li></ul><p>This separation of privilege is enforced by the CPU. On x86, the kernel runs in ring 0 (full privilege), while user programs run in ring 3 (restricted mode). All system calls are implemented inside the kernel at ring 0. To invoke them from ring 3, user space programs need a way to trigger a transition into kernel mode using a mechanism provided by the CPU.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Vkx7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf77686-747a-47bd-9608-ebd6b479dfa8_911x641.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Vkx7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf77686-747a-47bd-9608-ebd6b479dfa8_911x641.png 424w, https://substackcdn.com/image/fetch/$s_!Vkx7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf77686-747a-47bd-9608-ebd6b479dfa8_911x641.png 
848w, https://substackcdn.com/image/fetch/$s_!Vkx7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf77686-747a-47bd-9608-ebd6b479dfa8_911x641.png 1272w, https://substackcdn.com/image/fetch/$s_!Vkx7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf77686-747a-47bd-9608-ebd6b479dfa8_911x641.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Vkx7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf77686-747a-47bd-9608-ebd6b479dfa8_911x641.png" width="911" height="641" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9bf77686-747a-47bd-9608-ebd6b479dfa8_911x641.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:641,&quot;width&quot;:911,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The protection rings in x86 architecture. Kernel runs at ring-0 level which is the highest privilege mode, while user space in ring-3 which is the least privilege mode&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The protection rings in x86 architecture. Kernel runs at ring-0 level which is the highest privilege mode, while user space in ring-3 which is the least privilege mode" title="The protection rings in x86 architecture. 
Kernel runs at ring-0 level which is the highest privilege mode, while user space in ring-3 which is the least privilege mode" srcset="https://substackcdn.com/image/fetch/$s_!Vkx7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf77686-747a-47bd-9608-ebd6b479dfa8_911x641.png 424w, https://substackcdn.com/image/fetch/$s_!Vkx7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf77686-747a-47bd-9608-ebd6b479dfa8_911x641.png 848w, https://substackcdn.com/image/fetch/$s_!Vkx7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf77686-747a-47bd-9608-ebd6b479dfa8_911x641.png 1272w, https://substackcdn.com/image/fetch/$s_!Vkx7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf77686-747a-47bd-9608-ebd6b479dfa8_911x641.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg 
xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The protection rings in x86 architecture. Kernel runs at ring-0 level which is the highest privilege mode, while user space in ring-3 which is the least privilege mode</figcaption></figure></div><h2><strong>Invoking System Calls on x86-64</strong></h2>
      <p>
          <a href="https://blog.codingconfessions.com/p/making-system-calls-in-x86-64-assembly">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[One Law to Rule Them All: The Iron Law of Software Performance]]></title><description><![CDATA[A systems-level reasoning model for understanding why optimizations succeed or fail.]]></description><link>https://blog.codingconfessions.com/p/one-law-to-rule-all-code-optimizations</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/one-law-to-rule-all-code-optimizations</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Sun, 08 Jun 2025 17:27:05 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!40OK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc20805-ddd6-496c-ae7c-9d1cf9d28318_1536x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="pullquote"><p>&#8220;One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them.&#8221;</p><p>&#8212; <em>J.R.R. Tolkien</em></p></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!40OK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc20805-ddd6-496c-ae7c-9d1cf9d28318_1536x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!40OK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc20805-ddd6-496c-ae7c-9d1cf9d28318_1536x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!40OK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc20805-ddd6-496c-ae7c-9d1cf9d28318_1536x1024.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!40OK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc20805-ddd6-496c-ae7c-9d1cf9d28318_1536x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!40OK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc20805-ddd6-496c-ae7c-9d1cf9d28318_1536x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!40OK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc20805-ddd6-496c-ae7c-9d1cf9d28318_1536x1024.jpeg" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3fc20805-ddd6-496c-ae7c-9d1cf9d28318_1536x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:148320,&quot;alt&quot;:&quot;One law to rule them all: the iron law of performance&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/164921281?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc20805-ddd6-496c-ae7c-9d1cf9d28318_1536x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="One law to rule them all: the iron law of performance" title="One law to rule them all: the iron law of performance" srcset="https://substackcdn.com/image/fetch/$s_!40OK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc20805-ddd6-496c-ae7c-9d1cf9d28318_1536x1024.jpeg 424w,
https://substackcdn.com/image/fetch/$s_!40OK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc20805-ddd6-496c-ae7c-9d1cf9d28318_1536x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!40OK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc20805-ddd6-496c-ae7c-9d1cf9d28318_1536x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!40OK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc20805-ddd6-496c-ae7c-9d1cf9d28318_1536x1024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">One law to rule them all: the iron law of performance</figcaption></figure></div><p>Software optimizations are messy and often unpredictable<strong>.</strong> Whether you see a win is not guaranteed, and the reasons are usually unclear. Is there a way to reason about this?</p><p>Maybe there is. In this article, I show you <strong>one law</strong> that explains all low-level code optimizations: when they work, and when they don&#8217;t. It&#8217;s based on the <em><a href="https://en.wikipedia.org/wiki/Iron_law_of_processor_performance">Iron Law of Performance</a></em>, a model widely known in the hardware world but relatively obscure in software circles.</p><p>What we&#8217;ll see is that almost every low-level optimization, whether it's loop unrolling, SIMD vectorization, or branch elimination, ultimately affects just <strong>three metrics</strong>: the number of instructions executed, the number of cycles needed to execute them, and the duration of a single cycle. The Iron Law ties them together and gives us a <strong>unified reasoning model</strong> for software performance.</p><blockquote><p><em>Of course, not all software optimizations fit into this model. Things like algorithmic improvements, contention removal, or language-level tuning (like garbage collection) lie outside its scope. 
I&#8217;m not claiming the Iron Law explains those.</em></p></blockquote><p><strong>What&#8217;s inside:</strong></p><ul><li><p><em>The Iron Law of Performance for software</em></p></li><li><p><em>Loop unrolling: reducing dynamic instructions</em></p></li><li><p><em>Function inlining: boosting IPC through linearization</em></p></li><li><p><em>SIMD vectorization: trading instruction count for complexity</em></p></li><li><p><em>Branch prediction: reducing pipeline flushes</em></p></li><li><p><em>Cache misses: backend stalls and instruction throughput</em></p></li><li><p><em>A reasoning framework to guide optimization decisions</em></p></li></ul><div><hr></div><h2><a href="https://coderabbit.link/abhinav">Cut Code Review Time &amp; Bugs in Half (Sponsored)</a></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://coderabbit.link/abhinav" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DpxV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png 424w, https://substackcdn.com/image/fetch/$s_!DpxV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png 848w, https://substackcdn.com/image/fetch/$s_!DpxV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png 1272w, https://substackcdn.com/image/fetch/$s_!DpxV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!DpxV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://coderabbit.link/abhinav&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!DpxV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png 424w, https://substackcdn.com/image/fetch/$s_!DpxV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png 848w, https://substackcdn.com/image/fetch/$s_!DpxV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png 1272w, https://substackcdn.com/image/fetch/$s_!DpxV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button 
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Get Started with CodeRabbit Today to Simplify Your Code Reviews</figcaption></figure></div><p>Code reviews are critical but time-consuming. CodeRabbit acts as your AI co-pilot, providing instant code review comments and an assessment of the potential impact of every pull request.</p><p>Beyond just flagging issues, CodeRabbit provides one-click fix suggestions and lets you define custom code quality rules using AST Grep patterns, catching subtle issues that traditional static analysis tools might miss.</p><p>CodeRabbit has so far reviewed more than 10 million PRs, is installed on 1 million repositories, and is used by 70 thousand open-source projects.
CodeRabbit is free for all open-source repos.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://coderabbit.link/abhinav&quot;,&quot;text&quot;:&quot;Get Started Today&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://coderabbit.link/abhinav"><span>Get Started Today</span></a></p><div><hr></div><h5>Background Read:</h5><h5>This article assumes some knowledge of CPU microarchitecture, and optimization techniques such as branch elimination, loop unrolling etc. If you are unfamiliar with these, I recommend the following two articles:</h5><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;cad94cee-2810-4d2c-ba33-bb636d52ebd0&quot;,&quot;caption&quot;:&quot;Even the most elegant algorithms can run painfully slow when they fight against your computer's underlying hardware. The difference between mediocre and exceptional performance often comes down to whether your code works with, or against the CPU's architecture.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Hardware-Aware Coding: CPU Architecture Concepts Every Developer Should Know&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. 
I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-03-21T11:11:05.104Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e1f511d0-2519-4282-bdfd-21af1c5b744d_1472x832.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/hardware-aware-coding&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:158157210,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:115,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;d2c22a1d-aab8-49e8-a86c-ccee9c9bc67d&quot;,&quot;caption&quot;:&quot;Simultaneous multithreading (SMT) is a feature that lets a processor handle instructions from two different threads at the same time. But have you ever wondered how this actually works? 
How does the processor keep track of two threads and manage its resources between them?&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Two Threads, One Core: How Simultaneous Multithreading Works Under the Hood&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2024-07-24T10:28:38.815Z&quot;,&quot;cover_image&quot;:&quot;https://images.unsplash.com/photo-1465447142348-e9952c393450?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw1fHxmb3JrZWQlMjByb2FkfGVufDB8fHx8MTcyMTgxMDM4N3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/simultaneous-multithreading&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:146234191,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:52,&quot;comment_count&quot;:3,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code 
Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><h2>The Iron Law of Performance (Hardware)</h2><p>Let&#8217;s start by understanding the Iron Law for hardware performance. It is a simple equation that models the performance of hardware executing a program. This depends on three factors:</p><ol><li><p>Number of instructions executed (also known as the dynamic instruction count)</p></li><li><p>Average number of cycles needed to execute those instructions (cycles per instruction or CPI)</p></li><li><p>Time taken to execute a single CPU cycle (clock cycle time)</p></li></ol><p>The following equation defines the law:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n\\text{Performance} = \\frac{1}{\\text{Instruction Count} \\times \\text{CPI} \\times \\text{Clock Cycle Time}}\n\n&quot;,&quot;id&quot;:&quot;VDTAZJBBBV&quot;}" data-component-name="LatexBlockToDOM"></div><p>CPU architects use this to analyze how an architectural change impacts the performance of the processor. For example, should they increase the depth of the instruction pipeline?</p><p>In a pipelined processor, an instruction moves from one pipeline stage to another in one cycle. Naturally, the cycle time is limited by the slowest pipeline stage. When you increase the pipeline depth, you break some of the stages down into smaller parts, reducing the work done in each stage and in turn reducing the cycle time. This means the processor can now execute more cycles per second.</p><p>However, increasing pipeline depth also raises the penalty of cache and branch misses.
For example, accessing main memory still takes about 100&#8239;ns, which translates to 100 cycles at 1&#8239;GHz but doubles to about 200 cycles at 2&#8239;GHz when cycle time is halved. Likewise, deepening the pipeline from 15 to 20 stages also increases the branch misprediction penalty from ~15 to ~20 cycles.</p><p>These increased latencies and penalties make the average CPI go up as well. So, whether the pipeline depth should be increased, and by how much, depends on the overall tradeoff. The Iron Law gives a simple framework for making these decisions.</p><p>When you do low-level software performance optimizations, similar tradeoffs apply. Every optimization affects a program&#8217;s instruction count, cycles per instruction, and sometimes even the CPU clock frequency. So, it makes sense to apply the same model to analyze and reason about software-level optimizations as well. Let&#8217;s try to do that in the next few sections.</p><div><hr></div><h2>The Iron Law of Performance for Software</h2><p>In the context of software, we can slightly tweak the law to the following form:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Performance} \\propto \\frac{IPC}{\\text{Instruction count} \\times {\\text{Clock Cycle Time}}}&quot;,&quot;id&quot;:&quot;SRGAQYWIFQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, we&#8217;ve replaced <em>CPI</em> (cycles per instruction) with <em>IPC</em> (instructions per cycle). Although they&#8217;re mathematical inverses, IPC is more intuitive for software engineers: for example, modern x86 processors can retire up to 4 instructions per cycle, so IPC gives a clearer sense of how close we are to peak throughput.</p><p>We&#8217;ve also relaxed the equality to a proportionality.
When optimizing software, we&#8217;re not looking for exact numbers; rather, we&#8217;re reasoning about trade-offs.</p><p>So, what role do these three terms play in software performance?</p><ul><li><p><strong>Increasing IPC</strong> means the CPU can retire more instructions per cycle, reducing total execution time.</p></li><li><p><strong>Lowering the dynamic instruction count</strong> means fewer instructions need to be executed overall. In general, this means the CPU needs to do less work and performance should go up. </p></li><li><p><strong>Lowering the clock frequency</strong>, as sometimes happens with power-hungry instructions (e.g., AVX-512), increases cycle time and harms performance.</p></li></ul><p>We&#8217;ll now apply this model to analyze several well-known optimizations: loop unrolling, function inlining, SIMD vectorization, branch elimination, and cache optimizations, and see how each one shifts the Iron Law variables.</p><blockquote><p><strong>Note</strong>: These factors are not entirely independent of each other. For example, when you reduce the dynamic instruction count of your program, you need to be careful about instruction selection. </p><p>As an example, integer additions have a very low latency, whereas integer divisions are expensive. So, if you reduce the instruction count in your program by replacing a high number of integer additions with a small number of integer divisions, your performance may not improve, or it could even degrade. It depends on the tradeoff between the decrease in instruction count and the drop in IPC. Whichever factor wins dictates the overall performance.</p></blockquote><div><hr></div><h2>Loop Unrolling</h2><p>Loop unrolling is a classical optimization. Instead of executing one step of the loop body per iteration, you rewrite the loop to execute multiple steps per iteration. Consider the following loop that computes the sum of an integer array.</p><pre><code>int sum = 0;
for (int i = 0; i &lt; n; i++) {
    sum += arr[i];
}</code></pre><p>If we unroll this loop four times, it will look like the following:</p><pre><code>// Four independent accumulators avoid one long dependency chain on sum
int sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
int i = 0;

// Process 4 elements at a time
for (; i + 3 &lt; n; i += 4) {
    sum0 += arr[i];
    sum1 += arr[i + 1];
    sum2 += arr[i + 2];
    sum3 += arr[i + 3];
}

// Combine the partial sums
int sum = sum0 + sum1 + sum2 + sum3;

// Handle the remaining elements
for (; i &lt; n; i++) {
    sum += arr[i];
}</code></pre><p>Now, let&#8217;s reason about how such an optimization can improve the performance of a program and what tradeoffs to consider, i.e., the situations in which it may not deliver a performance improvement.</p><blockquote><p><em><strong>Note: Usually, you don't have to unroll a loop yourself.</strong> The compiler does it when it determines it will improve performance. But sometimes it cannot, because limited knowledge about the code prevents it from guaranteeing correctness. So it is useful to be aware of the technique.</em></p></blockquote><h3>Impact on Instruction Count</h3><p>For large <code>n</code>, unrolling reduces the dynamic instruction count. In the example shown above, the loop body executes three instructions: a comparison for the loop condition, an increment of the loop counter, and an update of the sum. So, the normal loop executes <code>3n</code> instructions.</p><p>The loop unrolled four times executes one loop comparison, one index increment, and four additions: six instructions per iteration, and <code>6n/4 = 1.5n</code> instructions for an array of size <code>n</code>.</p><p>In Iron Law terms, we&#8217;ve driven down the <em>Instruction Count</em> by nearly 50% (for large <code>n</code>), all else equal. This sounds like an obvious performance win, but we also need to look at how this impacts IPC.</p><h3>Impact on IPC</h3><p>Recall from our Iron Law that <em>Performance &#8733; IPC / Instruction Count</em>. We&#8217;ve already reduced instruction count, so if we can raise IPC (or at least not lower it), net performance improves. Let&#8217;s see how unrolling affects IPC.</p><h4>Increased Instruction Level Parallelism</h4><p>The main advantage of loop unrolling is the potential increase in the instruction throughput of the program.
The processor is capable of executing multiple instructions per cycle; e.g., modern Intel processors can execute four or more instructions every cycle.</p><p>However, to achieve that kind of throughput, the processor needs enough independent instructions to execute, which is rarely the case. Usually, instructions have dependencies between them, i.e., the result produced by one instruction is consumed by the next. Such instructions cannot be executed in parallel.</p><p>For example, consider the assembly (generated by GCC with the <code>-O1</code> flag) for the body of the normal for loop shown previously (without loop unrolling):</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hrqU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dcac4b6-602a-42fc-8d72-6aa0526bee41_847x170.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hrqU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dcac4b6-602a-42fc-8d72-6aa0526bee41_847x170.png 424w, https://substackcdn.com/image/fetch/$s_!hrqU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dcac4b6-602a-42fc-8d72-6aa0526bee41_847x170.png 848w, https://substackcdn.com/image/fetch/$s_!hrqU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dcac4b6-602a-42fc-8d72-6aa0526bee41_847x170.png 1272w, https://substackcdn.com/image/fetch/$s_!hrqU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dcac4b6-602a-42fc-8d72-6aa0526bee41_847x170.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!hrqU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dcac4b6-602a-42fc-8d72-6aa0526bee41_847x170.png" width="847" height="170" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3dcac4b6-602a-42fc-8d72-6aa0526bee41_847x170.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:170,&quot;width&quot;:847,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:25042,&quot;alt&quot;:&quot;Assembly for the body of the normal loop (without unrolling)&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/164921281?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dcac4b6-602a-42fc-8d72-6aa0526bee41_847x170.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Assembly for the body of the normal loop (without unrolling)" title="Assembly for the body of the normal loop (without unrolling)" srcset="https://substackcdn.com/image/fetch/$s_!hrqU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dcac4b6-602a-42fc-8d72-6aa0526bee41_847x170.png 424w, https://substackcdn.com/image/fetch/$s_!hrqU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dcac4b6-602a-42fc-8d72-6aa0526bee41_847x170.png 848w, https://substackcdn.com/image/fetch/$s_!hrqU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dcac4b6-602a-42fc-8d72-6aa0526bee41_847x170.png 1272w, 
https://substackcdn.com/image/fetch/$s_!hrqU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dcac4b6-602a-42fc-8d72-6aa0526bee41_847x170.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Assembly for the body of the normal loop (without unrolling)</figcaption></figure></div><p>Let me explain what is going on:</p><ul><li><p>The register <code>edx</code> holds the current value of the sum and at every iteration the value of <code>arr[i]</code> gets added to it.</p></li></ul><ul><li><p>The register <code>rax</code> holds the address of the current array element, <code>arr[i]</code>. At the beginning of the loop iteration, the value of <code>arr[i]</code> gets added to the sum value in <code>edx</code>. Then in the 2nd instruction <code>rax</code> gets incremented by 4, which means now <code>rax</code> contains the address of the next array element <code>arr[i + 1]</code>.</p></li></ul><ul><li><p>Finally, the last two instructions check if we have reached the end of the array, and if not then jump back to the beginning of the loop.</p></li></ul><p>So, for a large array, the CPU is going to be executing these four instructions for a while. Can it execute some of these in parallel so that the loop finishes faster? Not quite. The instructions have dependencies between them that stop the CPU from doing that. </p><p>Notice that the first <code>addl</code> instruction that updates the sum in <code>edx</code> depends on the previous iteration's <code>edx</code> and <code>rax</code> values, so the CPU can't issue the next iteration's <code>addl</code> until the previous iteration's <code>addl</code> and <code>addq</code> instructions are finished. In other words, there simply aren't any independent instructions for the CPU to execute in parallel.</p><p>Loop unrolling fixes this problem. 
The following assembly code is the loop body for the unrolled code shown previously.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RVmP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14394ab3-0121-4cc1-8ddf-dab987ba31d1_753x245.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RVmP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14394ab3-0121-4cc1-8ddf-dab987ba31d1_753x245.png 424w, https://substackcdn.com/image/fetch/$s_!RVmP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14394ab3-0121-4cc1-8ddf-dab987ba31d1_753x245.png 848w, https://substackcdn.com/image/fetch/$s_!RVmP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14394ab3-0121-4cc1-8ddf-dab987ba31d1_753x245.png 1272w, https://substackcdn.com/image/fetch/$s_!RVmP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14394ab3-0121-4cc1-8ddf-dab987ba31d1_753x245.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RVmP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14394ab3-0121-4cc1-8ddf-dab987ba31d1_753x245.png" width="753" height="245" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/14394ab3-0121-4cc1-8ddf-dab987ba31d1_753x245.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:245,&quot;width&quot;:753,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43969,&quot;alt&quot;:&quot;Assembly instructions for the four-times unrolled loop body&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/164921281?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14394ab3-0121-4cc1-8ddf-dab987ba31d1_753x245.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Assembly instructions for the four-times unrolled loop body" title="Assembly instructions for the four-times unrolled loop body" srcset="https://substackcdn.com/image/fetch/$s_!RVmP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14394ab3-0121-4cc1-8ddf-dab987ba31d1_753x245.png 424w, https://substackcdn.com/image/fetch/$s_!RVmP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14394ab3-0121-4cc1-8ddf-dab987ba31d1_753x245.png 848w, https://substackcdn.com/image/fetch/$s_!RVmP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14394ab3-0121-4cc1-8ddf-dab987ba31d1_753x245.png 1272w, https://substackcdn.com/image/fetch/$s_!RVmP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14394ab3-0121-4cc1-8ddf-dab987ba31d1_753x245.png 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"></div></div></a><figcaption class="image-caption">Assembly instructions for the four-times unrolled loop body</figcaption></figure></div><p>In the unrolled loop, the compiler assigns each partial sum to its own register (<code>edx</code>, <code>edi</code>, <code>esi</code>, <code>ecx</code>), so each <code>addl</code> instruction uses a different register and memory address. 
This makes the instructions independent, so the CPU can issue and execute them in parallel, reducing the number of cycles needed to finish the loop and improving the IPC.</p><p>In Iron Law terms, we&#8217;ve increased <em>IPC</em> from roughly 1.0 (due to dependencies) to perhaps ~3.0 or higher, depending on execution port availability. Combined with a 50% drop in instruction count, that yields a significant net gain.</p><blockquote><p><em><strong>Note</strong>: How many of these add instructions actually execute in parallel depends on how many functional units the CPU has for integer addition. So the right unroll factor also depends on the mix of instructions in the loop body.</em></p></blockquote><p>However, it isn&#8217;t all rosy. Loop unrolling can also hamper the IPC in other ways. Let&#8217;s see how.</p><h4><strong>Register Spills Due to Increased Loop Body Size</strong></h4><p>Unrolling a loop creates many local variables and increases the demand for registers. When a loop is unrolled too many times, or when a large, complicated loop is unrolled, register spills can result: when the compiler runs out of registers, it has to keep some variables on the stack instead.</p><p>When register spills happen, the instructions that read data from the stack instead of registers take longer to finish. While a value can be read from a register within a single cycle, reading from the stack takes 3-4 cycles (assuming an L1 cache hit).</p><p>So, operations that could be done in a single cycle now take several cycles due to memory access. If the CPU has no other instructions to execute in the meantime, it sits idle and wastes resources. This increases the average cycles per instruction and lowers the IPC.</p><p>In Iron Law terms, register spills reduce <em>IPC</em>, which can partially or completely negate the instruction count reduction. 
You must weigh these together.</p><blockquote><p><strong>Note: </strong><em>A register spill does not necessarily drop the IPC, because sometimes the compiler can schedule instructions well enough to keep the CPU busy while others are stalled on memory access. But it is a tradeoff you risk introducing when being too aggressive with this optimization.</em></p></blockquote><h4>Instruction Cache Pressure Due to Increased Loop Body Size</h4><p>Another potential impact of unrolling a loop is the increased code footprint, which can put pressure on the instruction cache. These days instruction caches are large enough that unrolling a loop will not cause cache misses for the loop itself, but in larger systems where the instruction cache is already under pressure, it may cause eviction of other instructions.</p><p>For example, most x86 cores have a 32-64 KB L1 I-cache. For a loop body consisting of four instructions of 4 bytes each, unrolling it four times grows the body from ~16 bytes to ~64 bytes, which is negligible.</p><p>So, in general it is not a huge concern for high-end CPUs. 
But we still need to be aware of the tradeoff from Iron Law's perspective because increased instruction cache misses lower how fast the CPU frontend can send instructions to the backend for execution, thus lowering the IPC.</p><h3>Loop Unrolling from the Lens of the Iron Law</h3><p>Below is a summary of trade-offs when viewing loop unrolling through the Iron Law.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_6zb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d19f2-3925-4bc6-bffc-c23395259bf0_1203x236.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_6zb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d19f2-3925-4bc6-bffc-c23395259bf0_1203x236.png 424w, https://substackcdn.com/image/fetch/$s_!_6zb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d19f2-3925-4bc6-bffc-c23395259bf0_1203x236.png 848w, https://substackcdn.com/image/fetch/$s_!_6zb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d19f2-3925-4bc6-bffc-c23395259bf0_1203x236.png 1272w, https://substackcdn.com/image/fetch/$s_!_6zb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d19f2-3925-4bc6-bffc-c23395259bf0_1203x236.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_6zb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d19f2-3925-4bc6-bffc-c23395259bf0_1203x236.png" width="1203" height="236" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/029d19f2-3925-4bc6-bffc-c23395259bf0_1203x236.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:236,&quot;width&quot;:1203,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Summary of loop unrolling and its tradeoffs from Iron Law&#8217;s lens&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Summary of loop unrolling and its tradeoffs from Iron Law&#8217;s lens" title="Summary of loop unrolling and its tradeoffs from Iron Law&#8217;s lens" srcset="https://substackcdn.com/image/fetch/$s_!_6zb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d19f2-3925-4bc6-bffc-c23395259bf0_1203x236.png 424w, https://substackcdn.com/image/fetch/$s_!_6zb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d19f2-3925-4bc6-bffc-c23395259bf0_1203x236.png 848w, https://substackcdn.com/image/fetch/$s_!_6zb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d19f2-3925-4bc6-bffc-c23395259bf0_1203x236.png 1272w, https://substackcdn.com/image/fetch/$s_!_6zb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d19f2-3925-4bc6-bffc-c23395259bf0_1203x236.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Summary of loop unrolling and its tradeoffs from Iron Law&#8217;s lens</figcaption></figure></div><div><hr></div><div 
class="subscription-widget-wrap-editor"></div><h2>Function Inlining</h2><p>Next, let&#8217;s see how function inlining shifts the balance between instruction count and IPC. It is a simple optimization that the compiler routinely performs: it replaces a function call with the body of the called function, to avoid the overhead of the call. 
Let&#8217;s understand with an example.</p><p>Consider the following C function and its assembly (generated by GCC with -O1)</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EgrZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef28856-ae77-4685-8a8a-8cf8cff37be4_1234x397.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EgrZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef28856-ae77-4685-8a8a-8cf8cff37be4_1234x397.png 424w, https://substackcdn.com/image/fetch/$s_!EgrZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef28856-ae77-4685-8a8a-8cf8cff37be4_1234x397.png 848w, https://substackcdn.com/image/fetch/$s_!EgrZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef28856-ae77-4685-8a8a-8cf8cff37be4_1234x397.png 1272w, https://substackcdn.com/image/fetch/$s_!EgrZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef28856-ae77-4685-8a8a-8cf8cff37be4_1234x397.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EgrZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef28856-ae77-4685-8a8a-8cf8cff37be4_1234x397.png" width="1234" height="397" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eef28856-ae77-4685-8a8a-8cf8cff37be4_1234x397.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:397,&quot;width&quot;:1234,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:79054,&quot;alt&quot;:&quot;A simple C function and its compiler generated assembly &quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/164921281?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef28856-ae77-4685-8a8a-8cf8cff37be4_1234x397.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A simple C function and its compiler generated assembly " title="A simple C function and its compiler generated assembly " srcset="https://substackcdn.com/image/fetch/$s_!EgrZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef28856-ae77-4685-8a8a-8cf8cff37be4_1234x397.png 424w, https://substackcdn.com/image/fetch/$s_!EgrZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef28856-ae77-4685-8a8a-8cf8cff37be4_1234x397.png 848w, https://substackcdn.com/image/fetch/$s_!EgrZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef28856-ae77-4685-8a8a-8cf8cff37be4_1234x397.png 1272w, https://substackcdn.com/image/fetch/$s_!EgrZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef28856-ae77-4685-8a8a-8cf8cff37be4_1234x397.png 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"></div></div></a><figcaption class="image-caption">A simple C function and its compiler generated assembly </figcaption></figure></div><p>Function calls incur hidden costs beyond the core logic. When this function executes, several additional steps occur:</p><ul><li><p><strong>Stack Frame Setup</strong>: A stack frame needs to be set up to manage the function&#8217;s local data on the stack. At a minimum, this requires saving the <code>rbp</code> register on the stack and then copying the current value of the <code>rsp</code> register (the stack pointer) into <code>rbp</code>. So, at least two instructions. 
These days compilers optimize this away if they notice that the function doesn&#8217;t use the stack, but it is not guaranteed.</p></li></ul><ul><li><p><strong>Saving and Restoring Callee-Saved Registers</strong>: As per the <a href="https://wiki.osdev.org/System_V_ABI">System V AMD64 calling convention</a>, certain registers must be saved by the callee function if it needs to use them. This is because those registers might be in use by the caller, and if the callee doesn&#8217;t preserve their previous values, the caller&#8217;s state will be corrupted. So, sometimes you will notice code to save and restore these registers as well. In the case of the code shown above, the function is simple enough that it doesn&#8217;t need to do this, but it is another potential cost of calling functions.</p></li></ul><ul><li><p><strong>Destroying the Stack Frame</strong>: As the function returns, it needs to destroy the stack frame and restore the stack to the state it was in before the call. This again incurs an extra set of instructions, as you can see in the assembly.</p></li></ul><ul><li><p><strong>Function Return</strong>: Finally, the <code>ret</code> instruction is required to return control from the function back to the caller.</p></li></ul><p>Apart from the extra work inside the called function, making the call itself also requires extra instructions. 
The following figure shows a <code>main</code> function calling <code>compute</code>, and on the right hand side you can see the assembly code.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LfC3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc04115b-ce63-4287-b34b-64fdd73bbe3a_1354x702.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LfC3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc04115b-ce63-4287-b34b-64fdd73bbe3a_1354x702.png 424w, https://substackcdn.com/image/fetch/$s_!LfC3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc04115b-ce63-4287-b34b-64fdd73bbe3a_1354x702.png 848w, https://substackcdn.com/image/fetch/$s_!LfC3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc04115b-ce63-4287-b34b-64fdd73bbe3a_1354x702.png 1272w, https://substackcdn.com/image/fetch/$s_!LfC3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc04115b-ce63-4287-b34b-64fdd73bbe3a_1354x702.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LfC3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc04115b-ce63-4287-b34b-64fdd73bbe3a_1354x702.png" width="1354" height="702" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cc04115b-ce63-4287-b34b-64fdd73bbe3a_1354x702.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:702,&quot;width&quot;:1354,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:159401,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/164921281?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc04115b-ce63-4287-b34b-64fdd73bbe3a_1354x702.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LfC3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc04115b-ce63-4287-b34b-64fdd73bbe3a_1354x702.png 424w, https://substackcdn.com/image/fetch/$s_!LfC3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc04115b-ce63-4287-b34b-64fdd73bbe3a_1354x702.png 848w, https://substackcdn.com/image/fetch/$s_!LfC3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc04115b-ce63-4287-b34b-64fdd73bbe3a_1354x702.png 1272w, https://substackcdn.com/image/fetch/$s_!LfC3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc04115b-ce63-4287-b34b-64fdd73bbe3a_1354x702.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The extra instructions required to call a function</figcaption></figure></div><p>So, what are the overheads while making the function call?</p><ul><li><p><strong>Saving/Restoring Caller Saved Registers</strong>: The Sys V AMD64 ABI defines certain registers as caller saved registers, meaning that the caller needs to save these registers on the stack before making the function call so that the callee function can use these registers freely. In the above code for <code>main</code>, you can see it saving the values of <code>rax</code> and <code>rdx</code> on the stack using the <code>push</code> instruction. In this case, the compiler does not care about restoring them back, but usually you also need to restore them back after the function call returns by a corresponding <code>pop</code> instruction. 
So, you have two extra instructions per saved register.</p></li></ul><ul><li><p><strong>Setting up Function Arguments</strong>: Before invoking the function, the caller needs to set up its arguments. The Sys V AMD64 ABI designates certain registers in which these arguments are passed. In the assembly code, I&#8217;ve highlighted the instructions that set up these registers.</p></li></ul><ul><li><p><strong>Calling the Function</strong>: The call itself requires an extra <code>call</code> instruction. Compared to everything else this looks like a small cost, but when a function is executed very frequently, it adds up.</p></li></ul><p>So, when you inline a function, you <strong>save all this extra work</strong> that the CPU needs to do each time the function is called. The following figure shows a version of the program where the <code>compute</code> function has been inlined in <code>main</code>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!go4W!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe59c12-c263-4d52-86a5-cd2677ed4fc9_856x620.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!go4W!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe59c12-c263-4d52-86a5-cd2677ed4fc9_856x620.png 424w, https://substackcdn.com/image/fetch/$s_!go4W!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe59c12-c263-4d52-86a5-cd2677ed4fc9_856x620.png 848w, 
https://substackcdn.com/image/fetch/$s_!go4W!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe59c12-c263-4d52-86a5-cd2677ed4fc9_856x620.png 1272w, https://substackcdn.com/image/fetch/$s_!go4W!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe59c12-c263-4d52-86a5-cd2677ed4fc9_856x620.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!go4W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe59c12-c263-4d52-86a5-cd2677ed4fc9_856x620.png" width="856" height="620" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2fe59c12-c263-4d52-86a5-cd2677ed4fc9_856x620.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:620,&quot;width&quot;:856,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:66518,&quot;alt&quot;:&quot;The assembly of the main function after inlining the compute function&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/164921281?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe59c12-c263-4d52-86a5-cd2677ed4fc9_856x620.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The assembly of the main function after inlining the compute function" title="The assembly of the main function after inlining the compute function" srcset="https://substackcdn.com/image/fetch/$s_!go4W!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe59c12-c263-4d52-86a5-cd2677ed4fc9_856x620.png 
424w, https://substackcdn.com/image/fetch/$s_!go4W!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe59c12-c263-4d52-86a5-cd2677ed4fc9_856x620.png 848w, https://substackcdn.com/image/fetch/$s_!go4W!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe59c12-c263-4d52-86a5-cd2677ed4fc9_856x620.png 1272w, https://substackcdn.com/image/fetch/$s_!go4W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe59c12-c263-4d52-86a5-cd2677ed4fc9_856x620.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The assembly of the main function after inlining the compute function</figcaption></figure></div><p>As you can see, inlining has streamlined the entire flow. The extra instructions are gone and only the core logic of the inlined function remains.</p><p>Now, this may look like an obvious performance win, but let&#8217;s analyse it from the perspective of the Iron Law.</p><h3>Impact on Program Instruction Count</h3><p>Function inlining reduces the dynamic instruction count of the program (the number of instructions executed) in two ways: directly, by avoiding the function call overhead, and indirectly, through compiler optimizations that inlining unlocks.</p><h4>Direct Reduction in Instruction Count</h4><ul><li><p>call/ret instruction elimination</p></li></ul><ul><li><p>stack frame setup/teardown elimination</p></li></ul><ul><li><p>function argument handling elimination</p></li></ul><ul><li><p>register save/restore elimination</p></li></ul><p>Even if we conservatively assume a saving of 5 instructions per call, inlining a function that is called one million times saves the CPU from executing 5 million extra instructions. This also frees up cycles to execute other instructions, improving the overall IPC.</p><h4>Context-Sensitive Optimizations</h4><p>Apart from eliminating function call overhead, inlining gives the compiler the opportunity to further optimize the inlined code, because it has more information about the context in which the function is being called.</p><p>For example, consider the following code (<a href="https://sbaziotis.com/compilers/common-misconceptions-about-compilers.html#inlining-is-useful-primarily-because-it-eliminates-a-call-instruction">source</a>).</p><pre><code>#include &lt;limits.h&gt;

int sat_div(int num, int denom) {
  if (denom == 0) {
    return (num &gt; 0) ? INT_MAX : INT_MIN;
  }
  return num / denom;
}

int foo(int a) {
  int b = sat_div(a, 3);
  return b;
}</code></pre><p>After inlining the <code>sat_div</code> function into <code>foo</code>, the compiler may simplify it to the following. It can do that because it knows that the second argument is always 3 when called from <code>foo</code>.</p><pre><code>int foo(int a) {
  // The generated assembly may look confusing because the
  // compiler turns division by a constant into a multiplication.
  return a / 3;
}</code></pre><p>These kinds of optimizations may not always be possible, but they are a potential upside that can further reduce the number of instructions the CPU needs to execute.</p><h3>Impact on IPC</h3><p>Again, the impact of function inlining on the program&#8217;s IPC is not direct; it comes through several indirect factors. Let&#8217;s see how.</p><h4><strong>Increased Register Pressure</strong></h4><p>Inlining large functions with many variables can increase register pressure and cause spills onto the stack. The compiler weighs factors such as register pressure and function size, so it usually will not inline a function if doing so would cause a spill. But if you are manually inlining a function, or forcing the compiler to inline it, this is a factor that may hurt the IPC of your program.</p><h4>Increased Code Size</h4><p>Inlining functions results in a larger code footprint because you are making copies of the function everywhere it is called. This increases instruction cache pressure and can lead to more cache misses. Frequent instruction cache misses can starve the CPU backend of new instructions to execute, causing the IPC to drop.</p><h4>Instruction Level Parallelism</h4><p>Inlining functions can increase instruction-level parallelism because it eliminates the branch introduced by the function call. It gives the CPU a larger window of instructions to analyze, which may enable it to find more work to do in parallel, thus improving the IPC.</p><h3>Function Inlining from the Lens of Iron Law</h3><p>Again, we can see that whether function inlining improves performance depends on several factors. But eventually, all of these factors show up in two metrics: the dynamic instruction count and the overall IPC of the program. 
The following table summarizes these tradeoffs.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rmb8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f764f-6395-4d4a-967d-b45491692181_1273x260.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rmb8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f764f-6395-4d4a-967d-b45491692181_1273x260.png 424w, https://substackcdn.com/image/fetch/$s_!rmb8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f764f-6395-4d4a-967d-b45491692181_1273x260.png 848w, https://substackcdn.com/image/fetch/$s_!rmb8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f764f-6395-4d4a-967d-b45491692181_1273x260.png 1272w, https://substackcdn.com/image/fetch/$s_!rmb8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f764f-6395-4d4a-967d-b45491692181_1273x260.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rmb8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f764f-6395-4d4a-967d-b45491692181_1273x260.png" width="1273" height="260" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/365f764f-6395-4d4a-967d-b45491692181_1273x260.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:260,&quot;width&quot;:1273,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:46805,&quot;alt&quot;:&quot;Analysis of function inlining and its tradeoffs from the lens of the Iron Law&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/164921281?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f764f-6395-4d4a-967d-b45491692181_1273x260.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Analysis of function inlining and its tradeoffs from the lens of the Iron Law" title="Analysis of function inlining and its tradeoffs from the lens of the Iron Law" srcset="https://substackcdn.com/image/fetch/$s_!rmb8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f764f-6395-4d4a-967d-b45491692181_1273x260.png 424w, https://substackcdn.com/image/fetch/$s_!rmb8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f764f-6395-4d4a-967d-b45491692181_1273x260.png 848w, https://substackcdn.com/image/fetch/$s_!rmb8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f764f-6395-4d4a-967d-b45491692181_1273x260.png 1272w, https://substackcdn.com/image/fetch/$s_!rmb8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f764f-6395-4d4a-967d-b45491692181_1273x260.png 1456w" 
sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Analysis of function inlining and its tradeoffs from the lens of the Iron Law</figcaption></figure></div><p>With loop unrolling and inlining covered, next we&#8217;ll see how SIMD vectorization affects instruction count, IPC, and even clock frequency.</p><div><hr></div><h2>SIMD Vectorization</h2><p>Single instruction multiple data (SIMD) is a widely used optimization technique for algorithms that perform the same operation on multiple data elements. It is particularly applicable in numeric computing, image processing, and similar domains. SIMD improves performance by significantly raising the IPC and lowering the dynamic instruction count, because the CPU is able to do more work in fewer instructions.</p><p>As an example, consider the following function and its assembly:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T4Xe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d3a6ee-58b3-47a9-a17a-17086816035a_1065x395.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T4Xe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d3a6ee-58b3-47a9-a17a-17086816035a_1065x395.png 424w, https://substackcdn.com/image/fetch/$s_!T4Xe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d3a6ee-58b3-47a9-a17a-17086816035a_1065x395.png 848w, https://substackcdn.com/image/fetch/$s_!T4Xe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d3a6ee-58b3-47a9-a17a-17086816035a_1065x395.png 
1272w, https://substackcdn.com/image/fetch/$s_!T4Xe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d3a6ee-58b3-47a9-a17a-17086816035a_1065x395.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T4Xe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d3a6ee-58b3-47a9-a17a-17086816035a_1065x395.png" width="1065" height="395" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e9d3a6ee-58b3-47a9-a17a-17086816035a_1065x395.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:395,&quot;width&quot;:1065,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:48330,&quot;alt&quot;:&quot;Left: C function to perform element wise addition of two float arrays and to store the result in a third array. Right: The assembly for the C function&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/164921281?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d3a6ee-58b3-47a9-a17a-17086816035a_1065x395.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Left: C function to perform element wise addition of two float arrays and to store the result in a third array. Right: The assembly for the C function" title="Left: C function to perform element wise addition of two float arrays and to store the result in a third array. 
Right: The assembly for the C function" srcset="https://substackcdn.com/image/fetch/$s_!T4Xe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d3a6ee-58b3-47a9-a17a-17086816035a_1065x395.png 424w, https://substackcdn.com/image/fetch/$s_!T4Xe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d3a6ee-58b3-47a9-a17a-17086816035a_1065x395.png 848w, https://substackcdn.com/image/fetch/$s_!T4Xe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d3a6ee-58b3-47a9-a17a-17086816035a_1065x395.png 1272w, https://substackcdn.com/image/fetch/$s_!T4Xe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d3a6ee-58b3-47a9-a17a-17086816035a_1065x395.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Left: C function to perform element wise addition of two float arrays and to store the result in a third array. Right: The assembly for the C function</figcaption></figure></div><p>The function is performing vector addition. I&#8217;ve used the <code>-O1</code> flag which prevents the compiler from vectorizing it. The label <code>.L3</code> contains the loop body. It executes 4 instructions per element of the vectors. If we manually vectorize the code, or use the <code>-O3</code> optimization flag, we get the following assembly output which uses SIMD instructions.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!npf7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f0f488-2421-4165-97cc-b88294564c8c_983x2195.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!npf7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f0f488-2421-4165-97cc-b88294564c8c_983x2195.png 424w, https://substackcdn.com/image/fetch/$s_!npf7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f0f488-2421-4165-97cc-b88294564c8c_983x2195.png 848w, 
https://substackcdn.com/image/fetch/$s_!npf7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f0f488-2421-4165-97cc-b88294564c8c_983x2195.png 1272w, https://substackcdn.com/image/fetch/$s_!npf7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f0f488-2421-4165-97cc-b88294564c8c_983x2195.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!npf7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f0f488-2421-4165-97cc-b88294564c8c_983x2195.png" width="983" height="2195" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/13f0f488-2421-4165-97cc-b88294564c8c_983x2195.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2195,&quot;width&quot;:983,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:286870,&quot;alt&quot;:&quot;The SIMD optimized version of the same vector addition function&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/164921281?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f0f488-2421-4165-97cc-b88294564c8c_983x2195.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The SIMD optimized version of the same vector addition function" title="The SIMD optimized version of the same vector addition function" srcset="https://substackcdn.com/image/fetch/$s_!npf7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f0f488-2421-4165-97cc-b88294564c8c_983x2195.png 424w, 
https://substackcdn.com/image/fetch/$s_!npf7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f0f488-2421-4165-97cc-b88294564c8c_983x2195.png 848w, https://substackcdn.com/image/fetch/$s_!npf7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f0f488-2421-4165-97cc-b88294564c8c_983x2195.png 1272w, https://substackcdn.com/image/fetch/$s_!npf7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f0f488-2421-4165-97cc-b88294564c8c_983x2195.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The SIMD optimized version of the same vector addition function</figcaption></figure></div><p>Here, the compiler has generated SIMD instructions like <code>vmovups</code> and <code>vaddps</code>, each of which can process 8 float values (256 bits) in a single instruction. This means that the SIMD version of the loop can process 8 elements in just 3 instructions. Compare that to the non-vectorized version shown previously, which executed 4 instructions per array element.</p><p>Now, let&#8217;s analyze how SIMD vectorization impacts program performance from the lens of the Iron Law.</p><h3>Impact on Instruction Count</h3><p>If you look at the vectorized assembly output above, it may appear as if the number of instructions has significantly increased. However, what matters is how many instructions the CPU eventually executes.</p><p>Typically, for large datasets, the number of instructions executed by the CPU drops significantly when SIMD is used, because each instruction operates on multiple data points simultaneously.</p><p>In the example above, the scalar version of the loop executed roughly four instructions per element: two loads, one addition, and one store. In contrast, the vectorized version performed the same work using just three SIMD instructions, each operating on eight elements at once. This reduces the instruction count per element from 4 to 0.375, yielding a theoretical 10.66&#215; reduction in instruction count within the vectorized loop.</p><p>So, in general, vectorization results in a massive decrease in the instruction count executed by the CPU.</p><h3>Impact on IPC</h3><p>Let&#8217;s analyze how SIMD instructions impact the IPC of programs.</p><h4>Increased Code Size and its Effects on Instruction Cache</h4><p>Vectorization increases the overall code footprint. 
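</p><p>The shape the compiler gives the vectorized loop makes the reason visible. Conceptually, the restructured loop looks like the following C sketch (illustrative only, not actual compiler output; the function name is hypothetical):</p>

```c
#include <stddef.h>

/* Conceptual sketch of how a vectorizing compiler restructures
 * an element-wise float addition. */
void vec_add(const float *a, const float *b, float *c, size_t n) {
    size_t i = 0;
    /* Main loop: each iteration corresponds to one 256-bit SIMD
     * addition covering 8 floats. */
    for (; i + 8 <= n; i += 8)
        for (size_t j = 0; j < 8; j++)
            c[i + j] = a[i + j] + b[i + j];
    /* Scalar epilogue for the leftover elements: extra code that
     * exists only because n need not be a multiple of 8. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

<p>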
One reason is the complexity of the vectorization algorithm and the extra instructions needed to process the leftover elements.</p><p>Apart from that, on x86, SIMD instructions tend to be longer than scalar ones (due to decades of backward-compatible extensions). When code is heavily vectorized, this increased instruction size leads to a larger static code footprint.</p><p>Also, some SIMD instructions have higher latency and take a few cycles to produce their results. This means that dependent instructions have to wait longer, lowering the overall instruction throughput. The trick to overcome this is to manually unroll the loop in code, which again means even larger code size.</p><p>In summary, vectorization inflates the static code footprint. This expanded footprint can evict hot code from the L1 instruction cache, causing front-end stalls and lowering IPC.</p><blockquote><p><em><strong>Note</strong>: While most tight vector loops will easily fit within a 32&#8211;64&#8239;KB I-cache, it is a tradeoff worth being aware of.</em></p></blockquote><h4>Impact on Instruction Fetch and Decode</h4><p>On x86, many SIMD instructions, especially AVX and AVX-512, are wider and more complex to decode than typical scalar instructions. This reduces how many instructions the CPU frontend can fetch and decode per cycle, lowering IPC.</p><p>Modern x86 processors can fetch 16 bytes and decode up to 4 instructions per cycle. With 3-byte scalar instructions, that&#8217;s enough bandwidth for peak throughput. But with 7-byte AVX-512 instructions, only 2 can be fetched per cycle, and decode throughput can drop to 1&#8211;2 instructions per cycle.</p><p>To mitigate this, CPUs use a &#956;op cache to store decoded instructions. During loop execution, this often hides the decode bottleneck. 
But for small vector sizes, where the loop runs only briefly, this overhead can dominate and the drop in IPC can negate any performance gains from vectorization.</p><h4>Register Pressure</h4><p>Vectorized code usually consists of a much larger set of instructions. As a result, more general-purpose registers are used within the vectorized loop, increasing register pressure.</p><p>Apart from that, sometimes the loop itself has to be unrolled to account for the higher latency of some SIMD instructions. This requires more SIMD registers, which are themselves in limited supply.</p><p>If vectorization results in register spills, the resulting memory accesses can decrease the overall IPC due to stalls while waiting for loads to return. If this becomes a dominant factor, it may eat up the gains provided by SIMD instructions.</p><blockquote><p><em><strong>Note</strong>: With AVX-512 providing 32 vector registers, the chances of a spill are slim, but it is worth mentioning for full coverage of the tradeoffs.</em></p></blockquote><h3>Impact on Clock Cycle Time</h3><p>Remember that the Iron Law has three terms: dynamic instruction count, IPC, and the CPU clock cycle time. So far, all the optimizations we discussed affected only the first two terms, but SIMD instructions impact the clock frequency as well.</p><p>On x86, certain SIMD instructions (AVX2 and AVX-512) cause the CPU clock frequency to dynamically scale down to keep power consumption and temperature under control. In other words, they increase the clock cycle time, negatively impacting performance.</p><p>From the Iron Law&#8217;s perspective, vectorization lowers the dynamic instruction count but also increases the clock cycle time. This results in an interesting tradeoff where the wins are not obvious.</p><p>SIMD wins are clearest when most work is vectorized, but in mixed workloads the reduced clock speed can hurt performance. 
For example, <a href="https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/">Cloudflare saw a 10% drop in web-server throughput</a> (requests served per second) when using vectorized hashing algorithms. They found that the CPU spent only 2.5% of the time in vectorized code, and the remaining time in scalar execution. In this case, the increased cycle times outweighed the SIMD gains&#8212;Iron Law in action.</p><h3>Analyzing SIMD from the Lens of Iron Law</h3><p>In summary, SIMD vectorization can slash instruction count and boost IPC, but it also increases cache footprint, register pressure, and may slow the clock. Here&#8217;s how these factors map to the Iron Law metrics:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!A4Sa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6d2fa7e-d56e-452b-a3e2-2c1445949240_1145x260.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!A4Sa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6d2fa7e-d56e-452b-a3e2-2c1445949240_1145x260.png 424w, https://substackcdn.com/image/fetch/$s_!A4Sa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6d2fa7e-d56e-452b-a3e2-2c1445949240_1145x260.png 848w, https://substackcdn.com/image/fetch/$s_!A4Sa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6d2fa7e-d56e-452b-a3e2-2c1445949240_1145x260.png 1272w, 
https://substackcdn.com/image/fetch/$s_!A4Sa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6d2fa7e-d56e-452b-a3e2-2c1445949240_1145x260.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!A4Sa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6d2fa7e-d56e-452b-a3e2-2c1445949240_1145x260.png" width="1145" height="260" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d6d2fa7e-d56e-452b-a3e2-2c1445949240_1145x260.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:260,&quot;width&quot;:1145,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43754,&quot;alt&quot;:&quot;Tradeoffs of SIMD vectorization from the lens of the Iron Law&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/164921281?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6d2fa7e-d56e-452b-a3e2-2c1445949240_1145x260.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Tradeoffs of SIMD vectorization from the lens of the Iron Law" title="Tradeoffs of SIMD vectorization from the lens of the Iron Law" srcset="https://substackcdn.com/image/fetch/$s_!A4Sa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6d2fa7e-d56e-452b-a3e2-2c1445949240_1145x260.png 424w, https://substackcdn.com/image/fetch/$s_!A4Sa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6d2fa7e-d56e-452b-a3e2-2c1445949240_1145x260.png 848w, 
https://substackcdn.com/image/fetch/$s_!A4Sa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6d2fa7e-d56e-452b-a3e2-2c1445949240_1145x260.png 1272w, https://substackcdn.com/image/fetch/$s_!A4Sa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6d2fa7e-d56e-452b-a3e2-2c1445949240_1145x260.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Tradeoffs of SIMD vectorization from the lens of the Iron Law</figcaption></figure></div><div><hr></div><h2>Optimizing Branch Mispredictions</h2><p>Branch mispredictions are a common performance bottleneck. When the CPU predicts a branch incorrectly, it must discard the speculatively executed instructions and start fetching from the correct address. This flush costs around 15 to 20 cycles, and sometimes more depending on the pipeline depth. These stalls hurt the CPU&#8217;s instruction throughput and directly lower the IPC.</p><p>To understand why prediction is needed in the first place, consider how the CPU executes instructions. The backend, which performs the actual computation, is fast and parallel. But it depends entirely on the frontend to deliver a steady stream of decoded instructions. This pipeline works well when code is linear. But when there is a conditional branch, the next instruction depends on the result of a previous comparison instruction. Waiting for that result introduces bubbles in the pipeline and wastes cycles.</p><p>To avoid this delay, the CPU speculatively picks a direction using its branch predictor and fetches instructions along that path. If the guess is correct, the pipeline stays full and performance remains high. If the guess is wrong, the CPU has to roll back the speculative work, fetch the correct instructions, and refill the pipeline. 
This flushing not only delays execution but also wastes frontend bandwidth.</p><p>Even though modern branch predictors are highly accurate, often over 95 percent, some branches are unpredictable by nature. Others suffer because the predictor's limited buffer space gets overloaded in large, complex programs.</p><p>To see how costly this can be, imagine a loop with one million iterations and a branch that is mispredicted five percent of the time. At a penalty of 20 cycles per miss, that is 50,000 misses at 20 cycles each, or one million wasted cycles. That is enough to wipe out the benefit of most other optimizations.</p><p>Let&#8217;s now analyze how optimizing branches affects performance using the Iron Law framework.</p><h3>Impact on Instruction Count</h3><p>The exact impact on the dynamic instruction count depends on the specific optimization and how it changes the code's execution. For example, you can reorder the conditions in a loop so that the most predictable and frequently taken branch comes first, reducing branch misses. In this case, the overall number of executed instructions may drop.</p><p>Sometimes you may replace a branch with branchless logic using bitwise operations or other techniques. This usually results in more instructions but eliminates the branch entirely, and with it the misprediction penalty. So you may end up executing more instructions but saving the misprediction penalties.</p><p>The point is that whether you see a significant improvement in performance largely depends on the IPC factor. Let&#8217;s analyze that.</p><h3>Impact on IPC</h3><p>Branch optimizations can impact the IPC in a variety of ways. </p><h4>Reduced Pipeline Stalls and Flushes</h4><p>When done well, fewer branch misses mean fewer pipeline stalls and flushes, which improves instruction delivery to the backend and raises the IPC. 
</p><h4>Impact on ILP</h4><p>Another aspect to consider with branch optimizations is their impact on instruction level parallelism.</p><p>Often, branchless implementations require more instructions than their branching counterparts and introduce serial dependencies. For example, consider the following function for conditionally swapping two values.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nk4m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4ab4b0-15b5-4134-85c2-67814c4a2850_1148x271.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nk4m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4ab4b0-15b5-4134-85c2-67814c4a2850_1148x271.png 424w, https://substackcdn.com/image/fetch/$s_!nk4m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4ab4b0-15b5-4134-85c2-67814c4a2850_1148x271.png 848w, https://substackcdn.com/image/fetch/$s_!nk4m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4ab4b0-15b5-4134-85c2-67814c4a2850_1148x271.png 1272w, https://substackcdn.com/image/fetch/$s_!nk4m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4ab4b0-15b5-4134-85c2-67814c4a2850_1148x271.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nk4m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4ab4b0-15b5-4134-85c2-67814c4a2850_1148x271.png" width="1148" height="271" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8c4ab4b0-15b5-4134-85c2-67814c4a2850_1148x271.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:271,&quot;width&quot;:1148,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43136,&quot;alt&quot;:&quot;C function and its assembly for conditionally swapping two values&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/164921281?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4ab4b0-15b5-4134-85c2-67814c4a2850_1148x271.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="C function and its assembly for conditionally swapping two values" title="C function and its assembly for conditionally swapping two values" srcset="https://substackcdn.com/image/fetch/$s_!nk4m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4ab4b0-15b5-4134-85c2-67814c4a2850_1148x271.png 424w, https://substackcdn.com/image/fetch/$s_!nk4m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4ab4b0-15b5-4134-85c2-67814c4a2850_1148x271.png 848w, https://substackcdn.com/image/fetch/$s_!nk4m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4ab4b0-15b5-4134-85c2-67814c4a2850_1148x271.png 1272w, https://substackcdn.com/image/fetch/$s_!nk4m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4ab4b0-15b5-4134-85c2-67814c4a2850_1148x271.png 1456w" sizes="100vw" 
loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">C function and its assembly for conditionally swapping two values</figcaption></figure></div><p>And the following figure shows a branchless way of implementing the same function along with its assembly output.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hijV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552f080e-cb28-48ba-90ce-3b33337fc939_1380x347.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hijV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552f080e-cb28-48ba-90ce-3b33337fc939_1380x347.png 424w, https://substackcdn.com/image/fetch/$s_!hijV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552f080e-cb28-48ba-90ce-3b33337fc939_1380x347.png 848w, https://substackcdn.com/image/fetch/$s_!hijV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552f080e-cb28-48ba-90ce-3b33337fc939_1380x347.png 1272w, https://substackcdn.com/image/fetch/$s_!hijV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552f080e-cb28-48ba-90ce-3b33337fc939_1380x347.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hijV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552f080e-cb28-48ba-90ce-3b33337fc939_1380x347.png" width="1380" height="347" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/552f080e-cb28-48ba-90ce-3b33337fc939_1380x347.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:347,&quot;width&quot;:1380,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:77996,&quot;alt&quot;:&quot;A branchless way of implementing conditional swap&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/164921281?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552f080e-cb28-48ba-90ce-3b33337fc939_1380x347.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A branchless way of implementing conditional swap" title="A branchless way of implementing conditional swap" srcset="https://substackcdn.com/image/fetch/$s_!hijV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552f080e-cb28-48ba-90ce-3b33337fc939_1380x347.png 424w, https://substackcdn.com/image/fetch/$s_!hijV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552f080e-cb28-48ba-90ce-3b33337fc939_1380x347.png 848w, https://substackcdn.com/image/fetch/$s_!hijV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552f080e-cb28-48ba-90ce-3b33337fc939_1380x347.png 1272w, https://substackcdn.com/image/fetch/$s_!hijV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552f080e-cb28-48ba-90ce-3b33337fc939_1380x347.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft 
pc-display-flex pc-gap-8 pc-reset"></div></div></div></a><figcaption class="image-caption">A branchless way of implementing conditional swap</figcaption></figure></div><p>The branchless version clearly has more instructions. </p><p>The problematic part, however, is that the branchless version creates chains of dependent instructions, which lowers the ILP. In the example above, the compiler used <code>cmov</code> instructions, which conditionally copy a value depending on the result of a condition. These instructions sidestep the branch predictor, but they introduce data dependencies. 
Any subsequent instruction that depends on a result produced by a prior <code>cmov</code> instruction has to wait until the <code>cmov</code> instruction has finished. This lowers the potential ILP.</p><p>The following figure shows another example: a branchless function that computes the max of two integers. Again, you can see that most instructions read and write the <code>eax</code> register, creating a dependency chain among them. The CPU has to execute these sequentially, resulting in significantly lower ILP.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!otWx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4459bdd-cd1a-41e4-9d2d-c3f0ff971ced_768x220.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!otWx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4459bdd-cd1a-41e4-9d2d-c3f0ff971ced_768x220.png 424w, https://substackcdn.com/image/fetch/$s_!otWx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4459bdd-cd1a-41e4-9d2d-c3f0ff971ced_768x220.png 848w, https://substackcdn.com/image/fetch/$s_!otWx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4459bdd-cd1a-41e4-9d2d-c3f0ff971ced_768x220.png 1272w, https://substackcdn.com/image/fetch/$s_!otWx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4459bdd-cd1a-41e4-9d2d-c3f0ff971ced_768x220.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!otWx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4459bdd-cd1a-41e4-9d2d-c3f0ff971ced_768x220.png" width="768" height="220" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c4459bdd-cd1a-41e4-9d2d-c3f0ff971ced_768x220.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:220,&quot;width&quot;:768,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:31449,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/164921281?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4459bdd-cd1a-41e4-9d2d-c3f0ff971ced_768x220.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!otWx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4459bdd-cd1a-41e4-9d2d-c3f0ff971ced_768x220.png 424w, https://substackcdn.com/image/fetch/$s_!otWx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4459bdd-cd1a-41e4-9d2d-c3f0ff971ced_768x220.png 848w, https://substackcdn.com/image/fetch/$s_!otWx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4459bdd-cd1a-41e4-9d2d-c3f0ff971ced_768x220.png 1272w, https://substackcdn.com/image/fetch/$s_!otWx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4459bdd-cd1a-41e4-9d2d-c3f0ff971ced_768x220.png 1456w" sizes="100vw" 
loading="lazy"></picture><div></div></div></a></figure></div><p>So, the moral of the story is that <strong>branchless code comes at the cost of reduced ILP (and thus lower IPC).</strong> It is only worth doing when the branch predictor is showing a high number of misses for that code. Otherwise, optimizing a highly predictable branch can backfire, because the drop in IPC will dominate everything else. </p><h3>Analyzing Branch Optimization from the Lens of Iron Law</h3><p>Branch optimization is tricky to get right, but its performance effects are relatively easy to reason about from the lens of the Iron Law. Here&#8217;s a concise overview:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NOC1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F174f5c42-c7ae-4832-8b00-f82a8a232fc9_1098x212.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NOC1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F174f5c42-c7ae-4832-8b00-f82a8a232fc9_1098x212.png 424w, https://substackcdn.com/image/fetch/$s_!NOC1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F174f5c42-c7ae-4832-8b00-f82a8a232fc9_1098x212.png 848w, https://substackcdn.com/image/fetch/$s_!NOC1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F174f5c42-c7ae-4832-8b00-f82a8a232fc9_1098x212.png 1272w, 
https://substackcdn.com/image/fetch/$s_!NOC1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F174f5c42-c7ae-4832-8b00-f82a8a232fc9_1098x212.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NOC1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F174f5c42-c7ae-4832-8b00-f82a8a232fc9_1098x212.png" width="1098" height="212" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/174f5c42-c7ae-4832-8b00-f82a8a232fc9_1098x212.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:212,&quot;width&quot;:1098,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:25569,&quot;alt&quot;:&quot;Analysing branch optimizations from the lens of the Iron Law&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/164921281?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F174f5c42-c7ae-4832-8b00-f82a8a232fc9_1098x212.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Analysing branch optimizations from the lens of the Iron Law" title="Analysing branch optimizations from the lens of the Iron Law" srcset="https://substackcdn.com/image/fetch/$s_!NOC1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F174f5c42-c7ae-4832-8b00-f82a8a232fc9_1098x212.png 424w, https://substackcdn.com/image/fetch/$s_!NOC1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F174f5c42-c7ae-4832-8b00-f82a8a232fc9_1098x212.png 848w, 
https://substackcdn.com/image/fetch/$s_!NOC1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F174f5c42-c7ae-4832-8b00-f82a8a232fc9_1098x212.png 1272w, https://substackcdn.com/image/fetch/$s_!NOC1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F174f5c42-c7ae-4832-8b00-f82a8a232fc9_1098x212.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Analysing branch optimizations from the lens of the Iron Law</figcaption></figure></div><div><hr></div><h2>Analyzing Cache Misses from the Lens of Iron Law</h2><p>The final optimization we&#8217;ll discuss is minimizing cache misses. I&#8217;ll keep this section brief. The goal is to complete the picture and demonstrate that the Iron Law applies to almost every low-level optimization you might attempt.</p><p>Data cache misses can significantly reduce IPC. When a load instruction misses the cache, it must wait hundreds of cycles for the data to arrive from main memory. During this time, it occupies CPU resources such as a reservation station slot, a reorder buffer slot, and a physical register. Any instructions that depend on the result of this load cannot proceed either. They remain in the reorder buffer until the data becomes available.</p><blockquote><p><em>If terms like reorder buffer and reservation station are new to you, I suggest reading my article on the <a href="https://blog.codingconfessions.com/p/simultaneous-multithreading">microarchitecture of SMT processors</a>, which touches on these details.</em></p></blockquote><p>When enough of these long-latency instructions accumulate, they create backpressure in the backend. Reservation stations fill up, the reorder buffer approaches capacity, and eventually the frontend is forced to stop issuing new instructions. 
This limits parallelism and reduces the number of instructions executed per cycle.</p><p>Optimizations like structure padding, blocking, data layout transformations, and software prefetching aim to reduce miss rates and improve IPC. However, they often increase the instruction count (both static and dynamic) because of added pointer arithmetic, bounds checks, or loop restructuring. The tradeoff is clear: does the reduction in stalls outweigh the overhead of the extra instructions?</p><p>We will not go into further detail here, but the pattern should feel familiar by now.</p><div><hr></div><h2>Conclusion</h2><p>We began the article with a bold claim: that one law can explain all low-level code optimizations. After walking through multiple examples, that claim should feel much less dramatic.</p><p>Most low-level optimizations shift one or more of the Iron Law variables: instruction count, IPC, or clock cycle time. We often focus on isolated effects, like reducing cache misses or branches, and get confused when performance doesn&#8217;t improve. That confusion disappears once we look at the bigger picture.</p><p>The Iron Law gives us a way to step back and take that wider view. It helps us reason about trade-offs clearly, without relying on guesswork. When combined with tools like <code>perf</code>, which report instruction counts, IPC, and backend stalls directly, it becomes easier to understand not just whether something changed, but whether it changed in the right direction.</p><p>This model doesn&#8217;t apply to every kind of optimization. It won&#8217;t help with algorithmic improvements or garbage collection tuning. But for the kind of low-level performance work that many developers struggle to reason about, it provides a clear lens.</p><p>So next time you&#8217;re tuning code, ask yourself which of the three metrics you&#8217;re moving. 
You might find that one law really does explain more than you expected.</p>]]></content:encoded></item><item><title><![CDATA[Debugging X86-64 Assembly with GDB]]></title><description><![CDATA[Watch now (20 mins) | Learn how to inspect registers, step through instructions, and investigate crashes using GDB.]]></description><link>https://blog.codingconfessions.com/p/debugging-x86-64-assembly-with-gdb</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/debugging-x86-64-assembly-with-gdb</guid><dc:creator><![CDATA[Abhinav 
Upadhyay]]></dc:creator><pubDate>Mon, 26 May 2025 18:13:24 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/164500692/60f99365639d093d80c55eaedd27f503.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>We ended the last article with a minimal x86-64 assembly program that assembled and ran, but then crashed with a segmentation fault. Before we move on to fix that properly, this is a good opportunity to step back and understand how to debug such issues.</p><p>In this part, we'll use <code>gdb</code> to investigate what exactly went wrong. You'll learn how to step through your program instruction by instruction, inspect memory and register values, and get a better sense of how the CPU executes your code. No new assembly instructions yet, just the tools to understand what you're building.</p><blockquote><p>If you haven&#8217;t seen the previous articles in the series, here&#8217;s what you have missed:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;d7582c4b-e475-4057-a4bd-04e281c76388&quot;,&quot;caption&quot;:&quot;&#8220;Do not try to bend the spoon. That's impossible. Instead, only try to realize the truth... there is no spoon.&#8221; &#8212; The Matrix&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Understanding Computer Organization from First Principles&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. 
I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-04-05T17:54:52.832Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9c6b6f3-e65a-46be-ada9-68a166fbfcf8_1024x1536.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/seeing-the-matrix&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:160249113,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:105,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;6722fdc9-a75e-4481-bd38-0b51b631be3b&quot;,&quot;caption&quot;:&quot;We wrapped up the X86-64 assembly course last week, and I&#8217;ll be sharing notes from the sessions here as a series of articles. While the live sessions covered much more ground, I think you&#8217;ll find these write-ups valuable in their own right. 
I&#8217;ll be publishing them gradually over the next few weeks.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The System-Level Foundation of Assembly&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-05-05T08:36:27.146Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/77cbf196-9752-4ec4-a24e-0ad8e8124cb3_627x551.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/the-system-level-foundation-of-assembly&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:162823255,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:24,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" 
data-attrs="{&quot;nodeId&quot;:&quot;d49b743f-992c-4973-a36c-d2e1521e0a54&quot;,&quot;caption&quot;:&quot;In our previous article, we explored how computers work from transistors up to program execution. We saw how digital circuits built from logic gates perform calculations using binary data, and how the ALU executes operations on this binary representation.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Binary Arithmetic and Bitwise Operations for Systems Programming&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-04-12T05:16:14.645Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/binary-arithmetic-and-bitwise-operations&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:161089202,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:27,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code 
Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;2df7daf2-d043-4abe-89db-30a6b70f83ef&quot;,&quot;caption&quot;:&quot;Introduction&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Building (and Breaking) Your First X86 Assembly Program&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the 
surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-05-16T14:33:57.354Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc44d6555-a755-40ae-b180-dfccbddcaad2_1024x1024.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/building-and-breaking-your-first&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:160056784,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:8,&quot;comment_count&quot;:3,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div></blockquote><div><hr></div><p><em><strong>This article is part of a paid subscriber series.</strong><br>If you&#8217;re enjoying the content, please consider upgrading to a paid plan to unlock the rest of this series. Paid subscribers also get access to recordings of past live sessions, early and discounted access to courses and books, and more.</em></p><p><em>Alternatively, you can purchase an ebook version of this series. 
(If you're already a paid subscriber, email me for a discounted link.)</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://codingconfessions.gumroad.com/l/ychdk&quot;,&quot;text&quot;:&quot;I Want the PDF&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://codingconfessions.gumroad.com/l/ychdk"><span>I Want the PDF</span></a></p><div><hr></div><h2>Quick Recap</h2><p>Let&#8217;s start with a short recap of where we left things in the previous article.</p><p>We learned enough assembly to write the following program</p><pre><code><code># create the text section
.section .text

# mark the _start label as globally visible for linking
.globl _start 

_start:
  movq $32, %rdi
  movq $10, %rsi
  addq %rsi, %rdi</code></code></pre><p>Then we assembled and ran it as follows:</p><pre><code><code>as -o false.o false.s
ld -o false false.o
./false
Segmentation fault (core dumped)</code></code></pre><p>The program is mysteriously crashing, and our goal is to identify why. We will use the debugger, gdb, to find out.</p><h2>Assembling with Debug Symbols</h2><p>To debug the program with gdb, the binary needs to include a debug symbol table so that the debugger can show us the source code as we step through it. By default, the assembler and compiler don&#8217;t include debug symbols in the binary because they increase its size, which can slow down program startup.</p><p>So first, we need to reassemble our program with debug symbols. To do that with the GNU assembler, we use the <code>-gstabs</code> option, as shown below.</p><pre><code><code>as -o false.o -gstabs false.s</code></code></pre><p>The linking step remains the same. If the object file produced by the assembler contains debug symbol tables, the linker carries them into the final executable without us asking.</p><pre><code><code>ld -o false false.o</code></code></pre><h2>Debugging with GDB</h2><p>The simplest way to debug a program with gdb is to start the program under gdb, as shown below.</p><pre><code><code>gdb ./false</code></code></pre><p>Once you do that, you will see the gdb prompt, as in the screenshot below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NndE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F066afb67-7424-4fc5-9b77-c683d4d16f6c_916x599.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp"
srcset="https://substackcdn.com/image/fetch/$s_!NndE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F066afb67-7424-4fc5-9b77-c683d4d16f6c_916x599.png 424w, https://substackcdn.com/image/fetch/$s_!NndE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F066afb67-7424-4fc5-9b77-c683d4d16f6c_916x599.png 848w, https://substackcdn.com/image/fetch/$s_!NndE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F066afb67-7424-4fc5-9b77-c683d4d16f6c_916x599.png 1272w, https://substackcdn.com/image/fetch/$s_!NndE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F066afb67-7424-4fc5-9b77-c683d4d16f6c_916x599.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NndE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F066afb67-7424-4fc5-9b77-c683d4d16f6c_916x599.png" width="916" height="599" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/066afb67-7424-4fc5-9b77-c683d4d16f6c_916x599.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:599,&quot;width&quot;:916,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:89999,&quot;alt&quot;:&quot;The GDB prompt&quot;,&quot;title&quot;:&quot;The GDB 
prompt&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/161279838?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F066afb67-7424-4fc5-9b77-c683d4d16f6c_916x599.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The GDB prompt" title="The GDB prompt" srcset="https://substackcdn.com/image/fetch/$s_!NndE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F066afb67-7424-4fc5-9b77-c683d4d16f6c_916x599.png 424w, https://substackcdn.com/image/fetch/$s_!NndE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F066afb67-7424-4fc5-9b77-c683d4d16f6c_916x599.png 848w, https://substackcdn.com/image/fetch/$s_!NndE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F066afb67-7424-4fc5-9b77-c683d4d16f6c_916x599.png 1272w, https://substackcdn.com/image/fetch/$s_!NndE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F066afb67-7424-4fc5-9b77-c683d4d16f6c_916x599.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The GDB prompt</figcaption></figure></div><p>The <code>(gdb)</code> marker is the GDB prompt, where we enter debugging commands. The first command we will learn is how to set a breakpoint.</p><h3>Setting and Hitting Breakpoints</h3>
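<p>Before we get there, here is a minimal sketch of the kind of session we are about to walk through. The commands are standard gdb, but treat this sequence as a preview rather than the article&#8217;s actual session; the exact output on your machine will differ:</p>

```
(gdb) break _start        # stop before the first instruction executes
(gdb) run                 # start the program; it pauses at the breakpoint
(gdb) stepi               # execute a single machine instruction
(gdb) info registers rdi  # inspect the current value of a register
(gdb) continue            # resume execution until the next stop (or crash)
```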
      <p>
          <a href="https://blog.codingconfessions.com/p/debugging-x86-64-assembly-with-gdb">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Building (and Breaking) Your First X86 Assembly Program]]></title><description><![CDATA[We build a minimal X86 assembly program, run it&#8230; and hit a crash. But that crash is exactly what makes this program worth writing.]]></description><link>https://blog.codingconfessions.com/p/building-and-breaking-your-first</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/building-and-breaking-your-first</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Fri, 16 May 2025 14:33:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xvcB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc44d6555-a755-40ae-b180-dfccbddcaad2_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Introduction</h2><blockquote><p><strong>Recap from the previous article:</strong> In the last couple of articles, we built a simple but complete mental model of how a basic computer executes instructions. We explored how an ALU performs arithmetic operations, how registers serve as fast-access storage, and how the control unit fetches and executes instructions stored in memory. We also introduced the structure of assembly programs, the use of labels, and how data and instructions are laid out in different sections like <code>.text</code> and <code>.data</code>. That article provided the conceptual foundation we need to now dive into real X86-64 assembly code.</p></blockquote><p>This article is part of my series on the basics of X86-64 assembly programming. Until now, we have been working mostly with ideas. We talked about what it means for a computer to execute a program, how computation is carried out by hardware, and how memory is laid out to store data and instructions. We have seen snippets of assembly here and there, but we haven&#8217;t written a full program yet. 
That changes now.</p><p>In this article, we will write our first complete (well, almost) assembly program. It won&#8217;t do anything exciting, but that&#8217;s the point. Like &#8220;Hello, world&#8221; in high-level languages, this program is just a vehicle to help us understand the mechanics of how an assembly program is written, assembled, linked, and executed. Along the way, we&#8217;ll revisit some of the concepts we&#8217;ve discussed before and see how they manifest in actual code.</p><blockquote><p>If you haven&#8217;t read the previous articles in this series, here&#8217;s what you have missed:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;6fbd0d96-76da-4809-9486-049d65f12cca&quot;,&quot;caption&quot;:&quot;&#8220;Do not try to bend the spoon. That's impossible. Instead, only try to realize the truth... there is no spoon.&#8221; &#8212; The Matrix&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Seeing the Matrix: A First-Principles Approach to Computer Architecture&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. 
I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-04-05T17:54:52.832Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9c6b6f3-e65a-46be-ada9-68a166fbfcf8_1024x1536.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/seeing-the-matrix&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:160249113,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:39,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;6bab7f60-e99d-4fc0-a2f0-1fbab0d71b88&quot;,&quot;caption&quot;:&quot;In this article, we&#8217;ll trace how the structure of assembly programs is shaped by the expectations of the hardware&#8217;s execution model. 
We&#8217;ll follow this causal path through the operating system and the ELF file format, and explain why assembly programs are written in terms of sections and labels, like the one shown in the figure below (don&#8217;t worry if it looks alien, I promise that by the end of the article you will understand what each line of this code is doing).&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The System-Level Foundation of Assembly&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-05-05T08:36:27.146Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/77cbf196-9752-4ec4-a24e-0ad8e8124cb3_627x551.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/the-system-level-foundation-of-assembly&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:162823255,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:23,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code 
Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;0018b8cc-e03b-4a49-b316-93ad6a63d6de&quot;,&quot;caption&quot;:&quot;In our previous article, we explored how computers work from transistors up to program execution. We saw how digital circuits built from logic gates perform calculations using binary data, and how the ALU executes operations on this binary representation.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Binary Arithmetic and Bitwise Operations for Systems Programming&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. 
I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-04-12T05:16:14.645Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/binary-arithmetic-and-bitwise-operations&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:161089202,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:26,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div></blockquote><div><hr></div><p><em><strong>This article is part of a paid subscriber series.</strong><br>If you&#8217;re enjoying the content, please consider upgrading to a paid plan to unlock the rest of this series. Paid subscribers also get access to recordings of past live sessions, early and discounted access to courses and books, and more.</em></p><p><em>Alternatively, you can purchase an ebook version of this series. 
(If you're already a paid subscriber, email me for a discounted link.)</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://codingconfessions.gumroad.com/l/ychdk&quot;,&quot;text&quot;:&quot;I Want the PDF&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://codingconfessions.gumroad.com/l/ychdk"><span>I Want the PDF</span></a></p>
      <p>
          <a href="https://blog.codingconfessions.com/p/building-and-breaking-your-first">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[The System-Level Foundation of Assembly]]></title><description><![CDATA[Tracing how the CPU, OS, and ELF format shape the structure of your assembly code]]></description><link>https://blog.codingconfessions.com/p/the-system-level-foundation-of-assembly</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/the-system-level-foundation-of-assembly</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Mon, 05 May 2025 08:36:27 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/77cbf196-9752-4ec4-a24e-0ad8e8124cb3_627x551.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div><hr></div><p><em>We wrapped up the <a href="https://blog.codingconfessions.com/p/course-launch-hands-on-introduction">X86-64 assembly course</a> last week, and I&#8217;ll be sharing notes from the sessions here as a series of articles. While the live sessions covered much more ground, I think you&#8217;ll find these write-ups valuable in their own right. I&#8217;ll be publishing them gradually over the next few weeks. </em></p><p><em>I&#8217;ll announce the next run of the course soon. Paid subscribers get early access and a discount, so upgrade now if you&#8217;d like to reserve your spot.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.codingconfessions.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.codingconfessions.com/subscribe?"><span>Subscribe now</span></a></p><p>I&#8217;m also compiling this article series into a cleanly typeset PDF edition, available for purchase below. 
If you're a paid subscriber, email me for a discount code.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://codingconfessions.gumroad.com/l/ychdk&quot;,&quot;text&quot;:&quot;Get the Ebook&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://codingconfessions.gumroad.com/l/ychdk"><span>Get the Ebook</span></a></p><div><hr></div><p>In the <a href="https://blog.codingconfessions.com/p/seeing-the-matrix">previous article</a>, we looked at how a CPU executes instructions by fetching them from memory, decoding them, executing them, and repeating the cycle. This gave us a foundational understanding of how the hardware runs a program.</p><p>Now we can start writing some assembly programs, but jumping in immediately would mean skipping a few important abstraction layers, and that might leave some gaps in our understanding.</p><p>Specifically, an assembly program is translated into an executable binary by the assembler and the linker. That binary is then loaded into memory by the operating system so that the CPU can begin executing it. So there&#8217;s quite a bit that happens between writing assembly and running it on real hardware.</p><p>The operating system must load the executable into memory in a layout that follows the hardware&#8217;s execution model: instructions must be contiguous in memory, and instructions and data must be kept separate. To enable this layout, the executable binary must reflect this structure. Since the binary is generated from the assembly source files, the assembly program must also follow a structure that matches. All of this is linked.</p><p>In this article, we&#8217;ll trace how the structure of assembly programs is shaped by the expectations of the hardware&#8217;s execution model. 
We&#8217;ll follow this causal path through the operating system and the <a href="https://en.wikipedia.org/wiki/Executable_and_Linkable_Format">ELF file format</a>, and explain why assembly programs are written in terms of sections and labels, like the one shown in the figure below (don&#8217;t worry if it looks alien, I promise that by the end of the article you will understand what each line of this code is doing).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!W6Xb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff633e54b-5d0a-4c1f-9517-8ac593517d8d_1017x316.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!W6Xb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff633e54b-5d0a-4c1f-9517-8ac593517d8d_1017x316.png 424w, https://substackcdn.com/image/fetch/$s_!W6Xb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff633e54b-5d0a-4c1f-9517-8ac593517d8d_1017x316.png 848w, https://substackcdn.com/image/fetch/$s_!W6Xb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff633e54b-5d0a-4c1f-9517-8ac593517d8d_1017x316.png 1272w, https://substackcdn.com/image/fetch/$s_!W6Xb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff633e54b-5d0a-4c1f-9517-8ac593517d8d_1017x316.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!W6Xb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff633e54b-5d0a-4c1f-9517-8ac593517d8d_1017x316.png" width="1017" height="316" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f633e54b-5d0a-4c1f-9517-8ac593517d8d_1017x316.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:316,&quot;width&quot;:1017,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:74076,&quot;alt&quot;:&quot;The skeletal structure of an X86 assembly program&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/162823255?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff633e54b-5d0a-4c1f-9517-8ac593517d8d_1017x316.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The skeletal structure of an X86 assembly program" title="The skeletal structure of an X86 assembly program" srcset="https://substackcdn.com/image/fetch/$s_!W6Xb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff633e54b-5d0a-4c1f-9517-8ac593517d8d_1017x316.png 424w, https://substackcdn.com/image/fetch/$s_!W6Xb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff633e54b-5d0a-4c1f-9517-8ac593517d8d_1017x316.png 848w, https://substackcdn.com/image/fetch/$s_!W6Xb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff633e54b-5d0a-4c1f-9517-8ac593517d8d_1017x316.png 1272w, 
https://substackcdn.com/image/fetch/$s_!W6Xb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff633e54b-5d0a-4c1f-9517-8ac593517d8d_1017x316.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure-1: The skeletal structure of an X86 assembly program</figcaption></figure></div><p><strong>By the end of the article, we will learn the following:</strong></p><ul><li><p>How the operating system structures a process's memory layout and why that structure is required</p></li><li><p>How ELF executables are structured and why they follow 
that layout</p></li><li><p>What sections and labels are in assembly programming and why you need to use them</p></li></ul><div><hr></div><h2>Recap: Hardware's Instruction Execution Model</h2><p>Let's start with a quick recap of the hardware's instruction execution cycle, which we covered in detail in the previous article. This background is the foundation for everything we will learn in this article.</p><ul><li><p>The processor has a special register called the instruction pointer register, which contains the address of the instruction that the processor has to execute next.</p></li><li><p>The processor also has a component called the control unit, which is responsible for orchestrating the execution of program instructions on the hardware. The control unit begins by fetching the instruction located at the memory address held in the instruction pointer register.</p></li><li><p>The control unit then decodes the instruction to identify the opcode and the operands. For example, the opcode may be add and the operands could be the registers R8 and R9.</p></li><li><p>Next, the control unit sends control signals to the register file and execution units to execute the instruction.</p></li><li><p>After this, the instruction pointer register is incremented by the size of the current instruction so that it now holds the address of the next instruction.</p></li><li><p>And this cycle repeats.</p></li></ul><p>This is the hardware execution model. To enable the hardware to execute your code, you must load your program&#8217;s instructions and data into memory and then update the instruction pointer with the address of the first instruction of your program.</p><p>Fortunately, we don&#8217;t need to do this ourselves; the operating system does it for us. And how the operating system does this is directly shaped by the hardware&#8217;s expectations. 
Let&#8217;s see how the operating system sets up a new process for execution.</p><h2>Process Setup and Memory Layout</h2><p>The hardware instruction execution model puts certain requirements on the OS when creating a new process:</p><ul><li><p>The instruction pointer must contain the address of the program&#8217;s first instruction in memory.</p></li><li><p>The subsequent instructions must be stored contiguously in memory, because after the first instruction the hardware simply increments the instruction pointer to find the address of the next one. This model requires that all instructions be placed next to each other in memory.</p><ul><li><p>Figure-2 shows this visually: you can see how, by simply incrementing the address in the instruction pointer, the CPU can advance through your program to execute it.</p></li></ul></li><li><p>A third, often unstated, requirement comes from security concerns: code and data must be kept separate in memory. The hardware itself doesn&#8217;t distinguish between them, so a malicious program could modify its own data to insert harmful instructions. If the CPU then executes that data as code, it could lead to serious exploits. 
To prevent this, it's essential that:</p><ul><li><p>Instructions are stored in a memory region with read and execute<strong> </strong>permissions, but not write</p></li><li><p>Data is stored in a separate region with read and write permissions, but not execute &#8212; except in special cases like JIT compilers</p></li></ul></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gblH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6ed540-4984-4bd3-9a38-c30f64a8b9b6_1126x442.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gblH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6ed540-4984-4bd3-9a38-c30f64a8b9b6_1126x442.png 424w, https://substackcdn.com/image/fetch/$s_!gblH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6ed540-4984-4bd3-9a38-c30f64a8b9b6_1126x442.png 848w, https://substackcdn.com/image/fetch/$s_!gblH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6ed540-4984-4bd3-9a38-c30f64a8b9b6_1126x442.png 1272w, https://substackcdn.com/image/fetch/$s_!gblH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6ed540-4984-4bd3-9a38-c30f64a8b9b6_1126x442.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gblH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6ed540-4984-4bd3-9a38-c30f64a8b9b6_1126x442.png" width="1126" height="442" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/df6ed540-4984-4bd3-9a38-c30f64a8b9b6_1126x442.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:442,&quot;width&quot;:1126,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:46471,&quot;alt&quot;:&quot;Figure: Instructions are stored contiguously in memory. The instruction pointer register advances through these instructions by incrementing its value after each instruction is executed.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/162823255?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6ed540-4984-4bd3-9a38-c30f64a8b9b6_1126x442.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure: Instructions are stored contiguously in memory. The instruction pointer register advances through these instructions by incrementing its value after each instruction is executed." title="Figure: Instructions are stored contiguously in memory. The instruction pointer register advances through these instructions by incrementing its value after each instruction is executed." 
srcset="https://substackcdn.com/image/fetch/$s_!gblH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6ed540-4984-4bd3-9a38-c30f64a8b9b6_1126x442.png 424w, https://substackcdn.com/image/fetch/$s_!gblH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6ed540-4984-4bd3-9a38-c30f64a8b9b6_1126x442.png 848w, https://substackcdn.com/image/fetch/$s_!gblH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6ed540-4984-4bd3-9a38-c30f64a8b9b6_1126x442.png 1272w, https://substackcdn.com/image/fetch/$s_!gblH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6ed540-4984-4bd3-9a38-c30f64a8b9b6_1126x442.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><strong>Figure-2</strong>: Instructions are stored contiguously in memory. The instruction pointer register advances through these instructions by incrementing its value after each instruction is executed.</figcaption></figure></div><p>So, when the OS creates a new process, it organizes the address space layout of the process to satisfy these hardware-level requirements. The following diagram shows this layout, and you can clearly see that it is split into distinct memory segments, such as <code>.text</code> and <code>.data</code>, which hold the program instructions and static data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fTcN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94eecbd2-b91d-4c5c-bbab-41ce1ffb865a_580x966.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fTcN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94eecbd2-b91d-4c5c-bbab-41ce1ffb865a_580x966.png 424w, https://substackcdn.com/image/fetch/$s_!fTcN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94eecbd2-b91d-4c5c-bbab-41ce1ffb865a_580x966.png 848w, 
https://substackcdn.com/image/fetch/$s_!fTcN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94eecbd2-b91d-4c5c-bbab-41ce1ffb865a_580x966.png 1272w, https://substackcdn.com/image/fetch/$s_!fTcN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94eecbd2-b91d-4c5c-bbab-41ce1ffb865a_580x966.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fTcN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94eecbd2-b91d-4c5c-bbab-41ce1ffb865a_580x966.png" width="580" height="966" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/94eecbd2-b91d-4c5c-bbab-41ce1ffb865a_580x966.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:966,&quot;width&quot;:580,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure-3: The address space layout of a process. Each box represents a distinct region of memory called a segment. Common segments are .text, .data, .bss, stack, heap. &quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure-3: The address space layout of a process. Each box represents a distinct region of memory called a segment. Common segments are .text, .data, .bss, stack, heap. " title="Figure-3: The address space layout of a process. Each box represents a distinct region of memory called a segment. Common segments are .text, .data, .bss, stack, heap. 
" srcset="https://substackcdn.com/image/fetch/$s_!fTcN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94eecbd2-b91d-4c5c-bbab-41ce1ffb865a_580x966.png 424w, https://substackcdn.com/image/fetch/$s_!fTcN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94eecbd2-b91d-4c5c-bbab-41ce1ffb865a_580x966.png 848w, https://substackcdn.com/image/fetch/$s_!fTcN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94eecbd2-b91d-4c5c-bbab-41ce1ffb865a_580x966.png 1272w, https://substackcdn.com/image/fetch/$s_!fTcN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94eecbd2-b91d-4c5c-bbab-41ce1ffb865a_580x966.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure-3: The address space layout of a process. Each box represents a distinct region of memory called a segment. Common segments are .text, .data, .bss, stack, heap. </figcaption></figure></div><blockquote><p><em>You may have seen this memory layout of a process in many articles and books, but not many explain the underlying reason behind it. But now you know!</em></p></blockquote><p>Each of these segments appears as part of a single virtual address space, but at the physical level they may be mapped to different regions of physical memory. Each segment also has the appropriate protection bits set to ensure that the hardware doesn't end up fetching and executing instructions from one of the data segments.</p><p>But creating these segments is not enough; they also need to be populated with data. For this, the operating system loads the program's executable binary into memory, populating the <code>.text</code> segment with the code and the <code>.data</code> segment with the statically initialized data.</p><p>However, to do this efficiently, the assembler and linker must generate the executable in a format that supports fast loading. On Unix-like systems, this format is the Executable and Linkable Format (ELF), and it is designed to support fast loading of program data into process memory. Let&#8217;s see what this format looks like from the inside.</p><h2>Understanding the ELF Executable Format</h2>
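<p>As a small taste of what's inside, here is a minimal sketch (not from the article) of the fixed-size ELF64 file header that the loader reads first. The field layout follows the ELF64 specification; the packed header below describes a hypothetical executable, built purely for illustration:</p>

```python
import struct

# ELF64 header layout (64 bytes), per the ELF64 specification:
# e_ident (16s), e_type (H), e_machine (H), e_version (I),
# e_entry (Q), e_phoff (Q), e_shoff (Q), e_flags (I),
# e_ehsize (H), e_phentsize (H), e_phnum (H), e_shentsize (H),
# e_shnum (H), e_shstrndx (H) -- little-endian on x86-64.
ELF64_EHDR = struct.Struct("<16sHHIQQQIHHHHHH")

# Identification bytes: magic 0x7f 'E' 'L' 'F', class=2 (64-bit),
# data=1 (little-endian), version=1, then padding.
e_ident = b"\x7fELF" + bytes([2, 1, 1, 0]) + b"\x00" * 8

# A hypothetical header for a 64-bit x86-64 executable.
header = ELF64_EHDR.pack(
    e_ident,
    2,          # e_type: ET_EXEC (executable file)
    0x3E,       # e_machine: EM_X86_64
    1,          # e_version
    0x401000,   # e_entry: where the OS points the instruction pointer
    64,         # e_phoff: program headers start right after this header
    0,          # e_shoff: no section headers in this sketch
    0,          # e_flags
    64, 56, 2,  # e_ehsize, e_phentsize, e_phnum
    0, 0, 0,    # e_shentsize, e_shnum, e_shstrndx
)

fields = ELF64_EHDR.unpack(header)
assert fields[0][:4] == b"\x7fELF"   # the magic number identifies an ELF file
print(hex(fields[4]))                # entry point -> 0x401000
```

<p>Real loaders use <code>e_phoff</code> and <code>e_phnum</code> to locate the program headers, which describe exactly which byte ranges of the file map into segments like <code>.text</code> and <code>.data</code>.</p>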
      <p>
          <a href="https://blog.codingconfessions.com/p/the-system-level-foundation-of-assembly">
              Read more
          </a>
      </p>
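<p>As a companion to the memory-layout discussion above: on Linux, the segments of a running process are visible in <code>/proc/&lt;pid&gt;/maps</code>, one mapped region per line, with the protection bits in the second column (read/execute but no write for code, read/write but no execute for data). The parsing helper and sample lines below are illustrative sketches, not output from the article:</p>

```python
# Each /proc/<pid>/maps line looks like:
#   "start-end perms offset dev inode path"
# The perms column shows the protection bits discussed above.
def parse_maps_line(line: str) -> dict:
    parts = line.split(maxsplit=5)
    start, end = (int(x, 16) for x in parts[0].split("-"))
    return {
        "start": start,
        "end": end,
        "perms": parts[1],
        "path": parts[5] if len(parts) > 5 else "",
    }

# Typical (hypothetical) lines for one executable's code and data segments:
text = parse_maps_line("00400000-00401000 r-xp 00001000 08:01 131 /usr/bin/demo")
data = parse_maps_line("00601000-00602000 rw-p 00002000 08:01 131 /usr/bin/demo")

# Code is executable but not writable; data is writable but not executable.
assert "x" in text["perms"] and "w" not in text["perms"]
assert "w" in data["perms"] and "x" not in data["perms"]
print(text["perms"], data["perms"])  # r-xp rw-p
```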
   ]]></content:encoded></item><item><title><![CDATA[Binary Arithmetic and Bitwise Operations for Systems Programming]]></title><description><![CDATA[Understand how computers represent numbers and perform operations at the bit level before diving into assembly]]></description><link>https://blog.codingconfessions.com/p/binary-arithmetic-and-bitwise-operations</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/binary-arithmetic-and-bitwise-operations</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Sat, 12 Apr 2025 05:16:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!EY6-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EY6-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EY6-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png 424w, https://substackcdn.com/image/fetch/$s_!EY6-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png 848w, https://substackcdn.com/image/fetch/$s_!EY6-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png 1272w, 
https://substackcdn.com/image/fetch/$s_!EY6-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EY6-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png" width="714" height="483" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:483,&quot;width&quot;:714,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:479944,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/161089202?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EY6-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png 424w, https://substackcdn.com/image/fetch/$s_!EY6-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png 848w, https://substackcdn.com/image/fetch/$s_!EY6-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png 
1272w, https://substackcdn.com/image/fetch/$s_!EY6-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>In our <a href="https://blog.codingconfessions.com/p/seeing-the-matrix">previous article</a>, we explored how computers work from transistors up to program execution. 
We saw how digital circuits built from logic gates perform calculations using binary data, and how the ALU executes operations on this binary representation.</p><p>Now we&#8217;ll dive deeper into the binary number system. When writing assembly code, you&#8217;ll directly manipulate bits in registers and perform calculations at the bit level, and for that you need to understand exactly how the processor interprets the patterns of 1s and 0s.</p><p>This article covers four key areas:</p><ol><li><p><strong>Number systems</strong>: How binary and hexadecimal work, and why we use them in low-level programming</p></li><li><p><strong>Binary arithmetic</strong>: How computers add and subtract, and detect conditions like overflow</p></li><li><p><strong>Two&#8217;s complement</strong>: How the hardware represents negative numbers</p></li><li><p><strong>Bitwise operations</strong>: Bit manipulation techniques used throughout systems programming</p></li></ol><p>These concepts appear repeatedly in assembly programming, from register manipulation to optimized algorithms. By developing an intuition for binary operations, you&#8217;ll gain deeper insight into how processors work and how to write efficient assembly code.</p><div><hr></div><h4><em>Read it in PDF Form</em></h4><p><em>This article is part of the material I&#8217;m developing for my <a href="https://blog.codingconfessions.com/p/course-launch-hands-on-introduction">X86 assembly course</a>, which I am also putting together in PDF form. 
If you are a paid subscriber you can claim it for free using the discount code in the email header (or ask me for the code).</em> </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://codingconfessions.gumroad.com/l/ychdk&quot;,&quot;text&quot;:&quot;Get PDF&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://codingconfessions.gumroad.com/l/ychdk"><span>Get PDF</span></a></p><div><hr></div><h2>Number Systems: Decimal, Binary, and Hexadecimal</h2><h3>The Decimal System: Our Familiar Base-10</h3><p>The decimal system is so natural to us that we rarely think about why we use it. It has 10 distinct digits (0-9), and the position of each digit represents a power of 10:</p><p>For example, the following figure shows the representation of the decimal number 4,729</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rx5H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d6e229-cd95-4497-9fe7-41f897b6714f_856x65.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rx5H!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d6e229-cd95-4497-9fe7-41f897b6714f_856x65.png 424w, https://substackcdn.com/image/fetch/$s_!rx5H!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d6e229-cd95-4497-9fe7-41f897b6714f_856x65.png 848w, https://substackcdn.com/image/fetch/$s_!rx5H!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d6e229-cd95-4497-9fe7-41f897b6714f_856x65.png 1272w, 
https://substackcdn.com/image/fetch/$s_!rx5H!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d6e229-cd95-4497-9fe7-41f897b6714f_856x65.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rx5H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d6e229-cd95-4497-9fe7-41f897b6714f_856x65.png" width="856" height="65" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/87d6e229-cd95-4497-9fe7-41f897b6714f_856x65.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:65,&quot;width&quot;:856,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7787,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/161089202?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d6e229-cd95-4497-9fe7-41f897b6714f_856x65.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rx5H!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d6e229-cd95-4497-9fe7-41f897b6714f_856x65.png 424w, https://substackcdn.com/image/fetch/$s_!rx5H!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d6e229-cd95-4497-9fe7-41f897b6714f_856x65.png 848w, https://substackcdn.com/image/fetch/$s_!rx5H!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d6e229-cd95-4497-9fe7-41f897b6714f_856x65.png 1272w, 
https://substackcdn.com/image/fetch/$s_!rx5H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d6e229-cd95-4497-9fe7-41f897b6714f_856x65.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Decimal representation of the value 4,729</figcaption></figure></div><p>This system is called &#8220;base-10&#8221; or &#8220;radix-10&#8221; because it uses 10 as its base value for positional notation.</p><h3>The Binary System: Base-2</h3><p>Computers use the binary system (base-2) because digital circuits have two stable states. As we saw in our previous article, transistors function as switches that are either on or off, which naturally maps to 1 and 0. The logic gates built from these transistors process binary data, making binary the native language of all digital hardware.</p><p>In binary, each position represents a power of 2 rather than a power of 10:</p><p>For example, the following figure shows the broken down representation of the binary number <code>1011</code></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rzMv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e134bfb-1afb-4d4b-bc69-b754ef80bc68_884x63.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rzMv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e134bfb-1afb-4d4b-bc69-b754ef80bc68_884x63.png 424w, https://substackcdn.com/image/fetch/$s_!rzMv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e134bfb-1afb-4d4b-bc69-b754ef80bc68_884x63.png 848w, 
https://substackcdn.com/image/fetch/$s_!rzMv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e134bfb-1afb-4d4b-bc69-b754ef80bc68_884x63.png 1272w, https://substackcdn.com/image/fetch/$s_!rzMv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e134bfb-1afb-4d4b-bc69-b754ef80bc68_884x63.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rzMv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e134bfb-1afb-4d4b-bc69-b754ef80bc68_884x63.png" width="884" height="63" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2e134bfb-1afb-4d4b-bc69-b754ef80bc68_884x63.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:63,&quot;width&quot;:884,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Binary representation of the value 1011&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Binary representation of the value 1011" title="Binary representation of the value 1011" srcset="https://substackcdn.com/image/fetch/$s_!rzMv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e134bfb-1afb-4d4b-bc69-b754ef80bc68_884x63.png 424w, https://substackcdn.com/image/fetch/$s_!rzMv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e134bfb-1afb-4d4b-bc69-b754ef80bc68_884x63.png 848w, 
https://substackcdn.com/image/fetch/$s_!rzMv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e134bfb-1afb-4d4b-bc69-b754ef80bc68_884x63.png 1272w, https://substackcdn.com/image/fetch/$s_!rzMv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e134bfb-1afb-4d4b-bc69-b754ef80bc68_884x63.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Binary representation of the value 1011</figcaption></figure></div><p>Which equals 8 + 0 + 2 + 1 = 11 in decimal.</p><p>Converting from decimal to binary involves repeatedly dividing by 2 and tracking the remainders. Let&#8217;s convert 27 to binary:</p><pre><code><code>
27 &#247; 2 = 13 remainder 1 (least significant bit)

13 &#247; 2 = 6 remainder 1

6 &#247; 2 = 3 remainder 0

3 &#247; 2 = 1 remainder 1

1 &#247; 2 = 0 remainder 1 (most significant bit)
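
# The repeated-division procedure above, sketched as a small Python
# helper (illustrative code, not from the article; the built-in bin()
# function performs the same conversion):
def to_binary(n):
    bits = ""
    while n > 0:
        bits = str(n % 2) + bits  # each remainder becomes the next bit
        n //= 2
    return bits or "0"

# to_binary(27) returns "11011"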
</code></code></pre><p>Reading the remainders from bottom to top: 11011, which is indeed 27 in binary.</p><h3>Bit and Byte Terminology</h3><p>When working with binary data, we need precise terminology to refer to specific portions of the data. The following terms are essential in assembly programming:</p><h4>Most Significant Bit (MSB) and Least Significant Bit (LSB)</h4><p>Binary numbers have two "ends" that are particularly important:</p><p><strong>Least Significant Bit (LSB)</strong>: This is the rightmost bit in a binary number. It represents the 2^0 (1) position and contributes the smallest value to the total. The LSB tells us whether the number is odd or even (1 = odd, 0 = even).</p><p><strong>Most Significant Bit (MSB)</strong>: This is the leftmost bit in a binary number. It represents the highest power of 2 in the value and contributes the largest amount to the total.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qhMJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd802eb71-eede-453a-9029-a7bb8f0c1d27_611x201.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qhMJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd802eb71-eede-453a-9029-a7bb8f0c1d27_611x201.png 424w, https://substackcdn.com/image/fetch/$s_!qhMJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd802eb71-eede-453a-9029-a7bb8f0c1d27_611x201.png 848w, https://substackcdn.com/image/fetch/$s_!qhMJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd802eb71-eede-453a-9029-a7bb8f0c1d27_611x201.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qhMJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd802eb71-eede-453a-9029-a7bb8f0c1d27_611x201.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qhMJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd802eb71-eede-453a-9029-a7bb8f0c1d27_611x201.png" width="611" height="201" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d802eb71-eede-453a-9029-a7bb8f0c1d27_611x201.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:201,&quot;width&quot;:611,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:15175,&quot;alt&quot;:&quot;The MSB is the bit at the highest bit position in the value, while the LSB is the bit at the 0th position. The figure highlights them in an 8-bit value.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/161089202?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd802eb71-eede-453a-9029-a7bb8f0c1d27_611x201.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The MSB is the bit at the highest bit position in the value, while the LSB is the bit at the 0th position. The figure highlights them in an 8-bit value." title="The MSB is the bit at the highest bit position in the value, while the LSB is the bit at the 0th position. The figure highlights them in an 8-bit value." 
srcset="https://substackcdn.com/image/fetch/$s_!qhMJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd802eb71-eede-453a-9029-a7bb8f0c1d27_611x201.png 424w, https://substackcdn.com/image/fetch/$s_!qhMJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd802eb71-eede-453a-9029-a7bb8f0c1d27_611x201.png 848w, https://substackcdn.com/image/fetch/$s_!qhMJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd802eb71-eede-453a-9029-a7bb8f0c1d27_611x201.png 1272w, https://substackcdn.com/image/fetch/$s_!qhMJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd802eb71-eede-453a-9029-a7bb8f0c1d27_611x201.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The MSB is the bit at the highest bit position in the value, while the LSB is the bit at the 0th position. 
The figure highlights them in an 8-bit value.</figcaption></figure></div><h4>Most Significant Byte (MSB) and Least Significant Byte (LSB)</h4><p>When working with multi-byte values (like 16-bit, 32-bit, or 64-bit numbers), we also need terminology for the bytes themselves:</p><p><strong>Least Significant Byte (LSB)</strong>: This is the byte containing the least significant bits of a multi-byte value.</p><p><strong>Most Significant Byte (MSB)</strong>: This is the byte containing the most significant bits.</p><p>For example, the 16-bit hexadecimal value 0x4A3F consists of two bytes:</p><p>- The MSB is 0x4A</p><p>- The LSB is 0x3F</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aWYU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5986835-6bd3-40f6-9497-fa2ae79720e1_465x140.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aWYU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5986835-6bd3-40f6-9497-fa2ae79720e1_465x140.png 424w, https://substackcdn.com/image/fetch/$s_!aWYU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5986835-6bd3-40f6-9497-fa2ae79720e1_465x140.png 848w, https://substackcdn.com/image/fetch/$s_!aWYU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5986835-6bd3-40f6-9497-fa2ae79720e1_465x140.png 1272w, https://substackcdn.com/image/fetch/$s_!aWYU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5986835-6bd3-40f6-9497-fa2ae79720e1_465x140.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!aWYU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5986835-6bd3-40f6-9497-fa2ae79720e1_465x140.png" width="465" height="140" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a5986835-6bd3-40f6-9497-fa2ae79720e1_465x140.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:140,&quot;width&quot;:465,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:8440,&quot;alt&quot;:&quot;In multi-byte numbers the MSB is the topmost byte, while the LSB is the bottommost byte. The figure highlights MSB and LSB in a two-byte value&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/161089202?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5986835-6bd3-40f6-9497-fa2ae79720e1_465x140.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="In multi-byte numbers the MSB is the topmost byte, while the LSB is the bottommost byte. The figure highlights MSB and LSB in a two-byte value" title="In multi-byte numbers the MSB is the topmost byte, while the LSB is the bottommost byte. 
The figure highlights MSB and LSB in a two-byte value" srcset="https://substackcdn.com/image/fetch/$s_!aWYU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5986835-6bd3-40f6-9497-fa2ae79720e1_465x140.png 424w, https://substackcdn.com/image/fetch/$s_!aWYU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5986835-6bd3-40f6-9497-fa2ae79720e1_465x140.png 848w, https://substackcdn.com/image/fetch/$s_!aWYU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5986835-6bd3-40f6-9497-fa2ae79720e1_465x140.png 1272w, https://substackcdn.com/image/fetch/$s_!aWYU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5986835-6bd3-40f6-9497-fa2ae79720e1_465x140.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">In multi-byte numbers the MSB is the topmost byte, while the LSB is the bottommost byte. The figure highlights MSB and LSB in a two-byte value</figcaption></figure></div><blockquote><p><strong>Note about byte order</strong>: When multi-byte values are stored in memory, the order of the bytes becomes important. Different computer architectures may store the bytes in different orders (most significant byte first or least significant byte first). This concept, called "endianness," will become relevant when we discuss memory operations in later parts.</p></blockquote><h3>The Hexadecimal System: Base-16</h3><p>Binary representation quickly becomes unwieldy when dealing with larger numbers. A 32-bit value would require 32 binary digits, which is difficult to read and prone to error. This is where the hexadecimal system (base-16) becomes valuable.</p><p>Hexadecimal uses 16 symbols: 0-9 and A-F (where A=10, B=11, &#8230;, F=15). 
Each hexadecimal digit represents exactly 4 binary digits (a &#8220;nibble&#8221;), making conversion between binary and hexadecimal straightforward.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7oar!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de07b41-34a6-474f-908d-73fa625ae409_383x524.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7oar!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de07b41-34a6-474f-908d-73fa625ae409_383x524.png 424w, https://substackcdn.com/image/fetch/$s_!7oar!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de07b41-34a6-474f-908d-73fa625ae409_383x524.png 848w, https://substackcdn.com/image/fetch/$s_!7oar!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de07b41-34a6-474f-908d-73fa625ae409_383x524.png 1272w, https://substackcdn.com/image/fetch/$s_!7oar!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de07b41-34a6-474f-908d-73fa625ae409_383x524.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7oar!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de07b41-34a6-474f-908d-73fa625ae409_383x524.png" width="383" height="524" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1de07b41-34a6-474f-908d-73fa625ae409_383x524.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:524,&quot;width&quot;:383,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:34042,&quot;alt&quot;:&quot;Table showing the binary and hexadecimal representation for numbers from 0 to 15&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/161089202?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de07b41-34a6-474f-908d-73fa625ae409_383x524.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Table showing the binary and hexadecimal representation for numbers from 0 to 15" title="Table showing the binary and hexadecimal representation for numbers from 0 to 15" srcset="https://substackcdn.com/image/fetch/$s_!7oar!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de07b41-34a6-474f-908d-73fa625ae409_383x524.png 424w, https://substackcdn.com/image/fetch/$s_!7oar!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de07b41-34a6-474f-908d-73fa625ae409_383x524.png 848w, https://substackcdn.com/image/fetch/$s_!7oar!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de07b41-34a6-474f-908d-73fa625ae409_383x524.png 1272w, https://substackcdn.com/image/fetch/$s_!7oar!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de07b41-34a6-474f-908d-73fa625ae409_383x524.png 1456w" 
sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Table showing the binary and hexadecimal representation for numbers from 0 to 15</figcaption></figure></div><p>This compact representation makes hexadecimal particularly useful for expressing binary values. For example, the binary number <code>1011010010011110</code> can be more compactly written as <code>0xB49E</code> in hexadecimal. 
The &#8220;<code>0x</code>&#8221; prefix is a common notation indicating a hexadecimal number.</p><p>To convert this back to decimal:</p><ul><li><p>B = 11 &#215; 16&#179; = 11 &#215; 4096 = 45056</p></li><li><p>4 = 4 &#215; 16&#178; = 4 &#215; 256 = 1024</p></li><li><p>9 = 9 &#215; 16&#185; = 9 &#215; 16 = 144</p></li><li><p>E = 14 &#215; 16&#8304; = 14 &#215; 1 = 14</p></li></ul><p>Adding these values: 45056 + 1024 + 144 + 14 = 46238</p><h3>Why These Number Systems Matter in Assembly</h3><p>When working with assembly language, you&#8217;ll constantly use all three number systems:</p><ol><li><p><strong>Binary</strong> is the processor&#8217;s native language. All the data in memory and registers is represented in binary and understanding this representation makes it easier to manipulate it.</p></li><li><p><strong>Decimal</strong> is useful for human-friendly values and calculations.</p></li><li><p><strong>Hexadecimal</strong> serves as the standard representation for memory addresses because it is easier to read than binary.</p></li></ol><p>In assembly, you&#8217;ll typically express values in decimal or hexadecimal:</p><pre><code><code>
10 # Decimal 10

0Ah # Hexadecimal A (decimal 10)

0x0A # Alternative hexadecimal notation
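
# For comparison, Python literals use the same 0x prefix (plus 0b for
# binary), which makes it easy to check conversions such as the 0xB49E
# example above (illustrative Python, not assembly syntax):
assert 0x0A == 0b1010 == 10
assert 0xB49E == 0b1011010010011110 == 46238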
</code></code></pre><div><hr></div><h2>Binary Arithmetic</h2><p>Having explored how numbers are represented in binary, let&#8217;s now look at how computers perform calculations on these binary values. </p><h3>Binary Addition</h3><p>Binary addition follows similar rules to decimal addition, but with only two digits:</p><pre><code>0 + 0 = 0
0 + 1 = 1
1 + 0 = 1
1 + 1 = 0 with a carry of 1</code></pre><p>Let&#8217;s add the binary numbers <code>1011</code> (11 in decimal) and <code>101</code> (5 in decimal):<code> </code></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kRnZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc28f16-e2e0-4d48-b449-7ae8294c696e_289x164.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kRnZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc28f16-e2e0-4d48-b449-7ae8294c696e_289x164.png 424w, https://substackcdn.com/image/fetch/$s_!kRnZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc28f16-e2e0-4d48-b449-7ae8294c696e_289x164.png 848w, https://substackcdn.com/image/fetch/$s_!kRnZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc28f16-e2e0-4d48-b449-7ae8294c696e_289x164.png 1272w, https://substackcdn.com/image/fetch/$s_!kRnZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc28f16-e2e0-4d48-b449-7ae8294c696e_289x164.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kRnZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc28f16-e2e0-4d48-b449-7ae8294c696e_289x164.png" width="289" height="164" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3dc28f16-e2e0-4d48-b449-7ae8294c696e_289x164.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:164,&quot;width&quot;:289,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6001,&quot;alt&quot;:&quot;Binary addition of 1110 and 1011. The result is 10000&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/161089202?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc28f16-e2e0-4d48-b449-7ae8294c696e_289x164.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Binary addition of 1110 and 1011. The result is 10000" title="Binary addition of 1110 and 1011. The result is 10000" srcset="https://substackcdn.com/image/fetch/$s_!kRnZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc28f16-e2e0-4d48-b449-7ae8294c696e_289x164.png 424w, https://substackcdn.com/image/fetch/$s_!kRnZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc28f16-e2e0-4d48-b449-7ae8294c696e_289x164.png 848w, https://substackcdn.com/image/fetch/$s_!kRnZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc28f16-e2e0-4d48-b449-7ae8294c696e_289x164.png 1272w, https://substackcdn.com/image/fetch/$s_!kRnZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc28f16-e2e0-4d48-b449-7ae8294c696e_289x164.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption 
class="image-caption">Binary addition of 1011 and 101. The result is 10000</figcaption></figure></div><p>The result, <code>10000</code>, is 16 in decimal, which is 11 + 5.</p><p>This calculation mirrors exactly what happens in the ALU&#8217;s adder circuit we examined previously. The carry bits generated during this process are physically propagated through the full adder circuits chained together to handle multi-bit addition.</p><h3>The Processor&#8217;s Status Flags</h3><p>To manage the results of operations, processors maintain a set of status flags that indicate various conditions. These flags are stored in a special register called the status register or flags register.</p><p>Four particularly important flags are:</p><ol><li><p><strong>Carry Flag (CF)</strong>: Set when an unsigned arithmetic operation produces a carry or borrow</p></li><li><p><strong>Zero Flag (ZF)</strong>: Set when an operation produces a result of zero</p></li><li><p><strong>Sign Flag (SF)</strong>: Set when an operation produces a negative result (the most significant bit is 1)</p></li><li><p><strong>Overflow Flag (OF)</strong>: Set when a signed arithmetic operation produces a result outside the representable range</p></li></ol><p>These flags are automatically updated after most arithmetic and logical operations. They&#8217;re crucial for implementing control flow in assembly code, as they allow the program to make decisions based on the results of calculations.</p><h3>Understanding the Carry Flag</h3><p>Because the processor&#8217;s registers have a fixed width, arithmetic operations can produce values too large to fit in a register, i.e., they generate a carry out of the most significant bit.</p><p>To track these carries, the processor sets the carry bit in the flags register (in x86-64, the register is named <code>rflags</code>). The carry flag comes in handy in several situations. 
Let&#8217;s discuss these.</p><h4>Detecting Unsigned Overflow</h4><p>The most straightforward use of the carry flag is to detect when an arithmetic result is too large to fit in the available bits - a condition called overflow.</p><p>For example, imagine adding two 8-bit unsigned numbers 242 and 18.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eq1l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d3f834-68a6-4b0c-a47f-0b8282c9eb01_712x403.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eq1l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d3f834-68a6-4b0c-a47f-0b8282c9eb01_712x403.png 424w, https://substackcdn.com/image/fetch/$s_!eq1l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d3f834-68a6-4b0c-a47f-0b8282c9eb01_712x403.png 848w, https://substackcdn.com/image/fetch/$s_!eq1l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d3f834-68a6-4b0c-a47f-0b8282c9eb01_712x403.png 1272w, https://substackcdn.com/image/fetch/$s_!eq1l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d3f834-68a6-4b0c-a47f-0b8282c9eb01_712x403.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eq1l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d3f834-68a6-4b0c-a47f-0b8282c9eb01_712x403.png" width="712" height="403" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d4d3f834-68a6-4b0c-a47f-0b8282c9eb01_712x403.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:403,&quot;width&quot;:712,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:40357,&quot;alt&quot;:&quot;An example showing addition of two 8-bit values with a carry bit. The carry bit will result in the hardware setting the carry flag in the flags register to indicate an overflow&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/161089202?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d3f834-68a6-4b0c-a47f-0b8282c9eb01_712x403.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="An example showing addition of two 8-bit values with a carry bit. The carry bit will result in the hardware setting the carry flag in the flags register to indicate an overflow" title="An example showing addition of two 8-bit values with a carry bit. 
The carry bit will result in the hardware setting the carry flag in the flags register to indicate an overflow" srcset="https://substackcdn.com/image/fetch/$s_!eq1l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d3f834-68a6-4b0c-a47f-0b8282c9eb01_712x403.png 424w, https://substackcdn.com/image/fetch/$s_!eq1l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d3f834-68a6-4b0c-a47f-0b8282c9eb01_712x403.png 848w, https://substackcdn.com/image/fetch/$s_!eq1l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d3f834-68a6-4b0c-a47f-0b8282c9eb01_712x403.png 1272w, https://substackcdn.com/image/fetch/$s_!eq1l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d3f834-68a6-4b0c-a47f-0b8282c9eb01_712x403.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg 
xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">An example showing addition of two 8-bit values with a carry bit. The carry bit will result in the hardware setting the carry flag in the flags register to indicate an overflow</figcaption></figure></div><p>The result, 260, doesn&#8217;t fit in 8 bits (which can only represent values from 0 to 255). The &#8220;1&#8221; at the left falls outside our 8-bit range. The processor sets the carry flag to indicate this overflow condition.</p><p>Why is this important? In real programs, if you don&#8217;t detect overflow, your calculations will silently produce incorrect results:</p><ul><li><p>The actual stored result would be just <code>00000100</code> (4 in decimal)</p></li><li><p>Your program would continue using this wrong value (4) instead of the correct result (260)</p></li></ul><p>Consider an accounting program that adds large financial values: undetected overflow could cause funds to &#8220;disappear&#8221;!</p><h4>Multi-Precision Arithmetic</h4><p>&#8220;Multi-precision arithmetic&#8221; simply means working with numbers that are larger than what fits in a single register.</p><p>For example, let&#8217;s say we&#8217;re using an 8-bit processor but need to add two 16-bit numbers. 
We&#8217;d need to:</p><ol><li><p>Add the lower 8 bits of both numbers</p></li><li><p>Add the upper 8 bits of both numbers</p></li><li><p>Account for any carry from the first addition</p></li></ol><p>Here&#8217;s how it works, adding <code>1000</code> (<code>0x03E8</code>) <code>+</code> <code>2000</code> (<code>0x07D0</code>):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8R20!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1555b2c8-bdd5-4545-ba6b-e3dd3d259bf4_558x620.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8R20!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1555b2c8-bdd5-4545-ba6b-e3dd3d259bf4_558x620.png 424w, https://substackcdn.com/image/fetch/$s_!8R20!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1555b2c8-bdd5-4545-ba6b-e3dd3d259bf4_558x620.png 848w, https://substackcdn.com/image/fetch/$s_!8R20!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1555b2c8-bdd5-4545-ba6b-e3dd3d259bf4_558x620.png 1272w, https://substackcdn.com/image/fetch/$s_!8R20!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1555b2c8-bdd5-4545-ba6b-e3dd3d259bf4_558x620.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8R20!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1555b2c8-bdd5-4545-ba6b-e3dd3d259bf4_558x620.png" width="558" height="620" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1555b2c8-bdd5-4545-ba6b-e3dd3d259bf4_558x620.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:620,&quot;width&quot;:558,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:60767,&quot;alt&quot;:&quot;An example of addition of multi-precision arithmetic using the carry flag. The addition of two 16-bit values is broken down into two parts. First the lower 8 bits are added, then the upper 8-bits are added along with the carry from the previous step.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/161089202?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1555b2c8-bdd5-4545-ba6b-e3dd3d259bf4_558x620.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="An example of addition of multi-precision arithmetic using the carry flag. The addition of two 16-bit values is broken down into two parts. First the lower 8 bits are added, then the upper 8-bits are added along with the carry from the previous step." title="An example of addition of multi-precision arithmetic using the carry flag. The addition of two 16-bit values is broken down into two parts. First the lower 8 bits are added, then the upper 8-bits are added along with the carry from the previous step." 
srcset="https://substackcdn.com/image/fetch/$s_!8R20!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1555b2c8-bdd5-4545-ba6b-e3dd3d259bf4_558x620.png 424w, https://substackcdn.com/image/fetch/$s_!8R20!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1555b2c8-bdd5-4545-ba6b-e3dd3d259bf4_558x620.png 848w, https://substackcdn.com/image/fetch/$s_!8R20!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1555b2c8-bdd5-4545-ba6b-e3dd3d259bf4_558x620.png 1272w, https://substackcdn.com/image/fetch/$s_!8R20!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1555b2c8-bdd5-4545-ba6b-e3dd3d259bf4_558x620.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">An example of multi-precision addition using the carry flag. The addition of two 16-bit values is broken into two steps: first the lower 8 bits are added, then the upper 8 bits are added along with the carry from the previous step.</figcaption></figure></div><p>The result is <code>0x0BB8</code>, which is 3000 in decimal. Without tracking the carry from the first addition, we would get <code>0x0AB8</code> (2744), which is wrong.</p><p>Assembly languages provide special instructions for these operations. For example, in x86, the <code>adc</code> (add with carry) instruction adds two values plus the carry flag, making multi-precision arithmetic possible.</p><h4>The Foundation of Comparison Operations</h4><p>The carry flag is also used for comparing unsigned values. When the processor compares two values, it subtracts them and sets the flags based on the result, without storing the difference.</p><p>For example, when comparing unsigned values A and B:</p><ul><li><p>If A &lt; B, the subtraction A - B requires a borrow, setting the carry flag</p></li><li><p>If A &gt; B, no borrow is needed, clearing the carry flag</p></li><li><p>If A == B, the zero flag is set, indicating the values are equal</p></li></ul><p>By checking the carry flag, we can determine the outcome of the comparison and execute the appropriate code. 
When we learn about implementing conditional flow (if conditions) in assembly, we will see this in action.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://codingconfessions.gumroad.com/l/ychdk&quot;,&quot;text&quot;:&quot;Get PDF&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://codingconfessions.gumroad.com/l/ychdk"><span>Get PDF</span></a></p>
      <p>
          <a href="https://blog.codingconfessions.com/p/binary-arithmetic-and-bitwise-operations">
              Read more
          </a>
      </p>
   ]]></content:encoded></item></channel></rss>