Context Switching and Performance: What Every Developer Should Know
Understand how context switching affects CPU registers, caches, TLB, and pipeline performance, and learn strategies to mitigate performance penalties
Context switching is one of the more expensive operations performed by the operating system kernel, and it can kill the performance of many systems. It is a necessary evil on a busy system: it keeps the system responsive and allows all processes to make progress. But what makes it so expensive? This article decodes the hardware and software dynamics underlying context switching.
Before discussing the cost of context switching, we should first examine the components of a process’s context, as they ultimately affect application performance due to context switching, either directly or indirectly.
Components of a Process’ Context
The CPU executes a process by fetching and executing its instructions. To know which instruction to execute next, it uses an instruction pointer register, which holds the address of the next instruction.
The instructions themselves need data to operate on. That data must be in a register or in the cache, which means it first has to be brought into the CPU from main memory.
There is a whole hierarchy of memory in the system, and its levels interact with each other in intricate ways that affect the performance of a process both while it executes and during a context switch. The following diagram shows the architecture of a multicore system, highlighting this memory hierarchy.
Out of all these levels of memory, only the two extreme ones form part of a process’ context: the registers and the main memory. The reason is that if the data in either of those levels is lost, it cannot be reconstructed, whereas any of the caches, such as L1/L2/L3 or the TLB, can always be refilled.
Let’s talk about registers and main memory in a bit more detail.
Registers
Most instructions supported by the X86 processor use at least one register as their operand. Often that operand data needs to be brought from memory into the register before the instruction can be executed.
Once a piece of data arrives into a register, a whole sequence of instructions might work with it. For instance, the current instruction might perform an operation to produce an intermediate result which sits in another (or the same) register. The subsequent instructions may work with that intermediate result. This sequence can repeat for a while until the final result is written back into memory.
As you can see, at any point in time, the data in these registers reflects the state of the process. If the process were to be context switched in the middle of execution, the state of the registers is critical for its resumption later on. Because of this, the registers form one of the key parts of a process’ context.
Main Memory
Finally, we have the last level of the memory hierarchy: the main memory. Every process has its own virtual address space. This address space is organized in the form of virtual pages. Every address that a process refers to lies in one of these pages.
These virtual pages usually have a corresponding physical memory page frame (but not always, see the note below), where the data is actually stored. This mapping from virtual pages to physical page frames is organized in the form of page tables.
To be able to address a large amount of memory, the page tables need to be several levels deep. On modern Linux they can be 4-5 levels deep.
The diagram below shows the structure of a two-level page table. The root of the page table is called the page directory. The entries in it index into the next level. The final level of the page table acts as an index into the physical memory.
The hardware needs to know where the page table lives in the physical memory so that when a request for a specific virtual address arrives, it knows how to get the corresponding physical address.
For this, the kernel stores the physical address of the page directory in the CR3 register (on X86 hardware). As every process has its own page table, during a context switch the kernel saves the page table address of the outgoing process and loads the page table address of the new process into CR3. The page table thus forms another crucial part of a process’ context.
Note: It is not always the case that a virtual page is mapped to physical memory. Many times the pages are mapped on demand, and sometimes, when using memory-mapped files, the pages are backed by file data on the disk.
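To make the multi-level structure concrete, here is a minimal sketch of how the hardware splits a virtual address under the common 4-level, 48-bit X86-64 layout: 9 index bits per level plus a 12-bit offset into the 4 KiB page. The address used is just an illustrative value:

```c
/* Sketch: decompose a virtual address under 4-level, 48-bit x86-64 paging.
 * Not kernel code; just illustrates which bits index which level. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t vaddr = 0x00007f1234567abcULL;  /* hypothetical user-space address */

    uint64_t pgd = (vaddr >> 39) & 0x1ff;    /* page global directory index */
    uint64_t pud = (vaddr >> 30) & 0x1ff;    /* page upper directory index  */
    uint64_t pmd = (vaddr >> 21) & 0x1ff;    /* page middle directory index */
    uint64_t pte = (vaddr >> 12) & 0x1ff;    /* page table entry index      */
    uint64_t off = vaddr & 0xfff;            /* byte offset within the page */

    printf("PGD=%" PRIu64 " PUD=%" PRIu64 " PMD=%" PRIu64
           " PTE=%" PRIu64 " offset=0x%" PRIx64 "\n",
           pgd, pud, pmd, pte, off);
    return 0;
}
```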
It is clear that there are two key pieces to a process’ hardware context: its register state and its page table, which need to be saved or switched during a context switch. Now let’s talk about when and how this context switch happens.
The Process of Context Switching
The kernel usually performs a context switch when the current process has consumed its allotted slice of CPU time. Apart from that, the kernel may also context switch the process when it performs a blocking operation that may take a long time to finish. These are not the only scenarios in which a context switch happens, but they are two of the most common ones.
The exact mechanics involved in saving the context are quite convoluted and need their own article, so we will not get into that part today. But the following code from the Linux kernel shows what happens during a context switch.
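Below is a heavily abridged sketch of context_switch() from kernel/sched/core.c. The real function also handles lazy-TLB kernel threads and other architecture-specific details omitted here, and the exact code differs across kernel versions, so treat this as an outline rather than verbatim source:

```c
/*
 * Abridged sketch of context_switch() (kernel/sched/core.c).
 * Error paths, lazy-TLB handling for kernel threads, and many
 * architecture-specific details are omitted; not verbatim kernel code.
 */
static struct rq *context_switch(struct rq *rq, struct task_struct *prev,
                                 struct task_struct *next, struct rq_flags *rf)
{
    prepare_task_switch(rq, prev, next);

    /*
     * Step 1: switch the address space. On x86 this ultimately loads the
     * new task's page table root into CR3.
     */
    switch_mm_irqs_off(prev->active_mm, next->mm, next);

    /*
     * Step 2: switch the CPU state. switch_to() saves the outgoing task's
     * registers on its kernel stack, switches kernel stacks, and restores
     * the incoming task's registers.
     */
    switch_to(prev, next, prev);
    barrier();

    return finish_task_switch(prev);
}
```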
As you can see in the code, the kernel performs the context switch in two steps:
Page table switch: Switches the CR3 register value to point to the new task’s page table
Saving and restoring registers: Saves the current task’s registers on the kernel stack, switches to the new task’s kernel stack, and finally restores the new task’s registers
The actual low-level implementation details behind a context switch are very interesting; we will discuss those in another article. But if you are interested in learning on your own, see chapter 3 of Understanding the Linux Kernel, 3rd Edition.
With all of that background behind context switching, we are ready to discuss its impact on the performance of applications.
Performance Cost of Context Switching
The overall impact of context switching is hard to quantify exactly because it depends on the specific interaction between the hardware and the software. For instance, one widely shared infographic puts its cost in a very large bracket of 10,000 to 1 million CPU cycles.
When we want to talk about the cost of context switching, there is a direct cost and an indirect cost associated with it.
The direct cost is simply the amount of time spent doing the context switch itself. This includes the time the kernel takes to find the next runnable process, save the state of the current process, and finally restore the state of the new process.
This direct cost is fairly constant and doesn’t change drastically from one context switch to another. A few studies have measured it to be around 1000-2000 CPU cycles (see references [1] and [2] in the References section), but those studies are quite old and the numbers on newer hardware and kernels may be somewhat different. The major point is that this is not the most expensive part of a context switch.
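A rough way to see this direct cost on your own machine is a pipe ping-pong between two processes pinned to the same core, so that every message forces a context switch. The sketch below measures the round-trip time and divides by two; it includes pipe and syscall overhead, so treat the result as an upper bound on the per-switch cost (error handling is omitted for brevity):

```c
/* Rough estimate of the direct cost of a context switch.
 * Two processes pinned to the same CPU bounce a byte back and forth;
 * each round trip involves (at least) two context switches. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    int p2c[2], c2p[2];
    const long iters = 100000;
    char buf = 'x';

    /* Pin everything to CPU 0 so the two processes must alternate. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    sched_setaffinity(0, sizeof(set), &set);

    pipe(p2c);
    pipe(c2p);

    if (fork() == 0) {                 /* child: echo every byte back */
        for (long i = 0; i < iters; i++) {
            read(p2c[0], &buf, 1);
            write(c2p[1], &buf, 1);
        }
        _exit(0);
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) { /* parent: ping, wait for pong */
        write(p2c[1], &buf, 1);
        read(c2p[0], &buf, 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.0f ns per context switch (upper bound)\n", ns / (iters * 2.0));
    return 0;
}
```

The perf tool ships a very similar benchmark as perf bench sched pipe, if you prefer not to write your own.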
A context switch also disturbs the state of various hardware resources, which ends up hurting the performance of the process indirectly. This is the indirect cost of context switching. Let’s understand these indirect costs by going through each hardware component that is affected by context switches.
Performance Impact Due to TLB Contention
The instructions in a process refer to data in main memory by virtual addresses, but the processor needs the corresponding physical addresses in order to retrieve the data from physical memory. For this it must translate each virtual address into a physical address by walking the page table, which can be very expensive: in the worst case it may involve 4-5 memory accesses and hundreds of cycles overall. To save this cost, each core has a small cache of recently translated addresses, called the translation lookaside buffer (TLB).
Because the TLB contains data related to the virtual memory of the current process, on some systems it may need to be flushed during context switches.
However, these days most processors (such as ARMv7+, ARM64, and X86-64) support address space identifiers (ASIDs) or process context identifiers (PCIDs): a unique identifier associated with each process. The TLB entries are tagged with the PCID of the owning process, which ensures that one process cannot use the cached translations of another. This way, the TLB can hold entries for multiple processes at the same time and does not need to be flushed during a context switch. (A quick way to check for this support on your own machine is shown below.)
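On X86-64, one way to check whether the processor advertises PCID (and the related INVPCID instruction) is to query CPUID. The sketch below uses the GCC/Clang-specific helpers from <cpuid.h>; the bit positions come from Intel’s CPUID documentation:

```c
/* Check for PCID and INVPCID support on x86-64 via CPUID (GCC/Clang). */
#include <cpuid.h>
#include <stdio.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;

    /* CPUID leaf 1: ECX bit 17 indicates PCID support. */
    __get_cpuid(1, &eax, &ebx, &ecx, &edx);
    int pcid = (ecx >> 17) & 1;

    /* CPUID leaf 7, subleaf 0: EBX bit 10 indicates INVPCID support. */
    __get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx);
    int invpcid = (ebx >> 10) & 1;

    printf("PCID: %s, INVPCID: %s\n", pcid ? "yes" : "no", invpcid ? "yes" : "no");
    return 0;
}
```

On Linux, grepping for the pcid flag in /proc/cpuinfo reports the same information.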
That said, the TLB cache entries of the old process may still get evicted as a result of contention for space from other processes.
The cost of a TLB miss is very non-deterministic. It can be as small as 10-20 CPU cycles (e.g. if the L2 TLB has the translation), or it can take hundreds of cycles if the hardware needs to perform a page table walk (this Stack Overflow answer provides many pointers on the cost of a TLB miss).
The bottom line is that TLB misses can be very problematic for performance, and processes with large address spaces may want to avoid paying this penalty.
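One common way to reduce this penalty for processes with large address spaces is to use larger pages, so that each TLB entry covers more memory. Below is a minimal sketch that asks the kernel to back a large allocation with transparent huge pages via madvise(MADV_HUGEPAGE); this is only a hint, and whether it takes effect depends on how THP is configured on the system:

```c
/* Request transparent huge pages for a large allocation to reduce TLB misses.
 * MADV_HUGEPAGE is a hint; the kernel may or may not back the range with
 * 2 MiB pages depending on the system's THP configuration. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 1UL << 30;   /* 1 GiB working set */

    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Hint that this range should be backed by huge pages (2 MiB on x86-64),
     * cutting the number of TLB entries needed by a factor of 512. */
    if (madvise(buf, len, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");

    memset(buf, 0, len);      /* touch the memory so pages are actually allocated */
    munmap(buf, len);
    return 0;
}
```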
Performance Impact Due to Cache Contention
The CPU caches are critical for the performance of any software. The latency of an L1 cache hit is 3-4 CPU cycles, while the latency of reading from the main memory is ~200 CPU cycles, i.e. ~50x slower.
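This gap is easy to observe with a dependent-load (pointer-chasing) microbenchmark: each load depends on the previous one, so the loop time reflects the latency of wherever the data happens to live. The buffer sizes and step count below are arbitrary illustrative choices, and the absolute numbers will vary from machine to machine:

```c
/* Rough memory-latency probe: chase pointers through a random cycle so each
 * load depends on the previous one. A buffer that fits in L1 shows a few ns
 * per load; a buffer far larger than the last-level cache approaches DRAM
 * latency. Numbers are machine dependent; this is only a sketch. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double ns_per_load(size_t n, long steps) {
    size_t *next = malloc(n * sizeof(size_t));
    for (size_t i = 0; i < n; i++)
        next[i] = i;

    /* Sattolo's algorithm: build a single random cycle covering all elements,
     * so the chase really walks the whole buffer. */
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    struct timespec t0, t1;
    volatile size_t idx = 0;               /* volatile keeps the loop honest */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long s = 0; s < steps; s++)
        idx = next[idx];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    free(next);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / steps;
}

int main(void) {
    long steps = 20 * 1000 * 1000;
    /* 16 KiB fits comfortably in L1; 512 MiB is far larger than any cache. */
    printf("16 KiB buffer:  %.1f ns/load\n",
           ns_per_load(16 * 1024 / sizeof(size_t), steps));
    printf("512 MiB buffer: %.1f ns/load\n",
           ns_per_load(512UL * 1024 * 1024 / sizeof(size_t), steps));
    return 0;
}
```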
These caches contain data belonging to the address space of a specific process. When another process runs on the CPU, we don’t want it to be able to read the previous process’ memory, so the caches may need to be flushed. Whether that is necessary depends on the type of cache used by the hardware. There are four possible types of caches, depending on how they index and tag the data.
The index determines how the hardware locates data in the cache. Typically, the hardware maps the memory address being looked up to one of the cache line entries using a simple address mapping function that extracts certain bits from the address. The mapping can be based on either the virtual address or the physical address of the data.
A tag is a portion of the memory address stored alongside the cached data to help uniquely identify it. The hardware uses the tag to verify that the cache line contains the data requested by the process. Since multiple memory addresses can map to the same cache line, the tag helps resolve these ambiguities. Depending on the cache design, the tag may be derived from either the virtual address (requiring flushing during context switches) or the physical address (which often reduces the need for flushing).
Based on the various combinations of the index and tags, there are four types of caches which are possible and their characteristics define whether flushing is necessary after a context switch.
Physically Indexed, Physically Tagged (PIPT): These caches are indexed using the physical address and therefore require an address translation for each read or write, which makes them relatively slow. However, they do not require flushing during context switches, because physical addresses unambiguously identify the data regardless of which process’ virtual address space it belongs to.
Virtually Indexed, Virtually Tagged (VIVT): These caches are much faster because reading or writing them does not require address translation. However, during context switches they may require flushing to prevent the new process from reading the previous process’ memory. A solution to avoid flushing is to tag the cached data with the PCID of the process.
Virtually Indexed, Physically Tagged (VIPT): This is the most commonly used cache type in modern hardware. Because they are physically tagged, an address translation is still required for validating the cache entry, but the hardware can issue the cache lookup and address translation requests in parallel. Because of physical tagging, these caches do not require flushing during context switches.
Physically Indexed, Virtually Tagged (PIVT): These kinds of caches are not used in real-world hardware, so we will not discuss them here.
Long story short, on modern processors the CPU data cache is usually not flushed during a context switch (assuming VIPT-type caches). However, CPU caches are quite small, so some or all of the cache lines belonging to the previous process may have been evicted by the time it is scheduled back on the CPU, which can have a huge impact on its performance.
Performance Impact Due to CPU Pipeline Flush
Although the TLB and cache flushes during a context switch can often be avoided on modern systems, one unavoidable event is the flushing of the CPU pipeline.
A pipelined processor splits the execution of a single instruction into several stages, much like the assembly line in a car manufacturing factory. In each cycle, an instruction moves to the next stage, making space for a new instruction in the first stage. This way the processor can continue to issue one new instruction every cycle and when the pipeline is full, it can also retire one instruction every cycle. A full pipeline reflects the optimum usage of the available execution resources on the CPU.
After a context switch, the process incurs a performance penalty due to the pipeline flush. The time required to refill the pipeline depends on the processor architecture and pipeline depth, typically ranging from 10 to 50 cycles on modern CPUs. During this period the CPU operates below its peak efficiency, because instructions cannot be retired at full throughput. Pipeline flushes are thus a hidden cost of context switching, and in environments with very frequent switches the cumulative overhead can significantly impact overall system performance.
Performance Impact Due to Branch Predictor State
Modern CPUs achieve high performance by leveraging instruction-level parallelism and out-of-order execution to process multiple instructions simultaneously. However, this introduces challenges when executing conditional branch instructions. When the outcome of a branch condition is still being computed by another instruction, the CPU must decide whether to execute one of the possible paths speculatively.
To keep the pipeline full and avoid stalling, CPUs use branch predictors, which estimate the branch outcome based on historical patterns of the specific branch instruction.
Correct Prediction: If the branch predictor guesses correctly, the CPU continues executing instructions at a high throughput.
Misprediction: If the prediction is wrong, the CPU must discard the speculatively executed instructions by flushing the pipeline. The pipeline is then refilled with instructions from the correct branch target, causing a performance penalty similar to a pipeline flush.
Branch prediction units use structures like the Branch Target Buffer (BTB) and Branch History Buffer (BHB) to track the branching history of instructions. These structures have limited capacity, meaning they can only maintain branching pattern information for a finite number of instructions.
During a context switch:
Branch Predictor State: The predictor state is not explicitly flushed, but the new process gradually overwrites the history of the previous process as it executes its own branches.
Impact on Performance: When the old process resumes, its branch predictor history may have been partially or fully replaced. This can result in a higher rate of branch mispredictions initially, as the predictor must relearn the branching patterns of the resumed process. This rebuilding phase can degrade performance until the predictor adapts again.
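The cost of mispredictions is easy to observe even without context switches. The classic sorted-versus-unsorted experiment below runs the same branchy loop over random bytes before and after sorting them; once the data is sorted, the branch becomes almost perfectly predictable and the loop runs several times faster. Build it with a low optimization level (e.g. -O1), since at higher levels the compiler may replace the branch with a conditional move and hide the effect:

```c
/* Branchy sum over random bytes: with sorted data the branch is predictable
 * and the loop runs much faster than with the same data unsorted. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int cmp(const void *a, const void *b) {
    return *(const unsigned char *)a - *(const unsigned char *)b;
}

static double sum_ms(const unsigned char *data, size_t n, long *out) {
    struct timespec t0, t1;
    long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int pass = 0; pass < 100; pass++)
        for (size_t i = 0; i < n; i++)
            if (data[i] >= 128)          /* hard to predict on random data */
                sum += data[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    *out = sum;
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}

int main(void) {
    size_t n = 1 << 20;
    unsigned char *data = malloc(n);
    for (size_t i = 0; i < n; i++)
        data[i] = rand() & 0xff;

    long s;
    printf("unsorted: %.1f ms\n", sum_ms(data, n, &s));
    qsort(data, n, 1, cmp);              /* sorting makes the branch predictable */
    printf("sorted:   %.1f ms\n", sum_ms(data, n, &s));
    free(data);
    return 0;
}
```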
Speculative Execution Vulnerabilities and their Impact on Context Switching
Before wrapping up this article, it's important to discuss speculative execution vulnerabilities, as they are a hot topic these days. Over recent years, we've seen many of these vulnerabilities emerge in modern processors, with Spectre and Meltdown being among the most notable. These vulnerabilities exploit the processor's speculative execution capabilities—like branch prediction and speculative data prefetching—to access out-of-bounds data temporarily.
During context switches, the hardware might inadvertently leave behind speculative execution results in caches or buffers. With some clever engineering, attackers could use this to peek into another process’s memory, potentially grabbing sensitive data like private keys or passwords.
To mitigate these attacks, it became essential that one process not be able to access another process’ memory under any circumstances. This led to kernel-level mitigations that disabled or flushed the speculative execution hardware during context switches, including TLB flushes, L1 data cache flushes, and branch predictor buffer flushes.
However, the necessity for such measures can vary. Over time, processor manufacturers have introduced microcode updates and hardware improvements that reduce the need for these flushes, especially during regular process context switches.
Here's how it typically stands during process switches in the Linux kernel:
Branch Predictor Buffers: Even though the branch predictor buffers were one of the primary mechanisms behind these attacks and the mitigations require them to be flushed, doing so has a very high impact on the overall performance of the system. As a result, the Linux kernel has made it a configurable setting. By default these buffers are not flushed, but the flush can be turned on for systems where untrusted processes may run.
TLB Flushes: Only required if the processor lacks support for address space identifiers (ASIDs). Other circumstances may trigger TLB flushes, but these are unrelated to speculative execution mitigations.
L1 Data Cache: Like the branch predictor buffers, the L1 data cache isn't typically flushed during context switches because of the high performance cost. On systems running untrusted code, the kernel can be configured to perform a flush during context switches.
As always, these things keep changing as new vulnerabilities are found which require new mitigations.
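On Linux, you can see which of these mitigations the kernel believes are active by reading the files under /sys/devices/system/cpu/vulnerabilities/. Below is a small sketch that dumps their contents (equivalent to running grep over that directory):

```c
/* Print the kernel's reported speculative-execution mitigation status. */
#include <dirent.h>
#include <stdio.h>

int main(void) {
    const char *dir = "/sys/devices/system/cpu/vulnerabilities";
    DIR *d = opendir(dir);
    if (!d) {
        perror("opendir");
        return 1;
    }

    struct dirent *e;
    char path[512], line[512];
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.')
            continue;
        snprintf(path, sizeof(path), "%s/%s", dir, e->d_name);
        FILE *f = fopen(path, "r");
        if (!f)
            continue;
        if (fgets(line, sizeof(line), f))
            printf("%-24s %s", e->d_name, line);   /* e.g. spectre_v2: Mitigation: ... */
        fclose(f);
    }
    closedir(d);
    return 0;
}
```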
Final Thoughts
Context switching is a necessary evil in an operating system: it is what provides a high-throughput and responsive system. But it can have a dreadful impact on the performance of the processes involved.
The performance costs of context switching are non-deterministic and depend on the complex dynamics between the different hardware components and the software. In systems where low tail latencies are critical, this non-determinism can become problematic.
A few ways to prevent a critical process from being context switched are to pin it to a specific CPU, ensure nothing else runs on that core, and set a high priority for that process.
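Here is a minimal sketch of how a process can apply both ideas on Linux: pin itself to one core with sched_setaffinity() and move itself into a real-time scheduling class with sched_setscheduler(). The CPU number and priority value are arbitrary illustrative choices, the real-time class requires the appropriate privileges (CAP_SYS_NICE), and in practice you would also isolate the chosen core (e.g. with isolcpus or cpusets) so that nothing else is scheduled on it:

```c
/* Pin the current process to one CPU and give it a real-time priority so the
 * scheduler is far less likely to context switch it out. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    /* Pin to CPU 3 (an arbitrary choice for illustration). */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    /* Switch to the SCHED_FIFO real-time class with a high priority. */
    struct sched_param sp = { .sched_priority = 50 };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        perror("sched_setscheduler");

    /* ... latency-critical work runs here ... */
    return 0;
}
```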
Another alternative is to use user-space threads and a user-space scheduler. For instance, the Golang runtime multiplexes many goroutines onto a small number of OS threads and schedules the goroutines in user space. A user-space context switch is typically cheaper than a kernel context switch and, more importantly, far more deterministic, which plays well in systems where a consistent tail latency is desired.
Summary Notes
The hardware context of a process that needs to be saved and restored during a context switch mainly consists of its register state and its address space (i.e., the page tables).
Context switching can happen for many reasons but the most common ones are due to the process exhausting its allotted CPU time, or the process blocking itself.
There are two performance costs associated with a context switch. The direct cost is basically the amount of work required to perform the context switch. The indirect costs are due to the aftermath of context switch and the lost state of the process in the CPU caches, instruction pipeline, branch predictor and TLB cache.
A TLB flush is usually not required during a context switch because most modern processors support a unique per-process identifier, called the address space identifier (ASID) or process context identifier (PCID), which is used to tag TLB entries. However, the previous process’ TLB entries may still get evicted due to contention for space in the TLB, which can have a drastic impact on its performance.
CPU caches also usually do not need to be flushed on modern hardware because most of them are virtually indexed, physically tagged. However, the cache lines of the previous process may get evicted due to contention for space, again hampering its performance.
The CPU pipeline does get flushed during a context switch, and the process has to fill it up from scratch, which takes several cycles to reach peak throughput, resulting in underutilization of processor resources and reduced performance.
The branch predictor state is not flushed during a context switch, but the history of the previous process may get lost due to contention for space. As a result, the old process may encounter many branch mispredictions while its branch history is being rebuilt.
References