Context Switching and Performance: What Every Developer Should Know
Understand how context switching affects CPU registers, caches, TLB, and pipeline performance, and learn strategies to mitigate performance penalties
Context switching is one of the more expensive operations performed by the operating system kernel, and it can kill the performance of many systems. It is a necessary evil on a busy system: it keeps the system responsive and allows all processes to make progress. But what makes it so expensive? This article decodes the hardware and software dynamics underlying context switching.
Before discussing the cost of context switching, we should first examine the components of a process’s context, as they ultimately affect application performance due to context switching, either directly or indirectly.
Components of a Process’ Context
The CPU executes a process by fetching and executing its instructions. To know which instruction to execute next, it uses an instruction pointer register, which holds the address of the next instruction.
The instructions themselves need data to operate on. That data must be in a register or in the cache, which means it first has to be brought into the CPU from main memory.
There is a whole hierarchy of memory in the system, and its levels interact with each other in intricate ways that affect the performance of a process both while it executes and during a context switch. The following diagram shows the architecture of a multicore system, highlighting this memory hierarchy.
Out of all these levels of memory, only the two extreme ones form part of a process’ context: the registers and the main memory. The reason is that if the data in either of those levels is lost, it cannot be reconstructed, whereas any of the caches, such as L1/L2/L3 or the TLB, can always be refilled.
Let’s talk about registers and main memory in a bit more detail.
Registers
Most instructions supported by the X86 processor use at least one register as their operand. Often that operand data needs to be brought from memory into the register before the instruction can be executed.
Once a piece of data arrives into a register, a whole sequence of instructions might work with it. For instance, the current instruction might perform an operation to produce an intermediate result which sits in another (or the same) register. The subsequent instructions may work with that intermediate result. This sequence can repeat for a while until the final result is written back into memory.
As you can see, at any point in time, the data in these registers reflects the state of the process. If the process were to be context switched in the middle of execution, the state of the registers is critical for its resumption later on. Because of this, the registers form one of the key parts of a process’ context.
Main Memory
Finally, we have the last level of the memory hierarchy: the main memory. Every process has its own virtual address space. This address space is organized in the form of virtual pages. Every address that a process refers to lies in one of these pages.
These virtual pages usually have a corresponding physical memory page frame (but not always, see the note below), where the data is actually stored. This mapping from virtual pages to physical page frames is organized in the form of page tables.
To be able to address a large amount of memory, the page tables need to be several levels deep. On modern Linux they can be 4-5 levels deep.
The diagram below shows the structure of a two-level page table. The root of the page table is called the page directory. The entries in it index into the next level. The final level of the page table acts as an index into the physical memory.
The hardware needs to know where the page table lives in the physical memory so that when a request for a specific virtual address arrives, it knows how to get the corresponding physical address.
For this, the kernel stores the physical address of the page directory in the CR3 register (on X86 hardware). As every process has its own page table, during a context switch the kernel saves the page table address of the outgoing process and loads the page table address of the new process into CR3. The page table thus forms another crucial part of a process’ context.
Note: It is not always the case that a virtual page is mapped to physical memory. Many times the pages are mapped on demand, and sometimes, when using memory-mapped files, the pages are backed by file data on the disk.
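To make the multi-level structure concrete, here is a minimal sketch of how the hardware splits a virtual address under the common 4-level, 48-bit X86-64 layout: 9 index bits per level plus a 12-bit offset into the 4 KiB page. The address used is just an illustrative value:

```c
/* Sketch: decompose a virtual address under 4-level, 48-bit x86-64 paging.
 * Not kernel code; just illustrates which bits index which level. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t vaddr = 0x00007f1234567abcULL;  /* hypothetical user-space address */

    uint64_t pgd = (vaddr >> 39) & 0x1ff;    /* page global directory index */
    uint64_t pud = (vaddr >> 30) & 0x1ff;    /* page upper directory index  */
    uint64_t pmd = (vaddr >> 21) & 0x1ff;    /* page middle directory index */
    uint64_t pte = (vaddr >> 12) & 0x1ff;    /* page table entry index      */
    uint64_t off = vaddr & 0xfff;            /* byte offset within the page */

    printf("PGD=%" PRIu64 " PUD=%" PRIu64 " PMD=%" PRIu64
           " PTE=%" PRIu64 " offset=0x%" PRIx64 "\n",
           pgd, pud, pmd, pte, off);
    return 0;
}
```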
It is clear that there are two key pieces to a process’ hardware context: its register state and its page table, which need to be saved or switched during a context switch. Now let’s talk about when and how this context switch happens.
The Process of Context Switching
The kernel usually performs a context switch when the current process has consumed its allotted slice of CPU time. Apart from that, the kernel may also context switch the process when it performs a blocking operation that may take a long time to finish. These are not the only scenarios in which a context switch happens, but they are two of the most common ones.
The exact mechanics involved in saving the context are quite convoluted and need their own article, so we will not get into that part today. But the following code from the Linux kernel shows what happens during a context switch.
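Below is a heavily abridged sketch of context_switch() from kernel/sched/core.c. The real function also handles lazy-TLB kernel threads and other architecture-specific details omitted here, and the exact code differs across kernel versions, so treat this as an outline rather than verbatim source:

```c
/*
 * Abridged sketch of context_switch() (kernel/sched/core.c).
 * Error paths, lazy-TLB handling for kernel threads, and many
 * architecture-specific details are omitted; not verbatim kernel code.
 */
static struct rq *context_switch(struct rq *rq, struct task_struct *prev,
                                 struct task_struct *next, struct rq_flags *rf)
{
    prepare_task_switch(rq, prev, next);

    /*
     * Step 1: switch the address space. On x86 this ultimately loads the
     * new task's page table root into CR3.
     */
    switch_mm_irqs_off(prev->active_mm, next->mm, next);

    /*
     * Step 2: switch the CPU state. switch_to() saves the outgoing task's
     * registers on its kernel stack, switches kernel stacks, and restores
     * the incoming task's registers.
     */
    switch_to(prev, next, prev);
    barrier();

    return finish_task_switch(prev);
}
```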
As you can see in the code, the kernel performs the context switch in two steps:
Page table switch: Switches the CR3 register value to point to the new task’s page table
Saving and restoring registers: Saves the current task’s registers on the kernel stack, switches to the new task’s kernel stack, and finally restores the new task’s registers
The actual low-level implementation details behind a context switch are very interesting; we will discuss those in another article. But if you are interested in learning on your own, see chapter 3 of Understanding the Linux Kernel, 3rd Edition.
With all of that background behind context switching, we are ready to discuss its impact on the performance of applications.
Performance Cost of Context Switching
The overall impact of context switching is hard to quantify exactly because it depends on the specific interaction between the hardware and the software. For instance, one widely shared infographic puts its cost in a very large bracket of 10,000 to 1 million CPU cycles.
When we want to talk about the cost of context switching, there is a direct cost and an indirect cost associated with it.
The direct cost is simply the amount of time spent doing the context switch itself. This includes the time the kernel takes to find the next runnable process, save the state of the current process, and finally restore the state of the new process.
This direct cost is fairly constant and doesn’t change drastically from one context switch to another. A few studies have measured it to be around 1000-2000 CPU cycles (see references [1] and [2] in the References section), but those studies are quite old and the numbers on newer hardware and kernels may be somewhat different. The major point is that this is not the most expensive part of a context switch.
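A rough way to see this direct cost on your own machine is a pipe ping-pong between two processes pinned to the same core, so that every message forces a context switch. The sketch below measures the round-trip time and divides by two; it includes pipe and syscall overhead, so treat the result as an upper bound on the per-switch cost (error handling is omitted for brevity):

```c
/* Rough estimate of the direct cost of a context switch.
 * Two processes pinned to the same CPU bounce a byte back and forth;
 * each round trip involves (at least) two context switches. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    int p2c[2], c2p[2];
    const long iters = 100000;
    char buf = 'x';

    /* Pin everything to CPU 0 so the two processes must alternate. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    sched_setaffinity(0, sizeof(set), &set);

    pipe(p2c);
    pipe(c2p);

    if (fork() == 0) {                 /* child: echo every byte back */
        for (long i = 0; i < iters; i++) {
            read(p2c[0], &buf, 1);
            write(c2p[1], &buf, 1);
        }
        _exit(0);
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) { /* parent: ping, wait for pong */
        write(p2c[1], &buf, 1);
        read(c2p[0], &buf, 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.0f ns per context switch (upper bound)\n", ns / (iters * 2.0));
    return 0;
}
```

The perf tool ships a very similar benchmark as perf bench sched pipe, if you prefer not to write your own.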
A context switch also disturbs the state of various hardware resources, which ends up hurting the performance of the process indirectly. This is the indirect cost of context switching. Let’s understand these indirect costs by going through each hardware component that is affected by context switches.
Performance Impact Due to TLB Contention
The instructions in a process refer to data in main memory by virtual addresses, but the processor needs the corresponding physical addresses in order to retrieve the data from physical memory. For this it must translate each virtual address into a physical address by walking the page table, which can be very expensive: in the worst case it may involve 4-5 memory accesses and hundreds of cycles overall. To save this cost, each core has a small cache of recently translated addresses, called the translation lookaside buffer (TLB).
Because the TLB contains data related to the virtual memory of the current process, on some systems it may need to be flushed during context switches.
However, these days most processors (such as ARMv7+, ARM64, and X86-64) support address space identifiers (ASIDs) or process context identifiers (PCIDs): a unique identifier associated with each process. The TLB entries are tagged with the PCID of the owning process, which ensures that one process cannot use the cached translations of another. This way, the TLB can hold entries for multiple processes at the same time and does not need to be flushed during a context switch. (A quick way to check for this support on your own machine is shown below.)
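On X86-64, one way to check whether the processor advertises PCID (and the related INVPCID instruction) is to query CPUID. The sketch below uses the GCC/Clang-specific helpers from <cpuid.h>; the bit positions come from Intel’s CPUID documentation:

```c
/* Check for PCID and INVPCID support on x86-64 via CPUID (GCC/Clang). */
#include <cpuid.h>
#include <stdio.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;

    /* CPUID leaf 1: ECX bit 17 indicates PCID support. */
    __get_cpuid(1, &eax, &ebx, &ecx, &edx);
    int pcid = (ecx >> 17) & 1;

    /* CPUID leaf 7, subleaf 0: EBX bit 10 indicates INVPCID support. */
    __get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx);
    int invpcid = (ebx >> 10) & 1;

    printf("PCID: %s, INVPCID: %s\n", pcid ? "yes" : "no", invpcid ? "yes" : "no");
    return 0;
}
```

On Linux, grepping for the pcid flag in /proc/cpuinfo reports the same information.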
That said, the TLB cache entries of the old process may still get evicted as a result of contention for space from other processes.
The cost of a TLB miss is very non-deterministic. It can be as small as 10-20 CPU cycles (e.g. if the L2 TLB has the translation), or it can take hundreds of cycles if the hardware needs to perform a page table walk (this Stack Overflow answer provides many pointers on the cost of a TLB miss).
The bottom line is that TLB misses can be very problematic for performance, and processes with large address spaces may want to avoid paying this penalty.
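One common way to reduce this penalty for processes with large address spaces is to use larger pages, so that each TLB entry covers more memory. Below is a minimal sketch that asks the kernel to back a large allocation with transparent huge pages via madvise(MADV_HUGEPAGE); this is only a hint, and whether it takes effect depends on how THP is configured on the system:

```c
/* Request transparent huge pages for a large allocation to reduce TLB misses.
 * MADV_HUGEPAGE is a hint; the kernel may or may not back the range with
 * 2 MiB pages depending on the system's THP configuration. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 1UL << 30;   /* 1 GiB working set */

    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Hint that this range should be backed by huge pages (2 MiB on x86-64),
     * cutting the number of TLB entries needed by a factor of 512. */
    if (madvise(buf, len, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");

    memset(buf, 0, len);      /* touch the memory so pages are actually allocated */
    munmap(buf, len);
    return 0;
}
```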
Performance Impact Due to Cache Contention
The CPU caches are critical for the performance of any software. The latency of an L1 cache hit is 3-4 CPU cycles, while the latency of reading from the main memory is ~200 CPU cycles, i.e. ~50x slower.
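This gap is easy to observe with a dependent-load (pointer-chasing) microbenchmark: each load depends on the previous one, so the loop time reflects the latency of wherever the data happens to live. The buffer sizes and step count below are arbitrary illustrative choices, and the absolute numbers will vary from machine to machine:

```c
/* Rough memory-latency probe: chase pointers through a random cycle so each
 * load depends on the previous one. A buffer that fits in L1 shows a few ns
 * per load; a buffer far larger than the last-level cache approaches DRAM
 * latency. Numbers are machine dependent; this is only a sketch. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double ns_per_load(size_t n, long steps) {
    size_t *next = malloc(n * sizeof(size_t));
    for (size_t i = 0; i < n; i++)
        next[i] = i;

    /* Sattolo's algorithm: build a single random cycle covering all elements,
     * so the chase really walks the whole buffer. */
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    struct timespec t0, t1;
    volatile size_t idx = 0;               /* volatile keeps the loop honest */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long s = 0; s < steps; s++)
        idx = next[idx];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    free(next);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / steps;
}

int main(void) {
    long steps = 20 * 1000 * 1000;
    /* 16 KiB fits comfortably in L1; 512 MiB is far larger than any cache. */
    printf("16 KiB buffer:  %.1f ns/load\n",
           ns_per_load(16 * 1024 / sizeof(size_t), steps));
    printf("512 MiB buffer: %.1f ns/load\n",
           ns_per_load(512UL * 1024 * 1024 / sizeof(size_t), steps));
    return 0;
}
```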
These caches contain data belonging to the address space of a specific process. When another process runs on the CPU, we don’t want it to be able to read the previous process’ memory, so the caches may need to be flushed. Whether that is necessary depends on the type of cache used by the hardware. There are four possible types of caches, depending on how they index and tag the data.
The index determines how the hardware locates data in the cache. Typically, the hardware maps the memory address being looked up to one of the cache line entries using a simple address mapping function that extracts certain bits from the address. The mapping can be based on either the virtual address or the physical address of the data.
A tag is a portion of the memory address stored alongside the cached data to help uniquely identify it. The hardware uses the tag to verify that the cache line contains the data requested by the process. Since multiple memory addresses can map to the same cache line, the tag helps resolve these ambiguities. Depending on the cache design, the tag may be derived from either the virtual address (requiring flushing during context switches) or the physical address (which often reduces the need for flushing).
Based on the various combinations of the index and tags, there are four types of caches which are possible and their characteristics define whether flushing is necessary after a context switch.
Physically Indexed, Physically Tagged (PIPT): These caches are indexed using the physical address and therefore require an address translation for each read or write, which makes them relatively slow. However, they do not require flushing during context switches, because physical addresses unambiguously identify the data regardless of which process’ virtual address space it belongs to.
Virtually Indexed, Virtually Tagged (VIVT): These caches are much faster because reading or writing them does not require address translation. However, during context switches they may require flushing to prevent the new process from reading the previous process’ memory. A solution to avoid flushing is to tag the cached data with the PCID of the process.
Virtually Indexed, Physically Tagged (VIPT): This is the most commonly used cache type in modern hardware. Because they are physically tagged, an address translation is still required for validating the cache entry, but the hardware can issue the cache lookup and address translation requests in parallel. Because of physical tagging, these caches do not require flushing during context switches.
Physically Indexed, Virtually Tagged (PIVT): These kinds of caches are not used in real-world hardware, so we will not discuss them here.
Long story short, on modern processors the CPU data cache is usually not flushed during a context switch (assuming VIPT-type caches). However, CPU caches are quite small, so some or all of the cache lines belonging to the previous process may have been evicted by the time it is scheduled back on the CPU, which can have a huge impact on its performance.
Performance Impact Due to CPU Pipeline Flush
Although the TLB and cache flushes during a context switch can often be avoided on modern systems, one unavoidable event is the flushing of the CPU pipeline.
A pipelined processor splits the execution of a single instruction into several stages, much like the assembly line in a car manufacturing factory. In each cycle, an instruction moves to the next stage, making space for a new instruction in the first stage. This way the processor can continue to issue one new instruction every cycle and when the pipeline is full, it can also retire one instruction every cycle. A full pipeline reflects the optimum usage of the available execution resources on the CPU.
After a context switch, the process incurs a performance penalty due to the pipeline flush. The time required to refill the pipeline depends on the processor architecture and pipeline depth, typically ranging from 10 to 50 cycles on modern CPUs. During this period the CPU operates below its peak efficiency, because instructions cannot be retired at full throughput. Pipeline flushes are thus a hidden cost of context switching, and in environments with very frequent switches the cumulative overhead can significantly impact overall system performance.
Performance Impact Due to Branch Predictor State
Modern CPUs achieve high performance by leveraging instruction-level parallelism and out-of-order execution to process multiple instructions simultaneously. However, this introduces challenges when executing conditional branch instructions. When the outcome of a branch condition is still being computed by another instruction, the CPU must decide whether to execute one of the possible paths speculatively.
To keep the pipeline full and avoid stalling, CPUs use branch predictors, which estimate the branch outcome based on historical patterns of the specific branch instruction.
Correct Prediction: If the branch predictor guesses correctly, the CPU continues executing instructions at a high throughput.
Misprediction: If the prediction is wrong, the CPU must discard the speculatively executed instructions by flushing the pipeline. The pipeline is then refilled with instructions from the correct branch target, causing a performance penalty similar to a pipeline flush.
Branch prediction units use structures like the Branch Target Buffer (BTB) and Branch History Buffer (BHB) to track the branching history of instructions. These structures have limited capacity, meaning they can only maintain branching pattern information for a finite number of instructions.
During a context switch:
Branch Predictor State: The predictor state is not explicitly flushed, but the new process gradually overwrites the history of the previous process as it executes its own branches.
Impact on Performance: When the old process resumes, its branch predictor history may have been partially or fully replaced. This can result in a higher rate of branch mispredictions initially, as the predictor must relearn the branching patterns of the resumed process. This rebuilding phase can degrade performance until the predictor adapts again.
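The cost of mispredictions is easy to observe even without context switches. The classic sorted-versus-unsorted experiment below runs the same branchy loop over random bytes before and after sorting them; once the data is sorted, the branch becomes almost perfectly predictable and the loop runs several times faster. Build it with a low optimization level (e.g. -O1), since at higher levels the compiler may replace the branch with a conditional move and hide the effect:

```c
/* Branchy sum over random bytes: with sorted data the branch is predictable
 * and the loop runs much faster than with the same data unsorted. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int cmp(const void *a, const void *b) {
    return *(const unsigned char *)a - *(const unsigned char *)b;
}

static double sum_ms(const unsigned char *data, size_t n, long *out) {
    struct timespec t0, t1;
    long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int pass = 0; pass < 100; pass++)
        for (size_t i = 0; i < n; i++)
            if (data[i] >= 128)          /* hard to predict on random data */
                sum += data[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    *out = sum;
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}

int main(void) {
    size_t n = 1 << 20;
    unsigned char *data = malloc(n);
    for (size_t i = 0; i < n; i++)
        data[i] = rand() & 0xff;

    long s;
    printf("unsorted: %.1f ms\n", sum_ms(data, n, &s));
    qsort(data, n, 1, cmp);              /* sorting makes the branch predictable */
    printf("sorted:   %.1f ms\n", sum_ms(data, n, &s));
    free(data);
    return 0;
}
```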
Speculative Execution Vulnerabilities and their Impact on Context Switching
Before wrapping up this article, it's important to discuss speculative execution vulnerabilities, as they are a hot topic these days. Over recent years, we've seen many of these vulnerabilities emerge in modern processors, with Spectre and Meltdown being among the most notable. These vulnerabilities exploit the processor's speculative execution capabilities—like branch prediction and speculative data prefetching—to access out-of-bounds data temporarily.
During context switches, the hardware might inadvertently leave behind speculative execution results in caches or buffers. With some clever engineering, attackers could use this to peek into another process’s memory, potentially grabbing sensitive data like private keys or passwords.
To mitigate these attacks, it became essential that one process not be able to access another process’ memory under any circumstances. This led to kernel-level mitigations that disabled or flushed the speculative execution hardware during context switches, including TLB flushes, L1 data cache flushes, and branch predictor buffer flushes.
However, the necessity for such measures can vary. Over time, processor manufacturers have introduced microcode updates and hardware improvements that reduce the need for these flushes, especially during regular process context switches.
Here's how it typically stands during process switches in the Linux kernel:
Branch Predictor Buffers: Even though the branch predictor buffers were one of the primary mechanisms behind these attacks and the mitigations require them to be flushed, doing so has a very high impact on the overall performance of the system. As a result, the Linux kernel has made it a configurable setting. By default these buffers are not flushed, but the flush can be turned on for systems where untrusted processes may run.
TLB Flushes: Only required if the processor lacks support for address space identifiers (ASIDs). Other circumstances may trigger TLB flushes, but these are unrelated to speculative execution mitigations.
L1 Data Cache: Like the branch predictor buffers, the L1 data cache isn't typically flushed during context switches because of the high performance cost. On systems running untrusted code, the kernel can be configured to perform a flush during context switches.
As always, these things keep changing as new vulnerabilities are found which require new mitigations.
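On Linux, you can see which of these mitigations the kernel believes are active by reading the files under /sys/devices/system/cpu/vulnerabilities/. Below is a small sketch that dumps their contents (equivalent to running grep over that directory):

```c
/* Print the kernel's reported speculative-execution mitigation status. */
#include <dirent.h>
#include <stdio.h>

int main(void) {
    const char *dir = "/sys/devices/system/cpu/vulnerabilities";
    DIR *d = opendir(dir);
    if (!d) {
        perror("opendir");
        return 1;
    }

    struct dirent *e;
    char path[512], line[512];
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.')
            continue;
        snprintf(path, sizeof(path), "%s/%s", dir, e->d_name);
        FILE *f = fopen(path, "r");
        if (!f)
            continue;
        if (fgets(line, sizeof(line), f))
            printf("%-24s %s", e->d_name, line);   /* e.g. spectre_v2: Mitigation: ... */
        fclose(f);
    }
    closedir(d);
    return 0;
}
```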
Final Thoughts
Context switching is a necessary evil in an operating system: it is what provides a high-throughput and responsive system. But it can have a dreadful impact on the performance of the processes involved.
The performance costs of context switching are non-deterministic and depend on the complex dynamics between the different hardware components and the software. In systems where low tail latencies are critical, this non-determinism can become problematic.
A few ways to prevent a critical process from being context switched are to pin it to a specific CPU, ensure nothing else runs on that core, and set a high priority for that process.
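Here is a minimal sketch of how a process can apply both ideas on Linux: pin itself to one core with sched_setaffinity() and move itself into a real-time scheduling class with sched_setscheduler(). The CPU number and priority value are arbitrary illustrative choices, the real-time class requires the appropriate privileges (CAP_SYS_NICE), and in practice you would also isolate the chosen core (e.g. with isolcpus or cpusets) so that nothing else is scheduled on it:

```c
/* Pin the current process to one CPU and give it a real-time priority so the
 * scheduler is far less likely to context switch it out. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    /* Pin to CPU 3 (an arbitrary choice for illustration). */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    /* Switch to the SCHED_FIFO real-time class with a high priority. */
    struct sched_param sp = { .sched_priority = 50 };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        perror("sched_setscheduler");

    /* ... latency-critical work runs here ... */
    return 0;
}
```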
Another alternative is to use user-space threads and a user-space scheduler. For instance, the Golang runtime multiplexes many goroutines onto a small number of OS threads and schedules the goroutines in user space. A user-space context switch is typically cheaper than a kernel context switch and, more importantly, far more deterministic, which plays well in systems where a consistent tail latency is desired.
Summary Notes
The hardware context of a process that needs to be saved and restored during a context switch mainly consists of its register state and its address space (i.e., the page tables).
Context switching can happen for many reasons but the most common ones are due to the process exhausting its allotted CPU time, or the process blocking itself.
There are two performance costs associated with a context switch. The direct cost is basically the amount of work required to perform the context switch. The indirect costs are due to the aftermath of context switch and the lost state of the process in the CPU caches, instruction pipeline, branch predictor and TLB cache.
A TLB flush is usually not required during a context switch because most modern processors support a unique per-process identifier, called the address space identifier (ASID) or process context identifier (PCID), which is used to tag TLB entries. However, the previous process’ TLB entries may still get evicted due to contention for space in the TLB, which can have a drastic impact on its performance.
CPU caches also usually do not need to be flushed on modern hardware because most of them are virtually indexed, physically tagged. However, the cache lines of the previous process may get evicted due to contention for space, again hampering its performance.
The CPU pipeline does get flushed during a context switch, and the process has to fill it up from scratch, which takes several cycles to reach peak throughput, resulting in underutilization of processor resources and reduced performance.
The branch predictor state is not flushed during a context switch, but the history of the previous process may get lost due to contention for space. As a result, the old process may encounter many branch mispredictions while its branch history is being rebuilt.
References