Two Threads, One Core: How Simultaneous Multithreading Works Under the Hood
Ever wondered how your CPU handles two tasks at once? Discover the magic of Simultaneous Multithreading and see what’s really going on inside.
Simultaneous multithreading (SMT) is a feature that lets a processor handle instructions from two different threads at the same time. But have you ever wondered how this actually works? How does the processor keep track of two threads and manage its resources between them?
In this article, we’re going to break it all down. Understanding the nuts and bolts of SMT will help you decide if it’s a good fit for your production servers. Sometimes, SMT can turbocharge your system's performance, but in other cases, it might actually slow things down. Knowing the details will help you make the best choice.
So, let’s dive in and figure out how SMT works, why it was invented in the first place, and what it means for you.
Disclaimer: Much of the discussion in this article is about Intel’s implementation of SMT, also called hyper-threading. It is based on their white paper published in 2002.
Watch Video Instead: If you prefer video over text, then you can also watch the recording of a live session I did on this topic:
Background and Motivation Behind Hyper-Threading
SMT was introduced to improve the utilization of the resources in the processor. At the microarchitecture level, processors consist of hundreds of registers, multiple load/store units and multiple arithmetic units. To utilize these better, processors also employ various techniques for instruction level parallelism (ILP), such as instruction pipelining, superscalar architecture, and out-of-order execution, to name a few.
A pipelined processor improves resource utilization by breaking down the execution of an instruction into multiple stages which form a pipeline, like the assembly line of a factory. In each cycle, an instruction moves from one stage of the pipeline to the next, and the processor feeds a new instruction into the first stage. The following figure shows how pipelining works for a pipeline of depth 5. As you can see, from the 5th cycle onwards the processor has up to 5 instructions in flight each cycle, and it completes one instruction every cycle after that.
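To make the timing concrete, here is a tiny sketch in C of that arithmetic, assuming an idealized pipeline with no stalls or hazards: with a pipeline of depth D, the first instruction completes after D cycles and every instruction after that completes one cycle later, so N instructions take roughly D + N - 1 cycles instead of D x N.

```c
#include <stdio.h>

/* Toy model of pipelined vs. non-pipelined execution time.
 * Assumes every instruction spends one cycle in each of the
 * `depth` pipeline stages and there are no stalls or hazards. */
static unsigned long pipelined_cycles(unsigned long n, unsigned long depth) {
    if (n == 0) return 0;
    return depth + (n - 1);   /* fill the pipeline once, then one instruction per cycle */
}

static unsigned long sequential_cycles(unsigned long n, unsigned long depth) {
    return n * depth;         /* each instruction occupies the whole pipeline alone */
}

int main(void) {
    unsigned long depth = 5;                    /* the 5-stage pipeline from the figure */
    unsigned long counts[] = {1, 10, 1000000};
    for (size_t i = 0; i < sizeof counts / sizeof counts[0]; i++) {
        unsigned long n = counts[i];
        printf("%lu instructions: %lu cycles pipelined vs %lu cycles unpipelined\n",
               n, pipelined_cycles(n, depth), sequential_cycles(n, depth));
    }
    return 0;
}
```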
Modern processors are also superscalar, which means instead of issuing one instruction each cycle, they can issue multiple instructions. For instance, recent Intel Core i7 processors can issue 4 instructions each cycle (this is also called the issue width of the processor).
Instruction pipelining and superscalar architecture significantly improve the instruction throughput and resource utilization of the processor. However, in practice, this maximum utilization can be difficult to achieve. To execute so many instructions in parallel, the processor needs to find enough independent instructions in the program, which is very hard.
This typically leads to two kinds of waste. One is horizontal waste, which occurs when the processor cannot find enough independent instructions in the thread to saturate its issue width.
The other is vertical waste, which occurs when the processor is unable to issue any instructions in a cycle because all of the next instructions in the program depend on the ones currently executing.
One way to better utilize the processing power is traditional multithreading, where the processor context switches between multiple threads. In this scheme, within a given cycle the processor issues instructions for only one thread, so it may still suffer horizontal waste. However, in the next cycle the processor can context switch and issue instructions for another thread, avoiding vertical waste. This improves CPU utilization; however, on processors with a large issue width, the horizontal waste can still be significant. There is also the overhead of context switching between the threads.
This is where the idea of simultaneous multithreading comes in. It enables the processor to issue instructions for multiple threads in the same cycle without any context switching overhead. By definition, instructions of different threads are independent and can be executed in parallel, which ultimately results in much better utilization of the execution resources.
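To see how this fills both kinds of wasted slots, here is a deliberately simplified simulation in C. It assumes a 4-wide core and a toy model where each thread can offer between 0 and 4 independent uops in a cycle; with SMT, the issue slots left empty by one thread are filled with uops from the other. The numbers are not measurements; they only illustrate the horizontal and vertical waste argument.

```c
#include <stdio.h>
#include <stdlib.h>

#define ISSUE_WIDTH 4
#define CYCLES      100000

/* Toy model: the number of independent uops a thread can offer in a
 * given cycle, drawn uniformly from 0..ISSUE_WIDTH. A real core's ILP
 * profile is far more complicated; this is only for illustration. */
static int ready_uops(void) {
    return rand() % (ISSUE_WIDTH + 1);
}

int main(void) {
    long single = 0, smt = 0;

    for (long c = 0; c < CYCLES; c++) {
        int t0 = ready_uops();
        int t1 = ready_uops();

        /* One thread alone: it can only fill as many slots as it has ready uops. */
        int issued = (t0 < ISSUE_WIDTH) ? t0 : ISSUE_WIDTH;
        single += issued;

        /* SMT: slots left empty by thread 0 are filled by thread 1's uops. */
        int spare = ISSUE_WIDTH - issued;
        issued += (t1 < spare) ? t1 : spare;
        smt += issued;
    }

    printf("average issue slots used per cycle (width %d):\n", ISSUE_WIDTH);
    printf("  single thread:   %.2f\n", (double)single / CYCLES);
    printf("  two SMT threads: %.2f\n", (double)smt / CYCLES);
    return 0;
}
```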
Even though the idea of SMT doesn’t put a limit on the number of threads, Intel’s implementation of SMT (called hyper-threading) restricts it to two threads per core.
Microarchitecture Level Implementation of Simultaneous Multithreading in Intel Processors
Now that we understand why SMT was introduced, let’s look at how it is implemented. Along with the implementation details, we will also cover how it actually works.
A normal non-SMT processor can only execute instructions for one thread at a time. This is because every thread has an associated context to represent the current state of the program on the processor, which is also called the architecture state. This includes the data in the registers, the program counter value, the control registers, and so on.
To simultaneously execute instructions of two threads, the processor needs to be able to represent the state of the two threads simultaneously. So to implement the SMT capability, the hardware designers duplicated the architecture state of the processor. By doing so, a single physical processor appears as two logical processors to the operating system (OS), so that it can schedule threads for execution on them.
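You can actually see this OS-visible split on a Linux machine: each logical processor shows up as a separate CPU under /sys/devices/system/cpu/, and its topology directory tells you which sibling logical CPU shares the same physical core. Here is a small Linux-specific sketch (it assumes the usual sysfs layout) that prints each logical CPU along with its SMT siblings:

```c
#include <stdio.h>

/* Linux-specific sketch: list each logical CPU and the sibling logical
 * CPUs that share its physical core (hyper-threaded siblings show up in
 * each other's thread_siblings_list). Assumes the standard sysfs layout. */
int main(void) {
    char path[128], buf[64];

    for (int cpu = 0; ; cpu++) {
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list", cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            break;                    /* no more logical CPUs */
        if (fgets(buf, sizeof buf, f))
            printf("logical cpu %d shares a core with: %s", cpu, buf);
        fclose(f);
    }
    return 0;
}
```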
Apart from the architecture state, at the microarchitecture level the processor also has various buffers and execution resources. To execute the instructions of two threads simultaneously, these resources are either duplicated or shared between the two logical processors. The decision of whether to duplicate or share a resource is based on many factors, for instance, how costly it is to duplicate in terms of power consumption and real estate on the chip.
The crucial details about how SMT works lie in its microarchitectural implementation, so let’s go deeper into that.
Processor Microarchitecture
The processor exposes the instruction set architecture (ISA) as the public interface for the programmers to program the CPU. The ISA includes the set of instructions, and the registers that the instructions can use. The microarchitecture of the processor is its internal implementation detail. Different processor models can support the same ISA but at the microarchitecture level they might be different.
The microarchitecture has three parts: the frontend, the backend and the retirement unit. The following diagram shows the schematics of the microarchitecture of a modern-day processor:
The frontend is the part which contains the instruction control unit that fetches and decodes the program instructions which should be executed next.
The backend consists of the execution resources, such as the physical registers, the arithmetic units, and the load/store units. It picks up the decoded instructions provided by the frontend, allocates execution resources for them and schedules them for execution.
The retirement unit is where the results of the executed instructions are finally committed to the architecture state of the processor.
Instruction Execution in an SMT Capable Processor
To understand how SMT works, we will go deeper into each of the three components of the CPU microarchitecture. Let’s start with the frontend.
SMT Implementation in the Frontend
The following figure shows a zoomed-in view of the microarchitecture frontend. It consists of several components, each with a distinct role in fetching and decoding instructions. Let’s talk about them one by one.
Instruction Pointers
To track which instructions to fetch, the frontend has an instruction pointer which holds the address of the next instruction of the program.
In the case of an SMT capable processor, there are two instruction pointers, one per logical processor, which track the next instruction of the two threads independently.
Trace Cache
The instruction pointers give the addresses of the next instructions of the threads, and the frontend has to read the instructions from those addresses. Before doing that, it first checks whether those instructions already exist in the trace cache.
The trace cache contains recently decoded traces of instructions. Instruction decoding is an expensive operation, and many instructions are executed repeatedly. Having this cache helps the processor cut down the instruction execution latency.
The trace cache is shared dynamically between the two logical processors on an as-needed basis. If one thread is executing more instructions than the other, it is allowed to occupy more entries in the trace cache.
Each entry in the cache is tagged with the thread information to distinguish the instructions of the two threads. The access to the trace cache is arbitrated between the two logical processors each cycle.
The Instruction Translation Lookaside Buffer (ITLB) Cache
If there is a miss in the trace cache, then the frontend looks for the instruction for the given address in the L1 instruction cache. If there is a miss in the L1 instruction cache, then it needs to fetch the instruction from the next level cache or the main memory.
The L1 instruction cache is looked up using virtual addresses, but accesses to main memory require physical addresses. To translate virtual addresses into physical addresses, the instruction translation lookaside buffer (ITLB) is used, which caches recently translated virtual addresses.
In an SMT capable processor, each logical processor has its own ITLB.
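As a quick refresher on what the ITLB actually caches: with 4 KiB pages, a virtual address splits into a virtual page number and a page offset. The TLB maps the virtual page number to a physical frame number, and the offset is carried over unchanged. Here is a minimal sketch of that arithmetic (the example address and frame number are made up; the real mapping comes from the page tables via a hardware page walk):

```c
#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12                      /* 4 KiB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

int main(void) {
    uint64_t vaddr = 0x00007f3a1c2b4d10;   /* an arbitrary example address */

    uint64_t vpn    = vaddr >> PAGE_SHIFT;         /* what the ITLB looks up */
    uint64_t offset = vaddr & (PAGE_SIZE - 1);     /* copied into the physical address */

    /* Pretend the ITLB returned this physical frame number for the VPN.
     * In reality it comes from the page tables via a hardware page walk. */
    uint64_t pfn   = 0x1a2b3;
    uint64_t paddr = (pfn << PAGE_SHIFT) | offset;

    printf("virtual  0x%016llx -> vpn 0x%llx, offset 0x%llx\n",
           (unsigned long long)vaddr, (unsigned long long)vpn,
           (unsigned long long)offset);
    printf("physical 0x%016llx (using the cached vpn -> pfn mapping)\n",
           (unsigned long long)paddr);
    return 0;
}
```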
The instruction fetch logic for fetching the instructions from main memory works on a first-come, first-served basis, but it reserves at least one request slot for each logical processor so that both can make progress.
Once the instructions arrive from main memory, they are kept in a small streaming buffer before they get picked up for decoding. These streaming buffers are small structures and are duplicated for the two logical processors in an SMT capable processor.
The Uop Queue
Once the instructions are fetched, they are decoded into smaller and simpler instructions called micro instructions (uops). These uops are put into the uop queue which acts as the boundary between the CPU frontend and backend.
The uop queue is partitioned in half between the two logical processors. This static partitioning enables both logical processors to make independent progress.
SMT Implementation in the Microarchitecture Backend
Once the uop queue has micro instructions ready, the role of the backend starts. The following figure shows a zoomed-in view of the backend of an Intel x86 processor.
Let’s talk about what happens in the backend, component by component.
Resource Allocator for Out-of-Order Execution
The backend picks up the micro instructions from the uop queue and executes them. However, it executes them out of their original program order.
Nearby instructions in a program are typically dependent on each other, and an instruction may stall because it performs a long-latency operation, such as reading from main memory. As a result, all of its dependent instructions also have to wait, and the processor’s resources sit idle. To alleviate this problem, an out-of-order execution engine is employed, which picks up later, independent instructions of the program and executes them out of their original order.
The out-of-order execution engine consists of an allocator which identifies the resources required by these micro instructions and allocates them based on their availability.
The allocator allocates resources for the micro instructions of one logical processor in one cycle and then switches to the other logical processor in the next cycle. If the uop queue has micro instructions for only one of the logical processors, or one of the logical processors has exhausted its share of resources, then the allocator uses all the cycles for the other logical processor.
Shared Execution Resources
So what are these resources that the allocator allocates to the micro instructions and how are they shared?
The first resource that the micro instructions need is registers. At the ISA level the processor has only a few registers (e.g., x86-64 has 16 general-purpose integer registers), but at the microarchitecture level there are hundreds of physical integer registers and a similar number of floating-point registers. In an SMT enabled processor, these registers are divided equally between the two logical processors.
Apart from the registers, the backend also has a number of load and store buffers. These buffers are used for doing memory read and write operations. Again, in an SMT enabled processor, they are divided equally between the logical processors.
Register Renaming
To enable out-of-order execution, the backend also needs to perform register renaming. Because there are only a handful of architectural registers at the ISA level, program instructions end up reusing the same register across many independent instructions. The out-of-order execution engine wants to execute these instructions ahead of their original order and in parallel. To do so, it renames the architectural registers used in the program instructions to physical registers. This mapping is maintained in the register alias table (RAT).
Because the two logical processors have their own sets of architectural registers, they also have their own copy of the RAT.
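Here is a deliberately tiny sketch of what the RAT does, with made-up register file sizes and a trivial free list. Two writes to the same architectural register (rax) get mapped to two different physical registers, so they can be in flight at the same time, and because each logical processor has its own RAT, the same architectural register in the two threads maps to unrelated physical registers.

```c
#include <stdio.h>

#define NUM_ARCH_REGS 16        /* x86-64's general-purpose integer registers */
#define NUM_PHYS_REGS 180       /* made-up size of the physical register file */

/* One register alias table per logical processor: architectural reg -> physical reg. */
typedef struct {
    int map[NUM_ARCH_REGS];
} rat_t;

static int next_free = 0;       /* toy free list: hand out physical registers in order */

/* Rename a write to architectural register `arch` through the given thread's RAT. */
static int rename_write(rat_t *rat, int arch) {
    int phys = next_free++ % NUM_PHYS_REGS;
    rat->map[arch] = phys;      /* later readers of `arch` will see this mapping */
    return phys;
}

int main(void) {
    enum { RAX = 0 };           /* pretend architectural register numbering */
    rat_t thread0 = {{0}}, thread1 = {{0}};

    /* Thread 0 executes two independent writes to rax: they get distinct
     * physical registers, so both can be in flight at the same time. */
    printf("thread0: rax -> p%d\n", rename_write(&thread0, RAX));
    printf("thread0: rax -> p%d\n", rename_write(&thread0, RAX));

    /* Thread 1 also writes rax, but through its own RAT, so its mapping
     * never collides with thread 0's. */
    printf("thread1: rax -> p%d\n", rename_write(&thread1, RAX));
    return 0;
}
```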
Instruction Ready Queues
After the register renaming and allocator stages, the instructions are almost ready to execute. They are put into two sets of queues — one is for the memory read/write instructions and the other is for all other general instructions. These queues are also partitioned equally between the two logical processors in an SMT enabled core.
Instruction Schedulers
The processor has multiple instruction schedulers which operate in parallel. In each CPU cycle, some of the instructions from the instruction ready queues are pushed to the schedulers. The queues alternate between the two logical processors each cycle, i.e., in one cycle they push the instructions of one logical processor and in the next they switch to the other.
Each scheduler itself has a small internal buffer to store these pushed instructions temporarily until the scheduler can schedule them for execution. Each of these instructions need certain operands and execution units to be available. As soon as the required data and resources for one of the instructions become available, the scheduler dispatches that instruction for execution.
The schedulers do not care about the logical processors: they dispatch a micro instruction as soon as the resources it requires are available. But to ensure fairness, there is a limit on the number of active entries a logical processor can have in a scheduler’s queue.
Reorder Buffer
After the execution of an instruction finishes and its result is ready, it is placed in the reorder buffer. Even though the instructions are executed out-of-order, they need to be committed to the processor’s architecture state in their original program order. The reorder buffer enables this.
The reorder buffer is split equally between the two logical processors in an SMT enabled core.
Retirement Unit
The retirement unit tracks when the instructions are ready to be committed to the architecture state of the processor and retires them in their correct program order.
In an SMT enabled processor core, the retirement unit alternates between the micro instructions for each logical processor. If one of the logical processors does not have any micro instructions to be retired, then the retirement unit spends all the bandwidth on the other logical processor.
After an instruction retires, it might also have to write to the L1 cache. At this point, selection logic comes into the picture to perform these writes, and it also alternates between the two logical processors each cycle when writing data to the cache.
Memory Subsystem
While we have covered how the execution resources of a processor core are shared between the logical processors for an SMT enabled system, we have not talked about memory access. Let’s discuss what happens there.
The Translation Lookaside Buffer
The translation lookaside buffer (TLB) is a small cache which holds the translation of virtual addresses to physical addresses for data requests. The TLB is shared dynamically between the two logical processors on an as needed basis. To distinguish the entries for the two logical processors, each entry is also tagged with the logical processor id.
The L1, L2 and L3 Caches
Each CPU core has its own private L1 cache. Depending on the microarchitecture, the L2 cache might also be private or it might be shared between the cores. If there is an L3 cache, it is shared between the cores. The caches are also oblivious to the existence of the logical processors.
As the L1 (and possibly L2) cache is private to the core, it will contain data for both the logical processors on an as-needed basis. This can cause conflicts and evictions of each other’s data and hamper performance. On the other hand, if the threads running on the two logical processors are working with the same set of data, the shared cache might improve their performance.
Performance Impact of SMT
At this point we have covered almost everything about how SMT is implemented at the microarchitecture level and how the instructions for the two logical processors are executed in parallel. Now let’s discuss the performance impact.
Running a Single Thread on an SMT Enabled Core
As we have seen, enabling SMT on a CPU core requires sharing many of the buffers and execution resources between the two logical processors. Even if only one thread is running on an SMT enabled core, these resources remain unavailable to that thread, which reduces its potential performance.
Apart from the wasted shared resources in the absence of a second thread, there is another performance impact. The operating system runs an idle loop on the unused logical processor which waits for instructions to arrive. This loop also consumes resources which could otherwise be used to let the other logical processor run at its peak potential.
On more recent Intel Core processors, resources do not appear to be partitioned when only one thread is running on a core; the single thread gets access to all of them. Intel mentions this as an improvement introduced in that generation of processors. Source: Intel Technology Journal, Vol. 14, Issue 3, 2010
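On Linux you can check, and on recent kernels even toggle at runtime, whether SMT is in use through sysfs. Here is a minimal sketch that assumes the /sys/devices/system/cpu/smt/ interface exposed by recent kernels; disabling SMT by writing off to the control file requires root and can also be done with a plain echo from the shell.

```c
#include <stdio.h>

/* Print the current SMT state as reported by the kernel.
 * `active` typically reads as 0 or 1; `control` reads as
 * on / off / forceoff / notsupported. Requires a reasonably
 * recent Linux kernel that exposes this interface. */
static void show(const char *path) {
    char buf[64];
    FILE *f = fopen(path, "r");
    if (!f) {
        printf("%s: not available on this kernel\n", path);
        return;
    }
    if (fgets(buf, sizeof buf, f))
        printf("%s: %s", path, buf);
    fclose(f);
}

int main(void) {
    show("/sys/devices/system/cpu/smt/active");
    show("/sys/devices/system/cpu/smt/control");
    /* To disable SMT at runtime (as root):  echo off > /sys/devices/system/cpu/smt/control */
    return 0;
}
```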
Running Two Threads on an SMT Enabled Core
If you have two threads running on the two logical processors, then one of the things to think about is their cache access patterns. If the threads are using the cache aggressively and competing for it, they are bound to run into conflicts and evict each other’s data, which will degrade their performance.
On the other hand, if the threads are cooperative in nature, they might help improve each other’s performance. For instance, if one thread produces data which is consumed by the other thread, their performance may improve because of the data sharing in the cache.
If the two threads are not competing for cache, then they might be able to run fine without hampering each other’s performance, while improving the resource usage of the CPU core.
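If you want to find out how your own workload behaves, one practical experiment is to pin two worker threads either to the two sibling logical CPUs of one core or to two different physical cores, and compare the runtimes. Here is a rough Linux-specific sketch; the CPU numbers in it are assumptions, so replace them with the sibling pairs reported by thread_siblings_list on your machine, and time the program (e.g. with `time`) once for each pairing.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* A stand-in for your real workload: just burn some CPU. */
static void *worker(void *arg) {
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 500000000UL; i++)
        x += i;
    (void)arg;
    return NULL;
}

/* Start one worker thread pinned to the given logical CPU. */
static pthread_t start_pinned(int cpu) {
    pthread_t t;
    pthread_attr_t attr;
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_attr_init(&attr);
    pthread_attr_setaffinity_np(&attr, sizeof set, &set);
    pthread_create(&t, &attr, worker, NULL);
    pthread_attr_destroy(&attr);
    return t;
}

int main(void) {
    /* Assumed topology: CPUs 0 and 4 are SMT siblings of the same core,
     * while CPUs 0 and 1 sit on different cores. Check
     * thread_siblings_list on your machine and adjust. */
    pthread_t a = start_pinned(0);
    pthread_t b = start_pinned(4);   /* change to 1 to use two separate cores */

    pthread_join(a, NULL);
    pthread_join(b, NULL);
    puts("done");
    return 0;
}
```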
However, many experts believe that when absolute maximum performance is needed for a program, it is best to disable SMT so that the single thread will have all the resources available to it.
Security Vulnerabilities around SMT
Apart from performance, there are also security issues associated with SMT which were discovered in recent years (see this and this for examples). Because of the shared resources and speculative execution of instructions, many of these issues open up the possibility of leaking sensitive data to an attacker. As a result, the general advice has been to disable SMT in security-sensitive systems. There are also rumors that, because of these issues, Intel might remove hyper-threading from their next generation of processors (Arrow Lake).
Closing Thoughts
Let's wrap things up. Understanding how Simultaneous Multithreading (SMT) works is super helpful when you’re deciding whether or not to use it in your production servers. SMT was designed to make better use of CPU resources and boost instruction throughput. While it does a good job of that by letting multiple threads run at the same time, there are definitely some trade-offs to keep in mind.
Inside the processor, SMT means duplicating certain parts and sharing or dividing up others between the threads. This can lead to mixed results depending on what kind of work your CPU is handling. Sure, SMT can improve resource usage and system throughput overall, but it can also cause competition for shared resources, which can slow down individual threads.
Security is another big factor. Recent vulnerabilities have shown that sharing resources in SMT-enabled CPUs can be risky. Sensitive data could end up getting exposed, which is why experts often recommend disabling SMT in security-critical systems.
So, should you use SMT? It really depends. If your workloads need the highest performance and lowest latency, turning SMT off might give you that edge. But if you’re running general-purpose tasks that can benefit from more parallelism, keeping SMT on could be a win.
By getting a handle on these details, you’ll be better equipped to decide what's best for your setup, ensuring you get the most efficient—and secure—performance out of your servers.
References
SYNPA: SMT Performance Analysis and Allocation of Threads to Cores in ARM Processors
Intel Technology Journal, Vol 14, Issue 3, 2010
Support Confessions of a Code Addict
If you find my work interesting and valuable, you can support me by opting for a paid subscription (it’s $6 monthly/$60 annual). As a bonus you get access to monthly live sessions, and all the past recordings.
Many people report failed payments or don’t want a recurring subscription. For that, I also have a buymeacoffee page, where you can buy me coffees or become a member. I will upgrade you to a paid subscription of the equivalent duration here.
I also have a GitHub Sponsor page. You will get a sponsorship badge, and also a complimentary paid subscription here.