Confessions of a Code Addict

A Systems Engineer’s Guide to Benchmarking with RDTSC

A deep dive into rdtsc, instruction stream serialization, and memory fences for precise cycle-level performance measurement.

Abhinav Upadhyay
Oct 23, 2025
Cover image: Depicting rdtsc, lfence and rdtscp instruction with a backdrop of a clock

Performance is critical for systems programmers, and accurate benchmarking is the foundation of meaningful optimization. To truly understand where your code spends time, you need precise and low-overhead measurements, especially when a piece of code may execute in just a few hundred CPU cycles.

Most developers reach for familiar high-level timers, such as Python’s time.perf_counter() or Java’s System.currentTimeMillis(). These are convenient, but they rely on OS timing calls like clock_gettime, which can introduce hundreds of cycles of overhead. In certain situations, this overhead is too much. And when profiling production systems, you want the overhead to be as small as possible.

We need a way to read time directly from the hardware, without leaving user space. On x86 systems, that mechanism is the rdtsc instruction. It gives us near-zero-overhead access to the CPU’s internal timestamp counter, but using it correctly requires an understanding of how modern processors execute instructions.
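To get a feel for the overhead in question, here is a rough sketch that times a batch of clock_gettime calls with the timestamp counter and reports the average cost per call. The helper name clock_gettime_cost_ticks is made up for illustration, and the measurement is deliberately unfenced, so treat the result as a ballpark figure only.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>
#include <x86intrin.h>

/* Hypothetical helper: average number of TSC ticks spent per
 * clock_gettime() call, measured over `calls` back-to-back calls.
 * `calls` must be greater than zero. */
static uint64_t clock_gettime_cost_ticks(int calls)
{
    struct timespec ts;

    uint64_t start = __rdtsc();
    for (int i = 0; i < calls; i++) {
        clock_gettime(CLOCK_MONOTONIC, &ts);
    }
    uint64_t end = __rdtsc();

    return (end - start) / (uint64_t)calls;
}
```

Calling clock_gettime_cost_ticks(1000) a few times and looking at the smallest result gives a reasonable estimate. The number depends heavily on the platform: on systems where the C library can serve the call without entering the kernel it tends to be much lower than when a real system call is involved.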

In this article, we’ll learn how to use rdtsc for benchmarking. Specifically, we will cover the following topics in detail:

  • What rdtsc does: How it reads the CPU’s internal timestamp counter and why it provides near-zero-overhead timing.

  • Understanding CPU behavior: How out-of-order execution can distort timing results and why instruction ordering matters.

  • Instruction stream serialization: What it means, how the CPU reorders instructions, and how serializing instructions (like cpuid) enforce strict ordering.

  • Memory fences: How lfence, sfence, and mfence provide lighter-weight ordering guarantees that help isolate measurement code.

  • Combining it all: Practical example of using these mechanisms together to obtain stable and reproducible timing measurements.

By the end, you’ll know not only how to use rdtsc safely and accurately but also why these extra steps are essential for meaningful microbenchmarking.


Understanding The Timestamp Counter in the CPU

In the x86 architecture, every CPU comes with a special 64-bit counter, called the timestamp counter (TSC), that is incremented at a fixed frequency. If you read the value of the counter before and after a block of code executes, the difference tells you how many counter ticks that code took.

When the counter overflows, it resets to 0. However, because it is a 64-bit counter, it takes an extremely long time to overflow. For instance, if the counter increments at 1 GHz, it will take roughly 585 years to overflow.
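The overflow estimate is simple arithmetic: 2^64 ticks divided by the tick rate gives seconds, and dividing by the seconds in a year gives years. A small sketch, where the function name is hypothetical and a 365-day year is assumed:

```c
/* Years until a 64-bit counter wraps around when it ticks `hz` times
 * per second. Assumes 365-day years (an assumption for illustration). */
static double tsc_years_to_overflow(double hz)
{
    const double ticks = 18446744073709551616.0;             /* 2^64 */
    const double seconds_per_year = 365.0 * 24.0 * 60.0 * 60.0;

    return ticks / hz / seconds_per_year;
}
```

With hz set to 1e9 this comes out just under 585 years. Even at a few GHz the counter still takes centuries to wrap, so overflow is not a practical concern for benchmarking.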

The frequency at which the timestamp counter increments is not the same as the real CPU frequency. In the past, it used to be related to the CPU frequency but as recent CPUs started to have dynamic frequency scaling, the timestamp counter was made to tick at a fixed constant frequency to get stable measurements. For example, some of the cores on my laptop have a frequency range of 800 MHz to 4800 MHz, but the TSC ticks at 2.3 GHz.
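Because the TSC ticks at its own fixed rate, converting ticks to wall-clock time requires knowing that rate. One way to estimate it, sketched below, is to count ticks across a short sleep and divide by the elapsed time reported by CLOCK_MONOTONIC. The helper names (monotonic_ns, estimate_tsc_hz, ticks_to_ns) are made up for this example, and the result is only an approximation whose accuracy improves with longer sleeps.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>
#include <x86intrin.h>

/* Current CLOCK_MONOTONIC time as a single nanosecond count. */
static uint64_t monotonic_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Hypothetical helper: estimate TSC ticks per second by counting
 * ticks across a sleep of roughly `sleep_ms` milliseconds. */
static double estimate_tsc_hz(long sleep_ms)
{
    struct timespec req = { sleep_ms / 1000, (sleep_ms % 1000) * 1000000L };

    uint64_t t0 = monotonic_ns();
    uint64_t c0 = __rdtsc();
    nanosleep(&req, NULL);
    uint64_t c1 = __rdtsc();
    uint64_t t1 = monotonic_ns();

    /* ticks / elapsed seconds */
    return (double)(c1 - c0) * 1e9 / (double)(t1 - t0);
}

/* Convert a tick count to nanoseconds for a given TSC frequency. */
static double ticks_to_ns(uint64_t ticks, double tsc_hz)
{
    return (double)ticks * 1e9 / tsc_hz;
}
```

For example, with a TSC frequency of 2.3 GHz, ticks_to_ns(2300, 2.3e9) is about 1000 ns. Note that this converts to time, not to core clock cycles: with dynamic frequency scaling the core may run faster or slower than the TSC.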

So, how do we read the TSC? The x86 instruction set provides two instructions for doing this: rdtsc and rdtscp. But measuring the timing of a block of code with them takes more than placing rdtsc before and after the block. In practice, it looks like the following code snippet:

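#include <stdint.h>     // uint32_t, uint64_t
#include <x86intrin.h>  // __rdtsc, __rdtscp, _mm_lfence

uint32_t cpuid;  // receives the IA32_TSC_AUX value that rdtscp returns
_mm_lfence();
uint64_t start = __rdtsc();

// ITERS is the iteration count, defined elsewhere
for (int i = 0; i < ITERS; i++) {
  // expensive loop body
}

uint64_t end = __rdtscp(&cpuid);
_mm_lfence();
uint64_t ncycles = end - start;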

In this snippet, I have used the GCC compiler intrinsics __rdtsc and __rdtscp for invoking the rdtsc and rdtscp instructions respectively. But you may ask, what is the significance of using _mm_lfence() before and after the measurement? You may also question why we used rdtsc for reading the starting value of the TSC and rdtscp for the ending measurement. To answer these questions, we have to go deeper and think about how the processor executes instructions.
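Before digging into the reasons, it may help to see the same pattern packaged into reusable helpers. The following is a minimal sketch, not a definitive implementation: the names tsc_begin, tsc_end, sum_to, and min_ticks_sum_to are hypothetical, and sum_to is only a stand-in for the code you actually want to measure. Taking the minimum over several runs is a common way to reduce noise from interrupts and cache effects.

```c
#include <stdint.h>
#include <x86intrin.h>

/* Start of a timed region: fence first, then read the counter. */
static inline uint64_t tsc_begin(void)
{
    _mm_lfence();
    return __rdtsc();
}

/* End of a timed region: read the counter with rdtscp, then fence. */
static inline uint64_t tsc_end(void)
{
    unsigned int aux;              /* IA32_TSC_AUX value, unused here */
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}

/* Stand-in workload. The volatile accumulator keeps the compiler
 * from optimizing the loop away. */
static uint64_t sum_to(uint64_t n)
{
    volatile uint64_t acc = 0;
    for (uint64_t i = 1; i <= n; i++) {
        acc += i;
    }
    return acc;
}

/* Smallest tick count observed for sum_to(n) over `runs` measurements. */
static uint64_t min_ticks_sum_to(uint64_t n, int runs)
{
    uint64_t best = UINT64_MAX;
    for (int r = 0; r < runs; r++) {
        uint64_t start = tsc_begin();
        (void)sum_to(n);
        uint64_t end = tsc_end();
        if (end - start < best) {
            best = end - start;
        }
    }
    return best;
}
```

A call such as min_ticks_sum_to(1000, 100) returns the best of 100 measurements in TSC ticks.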

Out of Order Execution and Serializing Instructions

Let’s step back a bit and talk about how the CPU executes instructions.

Modern x86 CPUs do out-of-order execution of the instruction stream to execute multiple instructions in parallel. They do this by looking at a window of instructions in the instruction stream, identifying independent instructions and executing them in parallel. As a result, an instruction that appears later in the program order may execute much earlier than its predecessors.

For example, imagine the instruction stream shown in the snippet below. Here, we are interested in measuring the time taken to execute instructions I4 to I6, so we have inserted one rdtsc instruction after I3 and another after I6.

I0, I1, I2, I3, rdtsc, I4, I5, I6, rdtsc,...

Due to the out-of-order nature of instruction execution, we cannot guarantee that the rdtsc instructions will execute exactly where they appear in program order. It is possible that the CPU executes the first rdtsc right after I1. In that case, our measurement will include the timing of I2 and I3 as well, which is not what we want.

We need a way to force the CPU to execute rdtsc in program order and to ensure that all the previous instructions have finished executing by the time it does. This can be achieved by forcing serialization of the instruction stream right before rdtsc. Let’s understand what that means.

Serializing the Instruction Stream

There are certain instructions in the x86 architecture that force serialization of the instruction stream. Basically, the serializing instruction acts like a barrier. The CPU cannot execute it until all the instructions appearing before it in the program have finished. Also, it cannot begin executing any instruction appearing after the serializing instruction until the serializing instruction has finished.

To be precise, a serializing instruction also requires that all modifications to flags, registers, and memory made by previous instructions are complete, and that all buffered writes are drained to memory, before the next instruction executes.

So, if we insert such a serializing instruction before rdtsc, then we can guarantee that the rdtsc instruction will not be executed by the processor out of its actual order.

There are a few such serializing instructions available in the x86 architecture, such as:

  • serialize: serializes the instruction stream

  • cpuid: identifies the CPU model and features

  • iret: returns control from an interrupt handler back to the interrupted application

  • rsm: resumes from system management mode

Out of these, iret and rsm modify control flow, so you cannot use them solely for the purpose of serializing the instruction stream. In the past, cpuid was the recommended instruction for use in combination with rdtsc, and it is still an option today. However, it adds overhead because, apart from serializing the instruction stream, the CPU has to do extra work to execute it. A much more lightweight alternative is the lfence instruction that we saw in the snippet above. lfence is not a true serializing instruction but a memory ordering instruction. However, it serves the purpose. Let’s understand what it does.
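To compare the two options on your own machine, the sketch below reads the TSC twice in a row behind each kind of barrier and keeps the smallest gap seen. All helper names here are hypothetical. The cpuid barrier uses inline assembly in GCC/Clang syntax, with leaf 0 chosen arbitrarily since only the serializing side effect matters.

```c
#include <stdint.h>
#include <x86intrin.h>

/* Execute cpuid purely for its serializing effect; results are discarded. */
static inline void cpuid_barrier(void)
{
    unsigned int eax = 0, ebx, ecx = 0, edx;
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
    (void)eax; (void)ebx; (void)ecx; (void)edx;
}

static uint64_t rdtsc_after_cpuid(void)
{
    cpuid_barrier();
    return __rdtsc();
}

static uint64_t rdtsc_after_lfence(void)
{
    _mm_lfence();
    return __rdtsc();
}

/* Smallest gap, in TSC ticks, between two back-to-back reads made
 * with the given reader. This approximates the cost of one
 * barrier-plus-read sequence. */
static uint64_t min_gap(uint64_t (*read_tsc)(void), int runs)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t a = read_tsc();
        uint64_t b = read_tsc();
        if (b - a < best) {
            best = b - a;
        }
    }
    return best;
}
```

Comparing min_gap(rdtsc_after_cpuid, 1000) with min_gap(rdtsc_after_lfence, 1000) shows how much each barrier costs on a given system. The gap for cpuid can be especially large inside virtual machines, where the instruction is commonly intercepted by the hypervisor.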

We didn’t consider the serialize instruction because it is only available on Intel processors and missing on AMD. The instruction is purely there for serializing the instruction stream, so it is a good option. Alas, it is not portable.
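If you still want to use serialize where it exists, you can detect it at runtime. A minimal sketch, assuming GCC or Clang and their cpuid.h header: support is reported in CPUID leaf 7, subleaf 0, bit 14 of EDX. The function name cpu_has_serialize is made up for this example.

```c
#include <cpuid.h>
#include <stdbool.h>

/* Returns true if the CPU reports support for the serialize
 * instruction (CPUID leaf 7, subleaf 0, EDX bit 14). */
static bool cpu_has_serialize(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid_count returns 0 when the requested leaf is unsupported. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        return false;
    }
    return ((edx >> 14) & 1u) != 0;
}
```

A benchmark harness could check this once at startup and fall back to lfence or cpuid when it returns false.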

The lfence instruction

An alternative to using serializing instructions with rdtsc is to use memory ordering instructions, such as lfence, sfence, or mfence. These instructions add less overhead than true serializing instructions such as cpuid. Let’s understand how.

© 2025 Abhinav Upadhyay