A Systems Engineer’s Guide to Benchmarking with RDTSC
A deep dive into rdtsc, instruction stream serialization, and memory fences for precise cycle-level performance measurement.
Performance is critical for systems programmers, and accurate benchmarking is the foundation of meaningful optimization. To truly understand where your code spends time, you need precise and low-overhead measurements, especially when a piece of code may execute in just a few hundred CPU cycles.
Most developers reach for familiar high-level timers, such as Python’s time.perf_counter() or Java’s System.currentTimeMillis(). These are convenient, but they ultimately go through system facilities like clock_gettime, which can add tens to hundreds of cycles of overhead per call. When the code being measured itself runs in just a few hundred cycles, that overhead drowns out the signal. And when profiling production systems, you want the measurement overhead to be as small as possible.
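For reference, here is a sketch of the kind of timer those high-level APIs boil down to, using the standard POSIX clock_gettime with CLOCK_MONOTONIC (the helper name now_ns is mine):
#include <stdint.h>
#include <time.h>
// Timing with clock_gettime: convenient and portable, but each call
// goes through the vDSO/kernel path and costs far more than a bare rdtsc.
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}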
We need a way to read time directly from the hardware, without leaving user space. On x86 systems, that mechanism is the rdtsc instruction. It gives us near-zero-overhead access to the CPU’s internal timestamp counter, but using it correctly requires an understanding of how modern processors execute instructions.
In this article, we’ll learn how to use rdtsc for benchmarking. Specifically, we will cover the following topics in detail:
What rdtsc does: how it reads the CPU’s internal timestamp counter and why it provides near-zero-overhead timing.
Understanding CPU behavior: how out-of-order execution can distort timing results and why instruction ordering matters.
Instruction stream serialization: what it means, how the CPU reorders instructions, and how serializing instructions (like cpuid) enforce strict ordering.
Memory fences: how lfence, sfence, and mfence provide lighter-weight ordering guarantees that help isolate measurement code.
Combining it all: a practical example of using these mechanisms together to obtain stable and reproducible timing measurements.
By the end, you’ll know not only how to use rdtsc safely and accurately, but also why these extra steps are essential for meaningful microbenchmarking.
Understanding The Timestamp Counter in the CPU
In the x86 architecture, every CPU comes with a special 64-bit counter, called the timestamp counter (TSC), that is incremented at a fixed frequency. If you read the value of the counter before and after the execution of a block of code, you can tell precisely how many cycles that code took to execute.
When the counter overflows, it wraps around to 0. However, because it is a 64-bit counter, that takes an extremely long time. For instance, if the counter increments at 1 GHz, wrapping takes 2^64 / 10^9 ≈ 1.8 × 10^10 seconds, or roughly 585 years.
The frequency at which the timestamp counter increments is not the same as the actual CPU frequency. In the past the two were related, but once CPUs gained dynamic frequency scaling, the timestamp counter was made to tick at a fixed, constant rate so that measurements stay stable (this is the “invariant TSC” feature). For example, some of the cores on my laptop range from 800 MHz to 4800 MHz, but the TSC ticks at a constant 2.3 GHz.
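You can check whether your CPU advertises the invariant TSC from C. The minimal sketch below assumes GCC or Clang and their <cpuid.h> header; it queries CPUID leaf 0x80000007, where EDX bit 8 indicates an invariant TSC:
#include <cpuid.h>
#include <stdio.h>
int main(void) {
    unsigned int eax, ebx, ecx, edx;
    // CPUID leaf 0x80000007: Advanced Power Management information.
    // EDX bit 8 is the invariant-TSC flag.
    if (__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx) && (edx & (1u << 8))) {
        printf("Invariant TSC: yes\n");
    } else {
        printf("Invariant TSC: no (or leaf not supported)\n");
    }
    return 0;
}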
So, how do we read the TSC? The x86 instruction set provides two instructions for this: rdtsc and rdtscp. But measuring a block of code with them is not as simple as slapping rdtsc before and after the block; there is more to it. In practice, it looks like the following snippet:
#include <stdint.h>
#include <x86intrin.h>

uint32_t aux;
_mm_lfence();                  // make sure earlier instructions have finished
uint64_t start = __rdtsc();    // read the starting TSC value
for (int i = 0; i < ITERS; i++) {
    // expensive loop body
}
uint64_t end = __rdtscp(&aux); // read the ending TSC value (and the CPU id)
_mm_lfence();                  // keep later instructions from starting early
uint64_t ncycles = end - start;
In this snippet, I have used the GCC compiler intrinsics __rdtsc and __rdtscp to invoke the rdtsc and rdtscp instructions respectively. But you may ask: what is the significance of the _mm_lfence() calls before and after the measurement? And why did we use rdtsc to read the starting value of the TSC but rdtscp for the ending measurement? To answer these questions, we have to go deeper and think about how the processor executes instructions.
Out of Order Execution and Serializing Instructions
Let’s step back a bit and talk about how the CPU executes instructions.
Modern x86 CPUs execute the instruction stream out of order so that multiple instructions can run in parallel. They do this by looking at a window of instructions in the instruction stream, identifying independent ones, and executing them concurrently. As a result, an instruction that appears later in program order may execute much earlier than its predecessors.
For example, imagine an instruction stream as shown in the snippet below. Here, we are interested in measuring the time taken to execute instructions I4 to I6, so we have inserted an rdtsc instruction after I3 and after I6.
I0, I1, I2, I3, rdtsc, I4, I5, I6, rdtsc,...
Due to the out-of-order nature of instruction execution, we cannot guarantee that the rdtsc instructions will execute exactly where they appear. It is possible that the CPU executes the first rdtsc right after I1. In that case, our measurement will include the timing of I2 and I3 as well, which is not what we want.
We need a way to force the CPU not to execute rdtsc out of order, and to ensure that all previous instructions have finished executing by the time rdtsc runs. This can be achieved by forcing serialization of the instruction stream right before rdtsc. Let’s understand what that means.
Serializing the Instruction Stream
There are certain instructions in the x86 architecture that force serialization of the instruction stream. Essentially, a serializing instruction acts like a barrier: the CPU cannot execute it until all the instructions appearing before it in the program have finished, and it cannot begin executing any instruction appearing after the serializing instruction until the serializing instruction itself has finished.
To be precise, a serializing instruction also requires that all pending flag, register, and memory modifications complete before it executes, and that all buffered writes be drained to memory.
So, if we insert such a serializing instruction before rdtsc, we can guarantee that the processor will not execute the rdtsc instruction out of its program order.
There are a few such serializing instructions available in the x86 architecture, such as:
serialize: serializes the instruction stream
cpuid: identifies the CPU model and its features
iret: returns control from an interrupt handler back to the interrupted application
rsm: resumes execution from system management mode
Out of these, iret and rsm modify control flow, so you cannot use them solely for the purpose of serializing the instruction stream. In the past, cpuid was the recommended instruction to pair with rdtsc, and it is still an option today. However, it adds some overhead, because the CPU has real work to do when executing it beyond serializing the instruction stream.
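To make this concrete, here is a minimal sketch of the classic cpuid + rdtsc pairing, written as GCC-style inline assembly for x86-64 (the helper name rdtsc_cpuid is mine, and this is one illustration rather than the only correct form):
#include <stdint.h>
static inline uint64_t rdtsc_cpuid(void) {
    uint32_t lo, hi;
    __asm__ __volatile__(
        "xorl %%eax, %%eax\n\t"  // cpuid leaf 0
        "cpuid\n\t"              // fully serializing: all prior work retires
        "rdtsc\n\t"              // TSC -> EDX:EAX
        : "=a"(lo), "=d"(hi)
        :
        : "%rbx", "%rcx", "memory");
    return ((uint64_t)hi << 32) | lo;
}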
A much lighter-weight alternative is the lfence instruction that we saw in the snippet above. lfence is not a true serializing instruction but a memory-ordering instruction; however, it serves our purpose. Let’s understand what it does.
We didn’t consider the serialize instruction because it is only available on Intel processors and missing on AMD. The instruction exists purely to serialize the instruction stream, so it would otherwise be a good option. Alas, it is not portable.
The lfence instruction
An alternative to using serializing instructions with rdtsc is to use memory-ordering instructions: lfence, sfence, or mfence. These add less overhead than fully serializing instructions such as cpuid. Let’s understand how.
Just as the CPU has an instruction pipeline, there is a sort of pipeline for memory as well. When you write to a memory address, the write does not go to the L1 cache immediately. Instead, it first goes into a buffer in the CPU core called the store buffer (also known as the write buffer). At some point the buffer is drained and its entries are written to the L1 cache; at that point the cache coherence protocol kicks in, so that all the other CPU cores can see the latest value of that memory location. But there is a problem here.
The CPU core that performed the write can always read the latest value, thanks to store-to-load forwarding. However, until the store buffer is drained, other CPU cores are unaware of any modifications to that location. Combined with out-of-order execution, this means a store can effectively be reordered after a later load: the load executes while the store is still sitting in the buffer. For example, consider the following code executing on CPU-1 (x and y are shared variables, both initially 0):
x = 1;
r1 = y; // read y
And the following code executing on CPU-2:
y = 1;
r2 = x; // read x
It is entirely possible for both r1 and r2 to end up as 0: each CPU performs its load while its own store is still sitting in its store buffer, invisible to the other core. No sequential interleaving of the two fragments can produce that outcome, so the result may not be what the programmer expects. This is a race condition.
The solution is memory-ordering instructions, which order the loads and stores so that they execute and become globally visible in the right order. There are three such instructions on x86: lfence, sfence, and mfence.
lfence: enforces ordering on load instructions. The CPU ensures that all loads preceding the lfence finish before it executes, and no instruction appearing after the lfence may begin executing until the lfence completes. However, stores that were in flight before the lfence do not need to finish before it.
sfence: the store-side counterpart. All stores appearing before the sfence must complete, and the store buffer must be drained, before any store appearing after the sfence becomes globally visible. Unlike lfence, it does not hold back arbitrary later instructions, and loads that were in flight before the sfence do not need to finish before it.
mfence: orders both loads and stores. All memory operations preceding the mfence must complete and become globally visible before any memory operation following it.
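Applied to the earlier two-CPU example, a full fence between each store and the following load rules out the r1 == r2 == 0 outcome. A minimal sketch using the _mm_mfence() intrinsic (the function names are mine, for illustration):
#include <x86intrin.h>
volatile int x = 0, y = 0;
int r1, r2;
void cpu1_code(void) {
    x = 1;
    _mm_mfence(); // drain the store buffer before the load below
    r1 = y;
}
void cpu2_code(void) {
    y = 1;
    _mm_mfence();
    r2 = x;
}
// With the fences in place, at least one core must observe the
// other's store, so r1 == 0 && r2 == 0 can no longer happen.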
Among these fences, lfence in particular serializes the instruction stream in a limited way: no instruction after the fence can begin executing until the fence completes. The difference from a fully serializing instruction such as cpuid is that its conditions are relaxed, so it introduces much less overhead.
So this answers the question of why we used _mm_lfence() before the first rdtsc in our measurement code. The next question is: why did we use rdtscp to measure the end timing instead of rdtsc? Let’s dig into that.
Understanding rdtsc vs rdtscp
Let’s analyse what would happen if we used rdtsc to read the end value of the TSC. We again have the problem of out-of-order execution: the CPU may execute the second rdtsc before the rest of the instructions. In that case, our measurement would be wrong. The solution would be to add a serializing instruction before that rdtsc as well, such as cpuid or an lfence.
However, those instructions have their own overheads, so we would end up measuring the execution of I4, I5, and I6 plus the overhead of the serializing instruction itself, which has been found to make timing measurements highly unreliable.
The rdtscp instruction solves this. It differs slightly from rdtsc in what it does and how it is executed. Unlike rdtsc, rdtscp reads the TSC value and also the ID of the CPU on which it executed. It is not a fully serializing instruction, but the processor cannot execute rdtscp until all previous instructions have finished and all previous loads are globally visible. So by using rdtscp to read the end value of the TSC, we don’t need an explicit serializing instruction before it.
However, there is still a problem. An instruction appearing after rdtscp can still execute before rdtscp does, so we still need a barrier after it. This explains why we add an lfence after rdtscp.
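Packing these two rules into small helpers makes them hard to get wrong. A sketch (the helper names bench_start and bench_end are mine):
#include <stdint.h>
#include <x86intrin.h>
// Start of the measured region: fence first, then read the TSC.
static inline uint64_t bench_start(void) {
    _mm_lfence();
    return __rdtsc();
}
// End of the measured region: rdtscp waits for all prior instructions,
// and the trailing lfence keeps later ones from creeping in.
static inline uint64_t bench_end(uint32_t *aux) {
    uint64_t t = __rdtscp(aux);
    _mm_lfence();
    return t;
}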
A Complete Example: Measuring Code Execution with RDTSC
Now that we understand the theory behind rdtsc, serialization, and memory fences, let’s put it all together into a robust working example.
The Goal
We’ll measure the number of CPU cycles taken to sum the elements of an integer array. The point isn’t the computation itself, but to demonstrate a safe and repeatable way to benchmark any code fragment using rdtsc.
The Code
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <x86intrin.h>

#define ARRAY_SIZE 1000000
#define REPEAT 100

int main(void) {
    // Allocate and initialize an array
    int *array = malloc(ARRAY_SIZE * sizeof(int));
    if (array == NULL) {
        return 1;
    }
    for (int i = 0; i < ARRAY_SIZE; i++) {
        array[i] = i;
    }

    uint64_t total_cycles = 0;
    uint32_t aux;

    for (int iter = 0; iter < REPEAT; iter++) {
        // Serialize before measurement
        _mm_lfence();
        // Read starting value of TSC
        uint64_t start = __rdtsc();

        // volatile keeps the compiler from optimizing the loop away
        volatile long sum = 0;
        for (int i = 0; i < ARRAY_SIZE; i++) {
            sum += array[i];
        }

        // Read final value of TSC
        uint64_t end = __rdtscp(&aux);
        // Serialize after measurement
        _mm_lfence();

        total_cycles += (end - start);
    }

    printf("Average cycles per run: %llu\n",
           (unsigned long long)(total_cycles / REPEAT));
    free(array);
    return 0;
}
Compile this program with optimization enabled to minimize compiler-introduced noise:
gcc -O2 -march=native rdtsc_benchmark.c -o rdtsc_benchmark
Run it several times and you should see consistent cycle counts with small variations due to background system activity.
Things to Note
Serialization around rdtsc: The _mm_lfence() calls ensure that no prior or subsequent instructions are reordered across the timestamp reads. Without these fences, speculative or out-of-order execution could distort timing.
Using rdtscp for the end timestamp: As discussed earlier, rdtscp waits until all previous instructions retire, giving a cleaner boundary for the measured code. It also reads the CPU ID into aux, which can help detect whether your thread migrated to another core (more on that below).
Averaging multiple iterations: Running the loop multiple times and averaging the results reduces the influence of transient noise from interrupts, cache state, or OS scheduling.
However, there are a few more details that are worth knowing about.
Dealing with Core Migration
While this example works well in most cases, there’s one subtle issue you must be aware of: core migration.
On systems with multiple CPUs or cores, the TSC values are per-core counters. Modern systems typically maintain invariant TSC, meaning all cores increment at the same constant rate and stay synchronized, but that doesn’t make core migration harmless. Migration itself can incur significant overhead because of cache misses, pipeline refills, and scheduler delays. If a thread jumps to another core, the measured code might appear slower even though the difference comes from migration cost, not the code’s actual performance.
The rdtscp instruction helps by returning the CPU ID in the aux variable. If aux changes between the start and end measurements, your thread migrated between cores and the measurement is invalid. In such cases, simply discard that sample and rerun.
If you want to ensure measurements remain on the same core, call rdtscp at both the beginning and end of the measurement to capture the aux value from each read; comparing the two values lets you detect migration explicitly, as sketched below. Alternatively, you can pin your thread to a specific core using OS-level APIs like sched_setaffinity (Linux) to eliminate migration altogether.
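A sketch of both approaches (the helper names measure_once and pin_to_core are mine; error handling is elided):
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <x86intrin.h>
// Measure one sample; returns 0 and discards the result if the
// thread migrated between cores during the measurement.
static int measure_once(uint64_t *cycles) {
    uint32_t aux_start, aux_end;
    uint64_t start = __rdtscp(&aux_start); // also records the starting core id
    _mm_lfence();                          // keep measured code from starting early
    // ... code under test ...
    uint64_t end = __rdtscp(&aux_end);
    _mm_lfence();
    if (aux_start != aux_end)
        return 0; // thread migrated: sample invalid
    *cycles = end - start;
    return 1;
}
// Alternatively, pin the calling thread to one core up front (Linux):
static int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set); // pid 0 = calling thread
}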
Calibrating the Counter (Optional)
This example gives us measurements in cycles, not time. If you want to report timing in seconds, you need to know the frequency of the TSC, which you can obtain through /proc/cpuinfo or lscpu.
Then you can compute time as:
double seconds = cycles / (tsc_khz * 1000.0);
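If neither source reports the TSC rate directly, you can estimate it yourself by counting ticks across a known wall-clock interval. A rough sketch (the function name estimate_tsc_hz is mine):
#include <stdint.h>
#include <time.h>
#include <x86intrin.h>
// Count TSC ticks across a 100 ms sleep to estimate ticks per second.
// Crude, but adequate for converting cycle counts to seconds.
static double estimate_tsc_hz(void) {
    struct timespec ts = { 0, 100 * 1000 * 1000 }; // 100 ms
    uint64_t start = __rdtsc();
    nanosleep(&ts, NULL);
    uint64_t end = __rdtsc();
    return (double)(end - start) / 0.1;
}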
However, for relative benchmarking, comparing different implementations or micro-optimizations using raw cycle counts is usually sufficient.
Summary and Key Takeaways
Let’s recap the key insights from this deep dive into rdtsc and precise timing on x86 CPUs:
System timers aren’t enough: High-level timers like clock_gettime add too much overhead for microbenchmarks that execute in a few hundred cycles.
The Timestamp Counter (TSC): A 64-bit per-core counter that increments at a fixed rate, allowing direct hardware-level timing without kernel transitions.
Out-of-order execution challenges: CPUs may execute rdtsc instructions early or late, distorting timing results if not serialized.
Serialization and memory fences: Use lfence or cpuid to enforce correct ordering. lfence provides lower overhead and is widely supported.
Using rdtscp: Preferred for the end measurement because it waits for previous instructions to retire and can detect CPU migration via the aux value.
Core migration awareness: Even with invariant TSC, migration adds cache and pipeline overhead. Detect it using rdtscp or prevent it via CPU affinity.
Calibrating the counter: The TSC ticks in cycles; convert to seconds only if needed for absolute timing.