One Law to Rule All Code Optimizations
A systems-level reasoning model for understanding why optimizations succeed or fail.
“One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them.”
— J.R.R. Tolkien
Software optimizations are messy and often unpredictable. A change that should deliver a win sometimes doesn’t, and the reasons are rarely obvious. Is there a way to reason about this?
Maybe there is. In this article, I show you one law that explains all low-level code optimizations: when they work, and when they don’t. It’s based on the Iron Law of Performance, a model widely known in the hardware world but relatively obscure in software circles.
What we’ll see is that almost every low-level optimization, whether it's loop unrolling, SIMD vectorization, or branch elimination, ultimately affects just three metrics: the number of instructions executed, the number of cycles needed to execute them, and the duration of a single cycle. These are CPU-level metrics, and they directly determine how fast your code runs.
The Iron Law ties them together and gives us a unified reasoning model for software performance. I’ll walk through the original law, reinterpret it in a software context, and use it to explain common low-level optimizations in a more systematic way.
Of course, not all software optimizations fit into this model. Things like algorithmic improvements, contention removal, or language-level tuning (like garbage collection) lie outside its scope. I’m not claiming the Iron Law explains those.
What’s inside:
The Iron Law of Performance for software
Loop unrolling: reducing dynamic instructions
Function inlining: boosting IPC through linearization
SIMD vectorization: trading instruction count for complexity
Branch prediction: reducing pipeline flushes
Cache misses: backend stalls and instruction throughput
A reasoning framework to guide optimization decisions
Before You Read:
This article assumes some knowledge of CPU microarchitecture and of optimization techniques such as branch elimination and loop unrolling. If you are unfamiliar with these, I recommend the following two articles before reading this one:
The Iron Law of Performance (Hardware)
First, let’s start by understanding the Iron Law for hardware performance. It is a simple equation that models the performance of hardware executing a given program, which depends on three factors:
Number of instructions executed (also known as the dynamic instruction count)
Average number of cycles needed to execute those instructions (cycles per instruction or CPI)
Time taken to execute a single CPU cycle (clock cycle time)
The following equation defines the law:
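Execution Time = Instruction Count × CPI × Clock Cycle Time

Equivalently, Performance = 1 / (Instruction Count × CPI × Clock Cycle Time): shrinking any of the three factors makes the program run faster.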
CPU architects use this to analyze how an architectural change impacts the performance of the processor. For example, should they increase the depth of the instruction pipeline?
In a pipelined processor, an instruction moves from one pipeline stage to another in one cycle, so the cycle time is determined by the slowest pipeline stage. When you increase the pipeline depth, you break some of the stages down into more granular ones, reducing the work done in each stage and in turn reducing the cycle time. This means the processor can now execute more cycles per second.
However, increasing pipeline depth also raises the penalty of cache and branch misses. For example, accessing main memory still takes about 100 ns, which translates to 100 cycles at 1 GHz but doubles to about 200 cycles at 2 GHz when cycle time is halved. Likewise, deepening the pipeline from 15 to 20 stages also increases the branch misprediction penalty from ~15 to ~20 cycles.
These increased latencies and penalties make the average CPI go up as well. So, whether the pipeline depth should be increased, and by how much, depends on the overall tradeoff. The Iron Law gives a very simple framework for making these decisions.
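As a quick illustration with made-up numbers: suppose deepening the pipeline halves the cycle time, but the larger miss penalties push the average CPI from 1.0 up to 1.3. For the same instruction count, execution time becomes 1.3 × 0.5 = 0.65 of the original, roughly a 1.5x speedup rather than the 2x that the higher clock alone would suggest. If the CPI instead doubled, the clock gain would be cancelled out entirely.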
When you do low-level software performance optimizations, similar tradeoffs apply. Every optimization affects the program’s instruction count, cycles per instruction, and sometimes even the CPU clock frequency. So, it makes sense to apply the same model to analyze and reason about software-level optimizations as well. Let’s try to do that in the next few sections.
The Iron Law of Performance for Software
In the context of software, we can slightly tweak the law to the following form:
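Performance ∝ (IPC × Clock Frequency) / Instruction Count

When the clock frequency stays fixed, as it usually does, this simplifies to Performance ∝ IPC / Instruction Count.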
Here, we’ve replaced CPI (cycles per instruction) with IPC (instructions per cycle). Although they’re mathematical inverses, IPC is more intuitive for software engineers: for example, modern x86 processors can retire up to 4 instructions per cycle, so IPC gives a clearer sense of how close we are to peak throughput.
We’ve also relaxed the equality to a proportionality. When optimizing software, we’re not looking for exact numbers; rather, we’re reasoning about trade-offs.
So what role do these three terms play in software performance?
Increasing IPC means the CPU can retire more instructions per cycle, reducing total execution time.
Lowering the dynamic instruction count means fewer instructions need to be executed overall, which speeds up execution even if IPC stays the same.
Lowering the clock frequency, as sometimes happens with power-hungry instructions (e.g., AVX-512), increases cycle time and harms performance.
We’ll now apply this model to analyze several well-known optimizations: loop unrolling, function inlining, SIMD vectorization, branch elimination, and cache optimizations, and see how each one shifts the Iron Law variables.
Loop Unrolling
Loop unrolling is a classical optimization. Instead of executing one step of the loop body per iteration, you rewrite the loop to execute multiple steps per iteration. Consider the following loop that computes the sum of an integer array.
int sum = 0;
for (int i = 0; i < n; i++) {
    sum += arr[i];
}
If we unroll this loop four times, it will look like the following:
int sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
int i = 0;

// Process 4 elements at a time
for (; i + 3 < n; i += 4) {
    sum0 += arr[i];
    sum1 += arr[i + 1];
    sum2 += arr[i + 2];
    sum3 += arr[i + 3];
}

int sum = sum0 + sum1 + sum2 + sum3;

// Handle the remainder
for (; i < n; i++) {
    sum += arr[i];
}
Now, let’s reason about how such an optimization can improve the performance of a program and what tradeoffs to consider, i.e., in what situations it may not deliver an improvement.
Note: Usually, you don't have to unroll a loop yourself. The compiler does it when it can see that unrolling will deliver better performance. Sometimes, though, it won't, because it cannot guarantee program correctness or has limited knowledge of the code. So it is useful to be aware of the technique.
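If you would rather nudge the compiler than rewrite the loop by hand, GCC accepts unrolling hints; here is a minimal sketch (the unroll factor of 4 is just an illustration):

#pragma GCC unroll 4
for (int i = 0; i < n; i++) {
    sum += arr[i];
}

GCC can also unroll loops more aggressively across the whole program with the -funroll-loops flag.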
Impact on Instruction Count
For large n, unrolling reduces the dynamic instruction count. In the example shown above, the loop body executes three instructions: a comparison for the loop condition, an increment of the loop counter, and an update of the sum. So, the normal loop executes 3n instructions.
The 4x unrolled loop executes one loop comparison, one loop index increment, and four additions, i.e., six instructions per iteration, or 6n/4 = 1.5n instructions for an array of size n.
In Iron Law terms, we’ve driven down the instruction count by nearly 50% (for large n), all else equal. This sounds like an obvious performance win, but we also need to look at how it impacts the IPC.
Impact on IPC
Recall from our Iron Law that Performance ∝ IPC / Instruction Count. We’ve already reduced instruction count, so if we can raise IPC (or at least not lower it), net performance improves. Let’s see how unrolling affects IPC.
Increased Instruction Level Parallelism
The main advantage of loop unrolling is the potential increase in the instruction throughput of the program. The processor is capable of executing multiple instructions per cycle; for example, modern Intel processors can execute up to 4 instructions every cycle.
However, to achieve that kind of throughput, the processor needs enough independent instructions to execute, which is often not the case. Usually, instructions have dependencies between them: the result produced by one instruction is consumed by the next. Such instructions cannot be executed in parallel.
For example, consider the assembly (generated with GCC's -O1 flag) for the body of the normal for loop shown previously (without loop unrolling):
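The exact listing varies with the compiler version, but the loop body looks roughly like this:

.L3:
        addl    (%rax), %edx      # add arr[i] to the running sum in edx
        addq    $4, %rax          # advance rax to the next element, arr[i + 1]
        cmpq    %rcx, %rax        # have we reached the end of the array?
        jne     .L3               # if not, jump back to the top of the loop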
Let me explain what is going on:
The register edx holds the current value of the sum, and at every iteration the value of arr[i] gets added to it.
The register rax holds the address of the current array element, arr[i]. At the beginning of the loop iteration, the value of arr[i] gets added to the sum value in edx. Then, in the 2nd instruction, rax gets incremented by 4, so rax now contains the address of the next array element, arr[i + 1].
Finally, the last two instructions check whether we have reached the end of the array, and if not, jump back to the beginning of the loop.
So, for a large array, the CPU is going to be executing these four instructions for a while. Can it execute some of these in parallel so that the loop finishes faster? Not quite. The instructions have dependencies between them that stop the CPU from doing that.
Notice that the first addl instruction, which updates the sum in edx, depends on the previous iteration's edx and rax values, so the CPU can't issue the next iteration's addl until the previous iteration's addl and addq instructions have finished. In other words, there simply aren't any independent instructions for the CPU to execute in parallel.
Loop unrolling fixes this problem. The following assembly code is the loop body for the unrolled code shown previously.
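Again, the exact registers and offsets depend on the compiler, but the shape of the unrolled body is roughly this:

.L4:
        addl    (%rax), %edx      # sum0 += arr[i]
        addl    4(%rax), %edi     # sum1 += arr[i + 1]
        addl    8(%rax), %esi     # sum2 += arr[i + 2]
        addl    12(%rax), %ecx    # sum3 += arr[i + 3]
        addq    $16, %rax         # advance to the next group of four elements
        cmpq    %r8, %rax         # reached the end of the unrolled region?
        jne     .L4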
In the unrolled loop, the compiler assigns each partial sum to its own register (edx, edi, esi, ecx), so each addl instruction uses a different register and memory address. This makes the instructions independent, and the CPU can issue and execute them in parallel, reducing the number of cycles needed to finish the loop and improving the IPC.
In Iron Law terms, we’ve increased IPC from roughly 1.0 (due to dependencies) to perhaps ~3.0 or higher, depending on execution port availability. Combined with a nearly 50% drop in instruction count, that yields a significant net gain.
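You don't have to take these numbers on faith: on Linux, perf stat reports both Iron Law terms directly, so you can compare the two versions of the loop yourself (sum_array here is just a hypothetical test binary):

perf stat -e cycles,instructions ./sum_array

The instructions count shows the drop in dynamic instruction count, and the instructions-per-cycle figure printed next to it shows the change in IPC.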
Note: How many of these add instructions actually execute in parallel depends on how many functional units the CPU has for integer addition. So how much unrolling makes sense also depends on what kinds of instructions the loop body contains.
However, it isn’t all rosy. Loop unrolling can also hurt the IPC in other ways. Let’s see how.
Register Spills Due to Increased Loop Body Size
Unrolling a loop creates more local variables and increases the demand for registers. Unrolling too many times, or unrolling a large, complicated loop, can therefore cause register spills: when the compiler runs out of registers, it has to keep some variables on the stack instead.
When register spills happen, the instructions that read data from the stack instead of registers take longer to finish. While a value can be read from a register within a single cycle, reading from the stack can take 3-4 cycles (assuming an L1 cache hit).
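For instance, if a partial sum no longer fits in a register, the addition that used to be a single instruction turns into a load-modify-store sequence against a stack slot (a hypothetical illustration, not actual compiler output):

        movl    4(%rsp), %ecx     # reload the spilled partial sum from the stack
        addl    12(%rax), %ecx    # add arr[i + 3] to it
        movl    %ecx, 4(%rsp)     # write it back to its stack slot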
So, the operations that could be done in a single cycle will now take several cycles due to memory access. In such situations, if the CPU doesn’t have other instructions to execute, it will sit idle and waste resources. This increases the average cycles per instruction and lowers the IPC.
In Iron Law terms, register spills reduce IPC, which can partially or completely negate the instruction count reduction. You must weigh these together.
Note: It is not guaranteed that a register spill will necessarily drop the IPC, because sometimes the compiler can schedule the instructions better to keep the CPU busy while other instructions are stalled on memory access. But it is a tradeoff that you risk introducing when being too aggressive with this optimization.
Instruction Cache Pressure Due to Increased Loop Body Size
Another potential impact of unrolling is the increased code footprint, which can put pressure on the instruction cache. Modern instruction caches are large enough that unrolling a loop will rarely cause cache misses for the loop itself, but in larger systems where the instruction cache is already under pressure, the extra code may evict other hot instructions.
For example, most x86 cores have a 32-64 KB L1 instruction cache. For a loop body consisting of four instructions of about 4 bytes each, unrolling it 4 times grows the body from roughly 16 bytes to roughly 64 bytes, which is negligible.
So, in general this is not a huge concern for high-end CPUs. But we still need to be aware of the tradeoff from the Iron Law's perspective, because increased instruction cache misses reduce how fast the CPU frontend can deliver instructions to the backend for execution, thus lowering the IPC.
Loop Unrolling from the Lens of the Iron Law
Below is a summary of trade-offs when viewing loop unrolling through the Iron Law.
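Instruction count: goes down substantially (nearly 50% for the 4x unroll of our simple loop); this is the primary win.
IPC: usually goes up, because independent partial sums expose instruction-level parallelism; it can go down if aggressive unrolling causes register spills or instruction cache pressure.
Clock frequency: unaffected.
Net effect: a clear win for hot, simple loops with large trip counts; diminishing or negative returns when the loop is unrolled too far or the loop body is already large.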