<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Confessions of a Code Addict]]></title><description><![CDATA[Deep dives into compilers, performance optimization, Linux internals, and low-level programming. For engineers who love understanding systems at a fundamental level.]]></description><link>https://blog.codingconfessions.com</link><image><url>https://substackcdn.com/image/fetch/$s_!lstI!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png</url><title>Confessions of a Code Addict</title><link>https://blog.codingconfessions.com</link></image><generator>Substack</generator><lastBuildDate>Fri, 03 Apr 2026 17:15:38 GMT</lastBuildDate><atom:link href="https://blog.codingconfessions.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Abhinav Upadhyay]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[codeconfessions@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[codeconfessions@substack.com]]></itunes:email><itunes:name><![CDATA[Abhinav Upadhyay]]></itunes:name></itunes:owner><itunes:author><![CDATA[Abhinav Upadhyay]]></itunes:author><googleplay:owner><![CDATA[codeconfessions@substack.com]]></googleplay:owner><googleplay:email><![CDATA[codeconfessions@substack.com]]></googleplay:email><googleplay:author><![CDATA[Abhinav Upadhyay]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[How PyTorch Generates Random Numbers in Parallel on the GPU]]></title><description><![CDATA[A deep dive into Philox and counter-based 
RNGs]]></description><link>https://blog.codingconfessions.com/p/how-pytorch-generates-random-numbers</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/how-pytorch-generates-random-numbers</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Thu, 18 Dec 2025 10:26:29 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/46d5cb05-e44f-40a7-a292-3eda768af57d_1536x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>GPUs power modern deep learning models because these models rely on tensor operations, which parallelize efficiently across a GPU&#8217;s thousands of cores. However, apart from tensor computations, these models also rely on random numbers: for example, to initialize model weights and to drive dropout, data sampling, and stochastic gradient descent.</p><p>So, the question arises: how do frameworks like PyTorch generate random numbers in parallel on GPU devices? If random number generation becomes a bottleneck, it can significantly slow down the entire training or inference pipeline.</p><p>The answer lies in a clever algorithm called <strong>Philox</strong>, a counter-based parallel random number generator. 
In this article, we&#8217;ll explore:</p><ol><li><p>Why traditional random number generators don&#8217;t parallelize well</p></li><li><p>How Philox works and what makes it different</p></li><li><p>How to parallelize random number generation using Philox</p></li><li><p>PyTorch&#8217;s implementation of Philox by dissecting its C++ and CUDA code</p></li></ol><p>By the end, you&#8217;ll understand how that simple <code>torch.randn()</code> call efficiently generates millions of random numbers in parallel on your GPU while maintaining perfect reproducibility.</p>
<div><hr></div><h2>Problem with Traditional PRNGs</h2><p>Let&#8217;s start by developing an intuition about why traditional pseudo random number generators (PRNGs) are sequential and not suitable for parallel hardware, such as GPUs.</p><p>A PRNG needs to be able to reproduce the same sequence of random numbers when initialized with a specific seed. A natural way of achieving this is through a state transformation function that takes the current state of the generator as input and produces a new state. As long as the function is deterministic, it is guaranteed that we can reproduce the exact same sequence of numbers starting from the same initial state. Mathematically, it can be expressed like this:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{s}_{n+1} = \\text{f} \\text{(s}_n)&quot;,&quot;id&quot;:&quot;NURVVXHCKV&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, the next state is derived by applying the function <code>f</code> on the current state <code>s_n</code>. As you can see, this is a sequential model where you can&#8217;t jump ahead arbitrarily without computing all the previous states, and you can&#8217;t shard the generation of the random numbers by distributing the work across threads.</p><p>To parallelize the generation of random numbers, we need a different model where we can directly generate the nth random number without having to go through the generation of all the previous n-1 numbers. 
Mathematically, it should look like this:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;x_{n} = b(n)&quot;,&quot;id&quot;:&quot;DNCZCFZDRN&quot;}" data-component-name="LatexBlockToDOM"></div><p>Where <code>x_n</code> is the nth random number we wish to generate by applying a function <code>b</code>. Here, we can think of the input n as an integer counter and as such the PRNGs that follow this model are called counter-based random number generators. One such counter-based PRNG is the Philox PRNG, used widely in frameworks such as PyTorch for parallel random number generation on GPUs.</p><p>Let&#8217;s understand how Philox works.</p><div><hr></div><h2>How Philox Works</h2><p>The Philox algorithm, short for &#8220;Product HI, LOw, with XOR&#8221;, is a counter-based PRNG that was designed specifically for parallel computation. It was introduced by Salmon et al. in 2011 as part of the <a href="https://www.thesalmons.org/john/random123/papers/random123sc11.pdf">Random123 library</a>. 
The key insight behind Philox is that we can use a cipher-like construction to transform a counter into a pseudorandom number.</p><h3>The Core Idea: Treating RNG as Encryption</h3><p>We can think of the counter-based RNG problem this way: we want to take a sequence of integers (0, 1, 2, 3, &#8230;) and scramble them so thoroughly that they appear random. This is conceptually similar to what a <a href="https://en.wikipedia.org/wiki/Block_cipher">block cipher</a> does in cryptography: it takes a plaintext message and a key, then produces a ciphertext that looks random.</p><p>In Philox&#8217;s case:</p><ul><li><p>The <strong>counter</strong> (n) acts like the plaintext</p></li><li><p>The <strong>seed</strong> acts like the encryption key</p></li><li><p>The output is our pseudorandom number</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!67va!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64290860-d8d0-4060-8f8a-c82c3a586563_559x220.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!67va!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64290860-d8d0-4060-8f8a-c82c3a586563_559x220.png 424w, https://substackcdn.com/image/fetch/$s_!67va!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64290860-d8d0-4060-8f8a-c82c3a586563_559x220.png 848w, https://substackcdn.com/image/fetch/$s_!67va!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64290860-d8d0-4060-8f8a-c82c3a586563_559x220.png 1272w, 
https://substackcdn.com/image/fetch/$s_!67va!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64290860-d8d0-4060-8f8a-c82c3a586563_559x220.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!67va!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64290860-d8d0-4060-8f8a-c82c3a586563_559x220.png" width="559" height="220" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64290860-d8d0-4060-8f8a-c82c3a586563_559x220.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:220,&quot;width&quot;:559,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:13904,&quot;alt&quot;:&quot;Philox takes a counter and a key (derived from the seed) as its input and produces a random number as its output&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/181426624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64290860-d8d0-4060-8f8a-c82c3a586563_559x220.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Philox takes a counter and a key (derived from the seed) as its input and produces a random number as its output" title="Philox takes a counter and a key (derived from the seed) as its input and produces a random number as its output" srcset="https://substackcdn.com/image/fetch/$s_!67va!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64290860-d8d0-4060-8f8a-c82c3a586563_559x220.png 424w, 
https://substackcdn.com/image/fetch/$s_!67va!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64290860-d8d0-4060-8f8a-c82c3a586563_559x220.png 848w, https://substackcdn.com/image/fetch/$s_!67va!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64290860-d8d0-4060-8f8a-c82c3a586563_559x220.png 1272w, https://substackcdn.com/image/fetch/$s_!67va!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64290860-d8d0-4060-8f8a-c82c3a586563_559x220.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Philox takes a counter and a key (derived from the seed) as its input and produces a random number as its output</figcaption></figure></div><p>The beauty of this approach is that any thread can independently compute its random number by knowing just two things: which counter value it needs (its position in the sequence) and the seed. No synchronization or communication with other threads is needed.</p><h3>The Philox Construction</h3><p>Philox operates on fixed-size inputs and outputs. The most common variant is <strong>Philox-4x32</strong>, which means:</p><ul><li><p><strong>4</strong>: Works with four 32-bit integers at a time</p></li><li><p><strong>32</strong>: Each integer is 32 bits wide</p></li></ul><p>So Philox-4x32 takes a 128-bit counter (represented as four 32-bit integers) together with a 64-bit key (two 32-bit integers) and produces a 128-bit output (four 32-bit random numbers). This is perfect for generating multiple random numbers at once, which is common in GPU workloads.</p><p>The algorithm consists of applying multiple <strong>rounds</strong> of a transformation function. 
Each round performs these operations:</p><ol><li><p><strong>Multiplication and splitting</strong>: Multiply two of the input integers by fixed constants and split each result into high and low parts</p></li><li><p><strong>XOR with keys</strong>: XOR the high parts with key-derived values and with the remaining inputs</p></li><li><p><strong>Permutation</strong>: Shuffle the positions of the integers</p></li></ol><p>Let&#8217;s break down a single round in detail. Philox-4x32 works with four 32-bit values, which we&#8217;ll call (<em>c</em><sub>0</sub>, <em>c</em><sub>1</sub>, <em>c</em><sub>2</sub>, <em>c</em><sub>3</sub>). Each round transforms these values through the following steps:</p><h4><strong>Step 1: Multiply and Split</strong></h4><p>Take the first pair (<em>c</em><sub>0</sub>, <em>c</em><sub>1</sub>) and the second pair (<em>c</em><sub>2</sub>, <em>c</em><sub>3</sub>). Multiply the first element of each pair by a carefully chosen constant:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\\text{prod}_0 &amp;= M_0 \\times c_0 \\\\\n\\text{prod}_1 &amp;= M_1 \\times c_2\n\\end{align}\n&quot;,&quot;id&quot;:&quot;MBYYRCZSMB&quot;}" data-component-name="LatexBlockToDOM"></div><p>For Philox-4x32, these constants are:</p><ul><li><p><em>M</em><sub>0</sub> = 0xD2511F53</p></li><li><p><em>M</em><sub>1</sub> = 0xCD9E8D57</p></li></ul><p>These constants were chosen through careful analysis to ensure good statistical properties. When we multiply two 32-bit numbers, we get a 64-bit result. 
We split this into:</p><ul><li><p><strong>High 32 bits</strong>: hi(prod)</p></li><li><p><strong>Low 32 bits</strong>: lo(prod)</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!esaX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06cdfa72-4c2f-4d77-b65b-81226f1aaecf_223x420.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!esaX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06cdfa72-4c2f-4d77-b65b-81226f1aaecf_223x420.png 424w, https://substackcdn.com/image/fetch/$s_!esaX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06cdfa72-4c2f-4d77-b65b-81226f1aaecf_223x420.png 848w, https://substackcdn.com/image/fetch/$s_!esaX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06cdfa72-4c2f-4d77-b65b-81226f1aaecf_223x420.png 1272w, https://substackcdn.com/image/fetch/$s_!esaX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06cdfa72-4c2f-4d77-b65b-81226f1aaecf_223x420.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!esaX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06cdfa72-4c2f-4d77-b65b-81226f1aaecf_223x420.png" width="223" height="420" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/06cdfa72-4c2f-4d77-b65b-81226f1aaecf_223x420.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:420,&quot;width&quot;:223,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:12749,&quot;alt&quot;:&quot;The multiplication of two 32-bit values c0 and M0 produces a 64-bit result which is split into hi and lo parts&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/181426624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06cdfa72-4c2f-4d77-b65b-81226f1aaecf_223x420.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The multiplication of two 32-bit values c0 and M0 produces a 64-bit result which is split into hi and lo parts" title="The multiplication of two 32-bit values c0 and M0 produces a 64-bit result which is split into hi and lo parts" srcset="https://substackcdn.com/image/fetch/$s_!esaX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06cdfa72-4c2f-4d77-b65b-81226f1aaecf_223x420.png 424w, https://substackcdn.com/image/fetch/$s_!esaX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06cdfa72-4c2f-4d77-b65b-81226f1aaecf_223x420.png 848w, https://substackcdn.com/image/fetch/$s_!esaX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06cdfa72-4c2f-4d77-b65b-81226f1aaecf_223x420.png 1272w, 
https://substackcdn.com/image/fetch/$s_!esaX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06cdfa72-4c2f-4d77-b65b-81226f1aaecf_223x420.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The multiplication of two 32-bit values c0 and M0 produces a 64-bit result which is split into hi and lo parts</figcaption></figure></div><h4><strong>Step 2: XOR with Keys</strong></h4><p>The high parts are XORed with round-specific keys derived from the seed, and with the other input values:</p><div class="latex-rendered" 
data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\nh_0 &amp;= \\text{hi}(\\text{prod}_1) \\oplus c_1 \\oplus k_0 \\\\\nh_1 &amp;= \\text{hi}(\\text{prod}_0) \\oplus c_3 \\oplus k_1\n\\end{align}\n&quot;,&quot;id&quot;:&quot;BXVSGVGXMN&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, <em>k</em><sub>0</sub> and <em>k</em><sub>1</sub> are the round keys (derived from the seed), and &#8853; represents the XOR operation. Note the crossover: the high half of each product is combined with the unmultiplied value from the <em>other</em> pair, which is how the two halves of the state influence each other.</p><h4><strong>Step 3: Permutation</strong></h4><p>Finally, we rearrange the values for the next round. The output of one round becomes:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;(c_0', c_1', c_2', c_3') = (h_0, \\text{lo}(\\text{prod}_1), h_1, \\text{lo}(\\text{prod}_0))&quot;,&quot;id&quot;:&quot;YKICCLBFGZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Notice how the values are shuffled: the XORed high parts go to positions 0 and 2, while the low parts of the products are swapped and go to positions 1 and 3.</p><h4><strong>Multiple Rounds</strong></h4><p>To achieve good randomness, Philox-4x32 typically applies <strong>10 rounds</strong> of this transformation. After each round except the last, the keys are also updated, with 32-bit wraparound:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\nk_0' &amp;= (k_0 + w_0) \\bmod 2^{32} \\\\\nk_1' &amp;= (k_1 + w_1) \\bmod 2^{32}\n\\end{align}&quot;,&quot;id&quot;:&quot;EBSFTVKMRD&quot;}" data-component-name="LatexBlockToDOM"></div><p>Where <em>w</em><sub>0</sub> = 0x9E3779B9 and <em>w</em><sub>1</sub> = 0xBB67AE85 are the &#8220;<a href="https://en.wikipedia.org/wiki/Weyl_sequence">Weyl sequence</a>&#8221; constants, derived from the golden ratio and from &#8730;3 &#8722; 1, respectively. 
This ensures that each round uses different key material, increasing the mixing of the input bits.</p><h3>Visualizing a Complete Philox Transformation</h3><p>The following diagram shows the complete flow through multiple rounds:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WoQ6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e565a9-aa32-4665-980d-09e524512bf3_515x980.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WoQ6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e565a9-aa32-4665-980d-09e524512bf3_515x980.png 424w, https://substackcdn.com/image/fetch/$s_!WoQ6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e565a9-aa32-4665-980d-09e524512bf3_515x980.png 848w, https://substackcdn.com/image/fetch/$s_!WoQ6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e565a9-aa32-4665-980d-09e524512bf3_515x980.png 1272w, https://substackcdn.com/image/fetch/$s_!WoQ6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e565a9-aa32-4665-980d-09e524512bf3_515x980.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WoQ6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e565a9-aa32-4665-980d-09e524512bf3_515x980.png" width="515" height="980" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/83e565a9-aa32-4665-980d-09e524512bf3_515x980.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:980,&quot;width&quot;:515,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:95821,&quot;alt&quot;:&quot;The complete Philox transformation across multiple rounds producing four 32-bit random integers&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/181426624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e565a9-aa32-4665-980d-09e524512bf3_515x980.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The complete Philox transformation across multiple rounds producing four 32-bit random integers" title="The complete Philox transformation across multiple rounds producing four 32-bit random integers" srcset="https://substackcdn.com/image/fetch/$s_!WoQ6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e565a9-aa32-4665-980d-09e524512bf3_515x980.png 424w, https://substackcdn.com/image/fetch/$s_!WoQ6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e565a9-aa32-4665-980d-09e524512bf3_515x980.png 848w, https://substackcdn.com/image/fetch/$s_!WoQ6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e565a9-aa32-4665-980d-09e524512bf3_515x980.png 1272w, 
https://substackcdn.com/image/fetch/$s_!WoQ6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e565a9-aa32-4665-980d-09e524512bf3_515x980.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The complete Philox transformation across multiple rounds producing four 32-bit random integers</figcaption></figure></div><h3>Why This Works</h3><p>The Philox algorithm achieves good randomness through several mechanisms:</p><ol><li><p><strong>Multiplication</strong> is a non-linear operation that mixes bits effectively. 
Small changes in the input lead to large changes in the output.</p></li><li><p><strong>High-low splitting</strong> ensures we use all 64 bits of the multiplication result, not just the lower 32 bits.</p></li><li><p><strong>XOR operations</strong> combine different data streams (keys, previous values) in a way that&#8217;s invertible but unpredictable without knowing the key.</p></li><li><p><strong>Permutation</strong> ensures that the mixing effect propagates to all output positions across rounds.</p></li><li><p><strong>Multiple rounds</strong> compound these effects, ensuring that every output bit depends on every input bit in a complex way.</p></li></ol><p>The algorithm has been extensively tested and passes standard statistical tests for randomness such as those in the TestU01 suite, making it suitable for scientific computing and machine learning applications.</p><h3>Properties of Philox</h3><p>Before we dive into PyTorch&#8217;s implementation, let&#8217;s summarize the key properties that make Philox attractive:</p><ul><li><p><strong>Parallel-friendly</strong>: A GPU with thousands of cores can generate thousands of random numbers simultaneously, with each thread using a different counter value.</p></li><li><p><strong>Deterministic</strong>: Given the same seed and counter, you always get the same output.</p></li><li><p><strong>Long period</strong>: A 128-bit counter takes 2^128 distinct values, each yielding four 32-bit numbers, before the sequence repeats. That is more than enough for any practical application.</p></li><li><p><strong>Fast</strong>: The operations (multiplication, XOR, bit shifting) are primitives that run very efficiently on modern CPUs and GPUs.</p></li><li><p><strong>Memory efficient</strong>: The generator state is just the counter and key, requiring minimal storage per thread.</p></li></ul><p>Next, let&#8217;s understand how Philox can be parallelized.</p><div><hr></div><h2>Parallelizing Philox: Subsequences and Offsets</h2><p>Now that we understand how the Philox algorithm 
works, let&#8217;s explore what makes it particularly powerful for parallel computing: the ability to generate random numbers across thousands of threads simultaneously without any coordination.</p><h3>The Random Number Space</h3><p>Recall that Philox is a counter-based PRNG. At its core, it&#8217;s a function that maps a 128-bit counter to a 128-bit random output:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Philox}(\\text{counter}, \\text{key}) \\rightarrow \\text{random_output}&quot;,&quot;id&quot;:&quot;KUAOYTFBMM&quot;}" data-component-name="LatexBlockToDOM"></div><p>Given a fixed key (derived from the seed), each unique counter value produces a unique set of random numbers. Since we have a 128-bit counter, we have:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;2^{128} \\approx 3.4 \\times 10^{38} \\text{ possible counter values}&quot;,&quot;id&quot;:&quot;HGAJILCZMR&quot;}" data-component-name="LatexBlockToDOM"></div><p>Each counter value produces 4 random 32-bit numbers (since 128 bits = 4 &#215; 32 bits), giving us an enormous space of random numbers. We can visualize this as a huge one-dimensional array:</p><pre><code><code>
Counter: 0 1 2 3 ... 2^128-1

&#8595; &#8595; &#8595; &#8595; &#8595;

Output: [r&#8320;,r&#8321;,r&#8322;,r&#8323;][r&#8324;,r&#8325;,r&#8326;,r&#8327;][r&#8328;,r&#8329;,r&#8321;&#8320;,r&#8321;&#8321;][r&#8321;&#8322;,...]...[...]
</code></code></pre><p>How do we partition this massive space across parallel threads? One approach is to split the counter space between the threads.</p><h3>Partitioning the Counter Space</h3><p>The key insight is that we can split the 128-bit counter into two parts and use them to create a 2D address space. Think of the counter as having 4 components of 32 bits each: (<em>c</em>&#8320;, <em>c</em>&#8321;, <em>c</em>&#8322;, <em>c</em>&#8323;).</p><p>We can partition this as:</p><ul><li><p><strong>Upper 64 bits</strong> (<em>c</em>&#8322;, <em>c</em>&#8323;): Which thread&#8217;s region we&#8217;re in</p></li><li><p><strong>Lower 64 bits</strong> (<em>c</em>&#8320;, <em>c</em>&#8321;): The position within a thread&#8217;s assigned region</p></li></ul><p>This partitioning scheme gives each thread its own &#8220;slice&#8221; of the random number space:</p><ul><li><p><strong>Thread 0</strong> gets counters (&#8727;,&#8727;,0,0), where each &#8727; can be any value:</p><ul><li><p>counter = (0,0,0,0) &#8594; first 4 random numbers for thread 0</p></li><li><p>counter = (1,0,0,0) &#8594; next 4 random numbers for thread 0</p></li><li><p>counter = (2,0,0,0) &#8594; next 4 random numbers for thread 0</p></li><li><p>&#8230;</p></li></ul></li><li><p><strong>Thread 1</strong> gets counters (&#8727;,&#8727;,1,0):</p><ul><li><p>counter = (0,0,1,0) &#8594; first 4 random numbers for thread 1</p></li><li><p>counter = (1,0,1,0) &#8594; next 4 random numbers for thread 1</p></li><li><p>counter = (2,0,1,0) &#8594; next 4 random numbers for thread 1</p></li><li><p>&#8230;</p></li></ul></li><li><p><strong>Thread 2</strong> gets counters (&#8727;,&#8727;,2,0):</p><ul><li><p>counter = (0,0,2,0) &#8594; first 4 random numbers for thread 2</p></li><li><p>And so on&#8230;</p></li></ul></li></ul><h3>Terminology: Subsequence and Offset</h3><p>We now give names to these two parts:</p><p><strong>Subsequence</strong>: The upper 64 bits of the counter. This identifies which parallel thread or stream we&#8217;re referring to. 
We can have up to 2^64 different subsequences running in parallel.</p><p><strong>Offset</strong>: The lower 64 bits of the counter. This identifies the position within a subsequence. Each subsequence can generate up to 2^64 batches of four random numbers.</p><p>Together, they form a coordinate system (<em>s</em>,<em>o</em>) where:</p><ul><li><p><em>s</em> is the subsequence (which parallel stream)</p></li><li><p><em>o</em> is the offset (position in that stream)</p></li></ul><p>The total capacity is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;2^{64} \\text{ subsequences} \\times 2^{64} \\text{ offsets per subsequence} = 2^{128} \\text{ total positions}\n\n&quot;,&quot;id&quot;:&quot;THGEKREXMA&quot;}" data-component-name="LatexBlockToDOM"></div><p>This exactly matches the size of our original counter space; we&#8217;ve simply reorganized it into a 2D structure that&#8217;s easy to partition across threads.</p><h3>How Offsets Increment</h3><p>When a thread generates more random numbers, it increments the offset portion of the counter. Since Philox generates 4 random numbers at once, the offset advances by 1 for every batch of 4 numbers:</p><pre><code><code>
Thread 0, subsequence = 0:

offset=0: counter=[0,0,0,0] &#8594; Philox &#8594; [rand&#8320;, rand&#8321;, rand&#8322;, rand&#8323;]

offset=1: counter=[1,0,0,0] &#8594; Philox &#8594; [rand&#8324;, rand&#8325;, rand&#8326;, rand&#8327;]

offset=2: counter=[2,0,0,0] &#8594; Philox &#8594; [rand&#8328;, rand&#8329;, rand&#8321;&#8320;, rand&#8321;&#8321;]

...
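</code></code></pre><p>To make the bookkeeping in this diagram concrete, here is a minimal sketch that maps a flat random-number index to the offset that produces it and to the position inside that batch of four. The names <code>Position</code> and <code>locate</code> are invented for this illustration and are not part of PyTorch; the fixed-width integer types are assumed to come from the standard <code>cstdint</code> header.</p><pre><code><code>
// Hypothetical helper for illustration only (not PyTorch code).
struct Position {
  uint64_t offset;  // which batch of four: the offset part of the counter
  uint32_t lane;    // which of the four outputs inside that batch (0-3)
};

inline Position locate(uint64_t index) {
  Position p;
  p.offset = index / 4;          // each offset value covers four numbers
  p.lane = uint32_t(index % 4);  // remainder selects the output within the batch
  return p;
}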
</code></code></pre><p>The offset is really tracking &#8220;which batch of 4&#8221; we&#8217;re on. If we need the 10th random number (index 9, counting from 0):</p><ul><li><p>Offset = &#8970;9/4&#8971; = 2</p></li><li><p>Position within batch = 9 mod 4 = 1</p></li><li><p>So we use counter [2,0,0,0] and take the second output (index 1)</p></li></ul><h3>The Power of Skip-Ahead</h3><p>One powerful consequence of this design is <strong>skip-ahead</strong>: a thread can jump directly to any offset without computing intermediate values.</p><pre><code><code>
Thread 0:

- Jump to offset 1,000,000: counter = [1000000, 0, 0, 0]

- Generate random numbers at this position

- Jump to offset 5,000,000: counter = [5000000, 0, 0, 0]

- No need to compute offsets 1 through 4,999,999!
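</code></code></pre><p>As a rough sketch of what skip-ahead means in code, the snippet below contrasts a toy sequential generator, which has to be stepped once per position, with a counter that is simply written down for the position we want. The linear congruential generator is only a stand-in for a sequential PRNG, and the names <code>lcg_advance</code>, <code>Counter</code> and <code>counter_at</code> are hypothetical; none of this is taken from PyTorch. The integer types are assumed to come from <code>cstdint</code>.</p><pre><code><code>
// Toy sequential generator: reaching position n by stepping costs n updates.
inline uint64_t lcg_advance(uint64_t state, uint64_t steps) {
  for (uint64_t i = 0; i != steps; ++i) {
    state = state * 6364136223846793005ULL + 1442695040888963407ULL;
  }
  return state;
}

// Counter-based layout: the position is the counter, so a jump is an assignment.
struct Counter {
  uint32_t c[4];  // c[0], c[1]: offset; c[2], c[3]: subsequence
};

inline Counter counter_at(uint64_t offset) {
  Counter ctr;
  ctr.c[0] = uint32_t(offset);        // low 32 bits of the offset
  ctr.c[1] = uint32_t(offset >> 32);  // high 32 bits of the offset
  ctr.c[2] = 0;                       // subsequence 0 (thread 0 in the example above)
  ctr.c[3] = 0;
  return ctr;
}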

</code></code></pre><p>Skip-ahead of this kind is not directly available in traditional sequential PRNGs, where state <em>n</em>+1 depends on state <em>n</em>: reaching a distant position means stepping through the earlier states or running a separate jump-ahead routine.</p><h3>Setting Up for PyTorch</h3><p>Now that we understand how the counter space is partitioned, we can see how PyTorch uses it. When PyTorch generates random numbers on a GPU:</p><ol><li><p>It launches many threads (e.g., 1024 threads)</p></li><li><p>Each thread is assigned a unique <strong>subsequence</strong> number (typically its thread ID)</p></li><li><p>Each thread starts at <strong>offset</strong> 0 within its subsequence</p></li><li><p>As each thread generates random numbers, it increments its offset</p></li><li><p>PyTorch tracks the global offset to ensure future operations don&#8217;t reuse the same counters</p></li></ol><p>With this foundation, let&#8217;s now explore how PyTorch implements these concepts in its Philox engine.</p><div><hr></div><h2>Philox Implementation in PyTorch</h2><p>PyTorch uses Philox-4x32-10 (4 values of 32 bits, 10 rounds) as its primary PRNG for CUDA operations. The implementation lives in <a href="https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/core/PhiloxRNGEngine.h">aten/src/ATen/core/PhiloxRNGEngine.h</a> and is designed to work on both CPU and GPU (via CUDA). Let&#8217;s dissect this implementation to understand how the theoretical concepts we discussed earlier translate into actual code.</p><h3>Core Data Structures</h3><p>The implementation starts by defining some type aliases for clarity:</p><pre><code><code>
typedef std::array&lt;uint32_t, 4&gt; UINT4; // Four 32-bit integers

typedef std::array&lt;uint32_t, 2&gt; UINT2; // Two 32-bit integers

typedef std::array&lt;double, 2&gt; DOUBLE2; // Two doubles

typedef std::array&lt;float, 2&gt; FLOAT2; // Two floats

</code></code></pre><p>These typedefs make the code more readable. <code>UINT4</code> represents the 128-bit counter or output (4 &#215; 32 bits = 128 bits), while <code>UINT2</code> represents the 64-bit key (2 &#215; 32 bits = 64 bits).</p><h3>The PhiloxEngine Class Structure</h3><p>The <code>philox_engine</code> class maintains four critical pieces of state:</p><pre><code><code>
private:

detail::UINT4 counter_; // 128-bit counter (c&#8320;, c&#8321;, c&#8322;, c&#8323;)
detail::UINT4 output_; // Cached output from last round
detail::UINT2 key_; // 64-bit key derived from seed (k&#8320;, k&#8321;)
uint32_t STATE; // Position in current output (0-3)
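</code></code></pre><p>The comments on <code>output_</code> and <code>STATE</code> hint at a caching scheme: compute four values at once, hand them out one per call, and recompute only when the batch is used up. The following is a simplified sketch of that pattern, not the actual <code>philox_engine</code> code. The <code>CachedEngine</code> type is hypothetical, and its <code>refill</code> method writes placeholder values where the real engine would run the Philox rounds; the integer types are assumed to come from <code>cstdint</code>.</p><pre><code><code>
// Illustrative sketch only; refill() is a stand-in for the Philox rounds.
struct CachedEngine {
  uint32_t output[4];  // cached batch of four values
  uint32_t state;      // index of the next cached value to return (0-3)
  uint64_t offset;     // stands in for the offset part of the counter

  void refill() {
    for (uint32_t i = 0; i != 4; ++i) {
      output[i] = uint32_t(offset * 4 + i);  // placeholder values, NOT random
    }
    ++offset;  // move on to the next batch
  }

  uint32_t next() {
    if (state == 0) {
      refill();  // cache is empty or fully consumed
    }
    uint32_t value = output[state];
    state = (state + 1) % 4;  // wraps 3 back to 0, which triggers the next refill
    return value;
  }
};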
</code></code></pre><p>Let&#8217;s understand each field:</p><p><code>counter_</code>: This is the 128-bit counter that gets incremented and transformed through the Philox rounds. It&#8217;s divided into four 32-bit components:</p><ul><li><p><code>counter_[0]</code> and <code>counter_[1]</code>: Lower 64 bits represent the <strong>offset</strong> (which random number in the subsequence)</p></li><li><p><code>counter_[2]</code> and <code>counter_[3]</code>: Upper 64 bits represent the <strong>subsequence</strong> (which parallel stream)</p></li></ul><p><code>key_</code>: The 64-bit key derived from the seed. This remains constant for a given seed and is used in the XOR operations during each round.</p><p><code>output_</code>: Philox generates 4 random 32-bit numbers at once. This field caches those numbers so we don&#8217;t have to recompute them for every call.</p><p><code>STATE</code>: A simple counter (0-3) that tracks which of the four cached output values to return next. This is an optimization to avoid regenerating when we have unused random numbers.</p><h3>Initialization and State Management</h3><p>The constructor initializes the engine with a seed, subsequence, and offset:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Tizd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b98853b-f9b0-4e54-9d41-0b8c01ca24f6_888x220.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Tizd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b98853b-f9b0-4e54-9d41-0b8c01ca24f6_888x220.png 424w, 
https://substackcdn.com/image/fetch/$s_!Tizd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b98853b-f9b0-4e54-9d41-0b8c01ca24f6_888x220.png 848w, https://substackcdn.com/image/fetch/$s_!Tizd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b98853b-f9b0-4e54-9d41-0b8c01ca24f6_888x220.png 1272w, https://substackcdn.com/image/fetch/$s_!Tizd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b98853b-f9b0-4e54-9d41-0b8c01ca24f6_888x220.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Tizd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b98853b-f9b0-4e54-9d41-0b8c01ca24f6_888x220.png" width="888" height="220" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3b98853b-f9b0-4e54-9d41-0b8c01ca24f6_888x220.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:220,&quot;width&quot;:888,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:34316,&quot;alt&quot;:&quot;The philox_engine constructor definition&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/181426624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b98853b-f9b0-4e54-9d41-0b8c01ca24f6_888x220.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The philox_engine constructor definition" title="The philox_engine constructor definition" 
srcset="https://substackcdn.com/image/fetch/$s_!Tizd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b98853b-f9b0-4e54-9d41-0b8c01ca24f6_888x220.png 424w, https://substackcdn.com/image/fetch/$s_!Tizd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b98853b-f9b0-4e54-9d41-0b8c01ca24f6_888x220.png 848w, https://substackcdn.com/image/fetch/$s_!Tizd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b98853b-f9b0-4e54-9d41-0b8c01ca24f6_888x220.png 1272w, https://substackcdn.com/image/fetch/$s_!Tizd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b98853b-f9b0-4e54-9d41-0b8c01ca24f6_888x220.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The philox_engine constructor definition</figcaption></figure></div><p>The <code>C10_HOST_DEVICE</code> macro is crucial here: it tells the compiler that this function can run on both the CPU (host) and GPU (device). 
This allows the same code to be used in both contexts.</p><p>Let&#8217;s look at how <code>reset_state</code> sets up the initial state:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Us5U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a2d726-2b10-4d76-aacc-d1d764a78467_1302x245.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Us5U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a2d726-2b10-4d76-aacc-d1d764a78467_1302x245.png 424w, https://substackcdn.com/image/fetch/$s_!Us5U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a2d726-2b10-4d76-aacc-d1d764a78467_1302x245.png 848w, https://substackcdn.com/image/fetch/$s_!Us5U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a2d726-2b10-4d76-aacc-d1d764a78467_1302x245.png 1272w, https://substackcdn.com/image/fetch/$s_!Us5U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a2d726-2b10-4d76-aacc-d1d764a78467_1302x245.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Us5U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a2d726-2b10-4d76-aacc-d1d764a78467_1302x245.png" width="1302" height="245" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e1a2d726-2b10-4d76-aacc-d1d764a78467_1302x245.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:245,&quot;width&quot;:1302,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:90887,&quot;alt&quot;:&quot;The reset_state function that resets the state of the philox_engine&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/181426624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a2d726-2b10-4d76-aacc-d1d764a78467_1302x245.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The reset_state function that resets the state of the philox_engine" title="The reset_state function that resets the state of the philox_engine" srcset="https://substackcdn.com/image/fetch/$s_!Us5U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a2d726-2b10-4d76-aacc-d1d764a78467_1302x245.png 424w, https://substackcdn.com/image/fetch/$s_!Us5U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a2d726-2b10-4d76-aacc-d1d764a78467_1302x245.png 848w, https://substackcdn.com/image/fetch/$s_!Us5U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a2d726-2b10-4d76-aacc-d1d764a78467_1302x245.png 1272w, https://substackcdn.com/image/fetch/$s_!Us5U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a2d726-2b10-4d76-aacc-d1d764a78467_1302x245.png 1456w" sizes="100vw" 
loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The reset_state function that resets the state of the philox_engine</figcaption></figure></div><p>This initialization strategy is clever:</p><ol><li><p>The <strong>seed</strong> is split into the two key components <code>key_[0]</code> and <code>key_[1]</code></p></li><li><p>The <strong>subsequence</strong> goes into the upper half of the counter (<code>counter_[2]</code> and <code>counter_[3]</code>)</p></li><li><p>The <strong>offset</strong> (lower half of counter) starts at zero but can be set later via <code>incr_n(offset)</code></p></li></ol><p>This design allows for massive parallelism. Imagine running 1024 CUDA threads simultaneously:</p><pre><code><code>
Thread 0: subsequence=0, offset=0 &#8594; counter = [0, 0, 0, 0]

Thread 1: subsequence=1, offset=0 &#8594; counter = [0, 0, 1, 0]

Thread 2: subsequence=2, offset=0 &#8594; counter = [0, 0, 2, 0]

...

Thread 1023: subsequence=1023, offset=0 &#8594; counter = [0, 0, 1023, 0]
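</code></code></pre><p>The initialization described above can be sketched in a few lines. This is an approximation written from the description in this section, not a copy of PyTorch&#8217;s <code>reset_state</code>; in particular, which half of the seed lands in which key slot is an assumption. The <code>EngineState</code> struct and the <code>init_state</code> function are hypothetical names, and the integer types are assumed to come from <code>cstdint</code>.</p><pre><code><code>
// Sketch of the seed/subsequence layout described above (not PyTorch code).
struct EngineState {
  uint32_t key[2];      // 64-bit key taken from the seed
  uint32_t counter[4];  // [0], [1]: offset; [2], [3]: subsequence
};

inline EngineState init_state(uint64_t seed, uint64_t subsequence) {
  EngineState s;
  s.key[0] = uint32_t(seed);                   // low half of the seed (assumed order)
  s.key[1] = uint32_t(seed >> 32);             // high half of the seed
  s.counter[0] = 0;                            // offset starts at zero
  s.counter[1] = 0;
  s.counter[2] = uint32_t(subsequence);        // low half of the subsequence
  s.counter[3] = uint32_t(subsequence >> 32);  // high half of the subsequence
  return s;
}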

</code></code></pre><p>Each thread has a unique counter value from the start, so they all generate independent random sequences without any coordination.</p><h3>The Core Algorithm: Single Round</h3><p>Now let&#8217;s examine the heart of the Philox algorithm&#8212;the <code>single_round</code> function:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VusU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd1ca64-b049-4283-a360-82fd0bd2bc36_1031x320.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VusU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd1ca64-b049-4283-a360-82fd0bd2bc36_1031x320.png 424w, https://substackcdn.com/image/fetch/$s_!VusU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd1ca64-b049-4283-a360-82fd0bd2bc36_1031x320.png 848w, https://substackcdn.com/image/fetch/$s_!VusU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd1ca64-b049-4283-a360-82fd0bd2bc36_1031x320.png 1272w, https://substackcdn.com/image/fetch/$s_!VusU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd1ca64-b049-4283-a360-82fd0bd2bc36_1031x320.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VusU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd1ca64-b049-4283-a360-82fd0bd2bc36_1031x320.png" width="1031" height="320" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3bd1ca64-b049-4283-a360-82fd0bd2bc36_1031x320.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:320,&quot;width&quot;:1031,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:63721,&quot;alt&quot;:&quot;The single_round function that implements one round of Philox&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/181426624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd1ca64-b049-4283-a360-82fd0bd2bc36_1031x320.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The single_round function that implements one round of Philox" title="The single_round function that implements one round of Philox" srcset="https://substackcdn.com/image/fetch/$s_!VusU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd1ca64-b049-4283-a360-82fd0bd2bc36_1031x320.png 424w, https://substackcdn.com/image/fetch/$s_!VusU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd1ca64-b049-4283-a360-82fd0bd2bc36_1031x320.png 848w, https://substackcdn.com/image/fetch/$s_!VusU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd1ca64-b049-4283-a360-82fd0bd2bc36_1031x320.png 1272w, https://substackcdn.com/image/fetch/$s_!VusU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd1ca64-b049-4283-a360-82fd0bd2bc36_1031x320.png 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"></div></div></a><figcaption class="image-caption">The single_round function that implements one round of Philox</figcaption></figure></div><p>Let&#8217;s break this down step by step, mapping it to our earlier theoretical description:</p><h4><strong>Step 1: Multiply and Split</strong></h4><pre><code><code>uint32_t lo0 = mulhilo32(kPhiloxSA, ctr[0], &amp;hi0);
uint32_t lo1 = mulhilo32(kPhiloxSB, ctr[2], &amp;hi1);</code></code></pre><p>Here we multiply:</p><ul><li><p><code>ctr[0]</code> by <code>kPhiloxSA</code> (the constant 0xD2511F53)</p></li><li><p><code>ctr[2]</code> by <code>kPhiloxSB</code> (the constant 0xCD9E8D57)</p></li></ul><p>The <code>mulhilo32</code> function performs the multiplication and splits the 64-bit result:</p><ul><li><p>Returns the low 32 bits (<code>lo0</code> or <code>lo1</code>)</p></li><li><p>Stores the high 32 bits in the passed pointer (<code>hi0</code> or <code>hi1</code>)</p></li></ul><p>Let&#8217;s look at <code>mulhilo32</code> itself:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gxWg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03b1c24-95c0-4332-a1b9-3b8a70684da8_756x320.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gxWg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03b1c24-95c0-4332-a1b9-3b8a70684da8_756x320.png 424w, https://substackcdn.com/image/fetch/$s_!gxWg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03b1c24-95c0-4332-a1b9-3b8a70684da8_756x320.png 848w, https://substackcdn.com/image/fetch/$s_!gxWg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03b1c24-95c0-4332-a1b9-3b8a70684da8_756x320.png 1272w, https://substackcdn.com/image/fetch/$s_!gxWg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03b1c24-95c0-4332-a1b9-3b8a70684da8_756x320.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!gxWg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03b1c24-95c0-4332-a1b9-3b8a70684da8_756x320.png" width="756" height="320" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a03b1c24-95c0-4332-a1b9-3b8a70684da8_756x320.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:320,&quot;width&quot;:756,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:55814,&quot;alt&quot;:&quot;The definition of the mulhilo32 function&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/181426624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03b1c24-95c0-4332-a1b9-3b8a70684da8_756x320.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The definition of the mulhilo32 function" title="The definition of the mulhilo32 function" srcset="https://substackcdn.com/image/fetch/$s_!gxWg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03b1c24-95c0-4332-a1b9-3b8a70684da8_756x320.png 424w, https://substackcdn.com/image/fetch/$s_!gxWg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03b1c24-95c0-4332-a1b9-3b8a70684da8_756x320.png 848w, https://substackcdn.com/image/fetch/$s_!gxWg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03b1c24-95c0-4332-a1b9-3b8a70684da8_756x320.png 1272w, 
https://substackcdn.com/image/fetch/$s_!gxWg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03b1c24-95c0-4332-a1b9-3b8a70684da8_756x320.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The definition of the mulhilo32 function</figcaption></figure></div><p>This function has two implementations:</p><p><strong>On CUDA (GPU)</strong>: Uses the intrinsic <code>__umulhi</code> which directly computes the high 32 bits of a multiplication. 
This is extremely fast on GPU hardware.</p><p><strong>On CPU</strong>: Promotes both operands to 64 bits, multiplies them, then extracts high and low parts manually via shifting and casting.</p><p>Here&#8217;s what happens mathematically:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\n\\text{prod}_0 &amp;= \\text{kPhiloxSA} \\times \\text{ctr}[0] = \\text{0xD2511F53} \\times \\text{ctr}[0] \\\\\n\n\\text{lo}_0 &amp;= \\text{prod}_0\\text{ } \\And \\text{ 0xFFFFFFFF} \\quad \\text{(lower 32 bits)} \\\\\n\n\\text{hi}_0 &amp;= \\text{prod}_0 \\gg 32 \\quad \\text{(upper 32 bits)}\n\n\\end{align}\n\n&quot;,&quot;id&quot;:&quot;QLOQSBLHON&quot;}" data-component-name="LatexBlockToDOM"></div><h4><strong>Step 2: XOR and Permute</strong></h4><pre><code><code>ret[0] = hi1 ^ ctr[1] ^ in_key[0];
ret[1] = lo1;
ret[2] = hi0 ^ ctr[3] ^ in_key[1];
ret[3] = lo0;</code></code></pre><p>Notice the pattern:</p><ul><li><p><code>ret[0]</code>: Takes <code>hi1</code> (high bits from second multiplication), XORs with <code>ctr[1]</code> and <code>in_key[0]</code></p></li><li><p><code>ret[1]</code>: Simply uses <code>lo1</code> (low bits from second multiplication)</p></li><li><p><code>ret[2]</code>: Takes <code>hi0</code> (high bits from first multiplication), XORs with <code>ctr[3]</code> and <code>in_key[1]</code></p></li><li><p><code>ret[3]</code>: Simply uses <code>lo0</code> (low bits from first multiplication)</p></li></ul><p>Let us visualize this transformation:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9Tmh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2715325-738d-4d65-b2cc-0b77958801b2_679x595.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9Tmh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2715325-738d-4d65-b2cc-0b77958801b2_679x595.png 424w, https://substackcdn.com/image/fetch/$s_!9Tmh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2715325-738d-4d65-b2cc-0b77958801b2_679x595.png 848w, https://substackcdn.com/image/fetch/$s_!9Tmh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2715325-738d-4d65-b2cc-0b77958801b2_679x595.png 1272w, https://substackcdn.com/image/fetch/$s_!9Tmh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2715325-738d-4d65-b2cc-0b77958801b2_679x595.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!9Tmh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2715325-738d-4d65-b2cc-0b77958801b2_679x595.png" width="679" height="595" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a2715325-738d-4d65-b2cc-0b77958801b2_679x595.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:595,&quot;width&quot;:679,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:46641,&quot;alt&quot;:&quot;Visualization of the operations performed during a single round of Philox&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/181426624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2715325-738d-4d65-b2cc-0b77958801b2_679x595.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Visualization of the operations performed during a single round of Philox" title="Visualization of the operations performed during a single round of Philox" srcset="https://substackcdn.com/image/fetch/$s_!9Tmh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2715325-738d-4d65-b2cc-0b77958801b2_679x595.png 424w, https://substackcdn.com/image/fetch/$s_!9Tmh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2715325-738d-4d65-b2cc-0b77958801b2_679x595.png 848w, https://substackcdn.com/image/fetch/$s_!9Tmh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2715325-738d-4d65-b2cc-0b77958801b2_679x595.png 1272w, 
https://substackcdn.com/image/fetch/$s_!9Tmh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2715325-738d-4d65-b2cc-0b77958801b2_679x595.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Visualization of the operations performed during a single round of Philox</figcaption></figure></div><p>This permutation ensures that bits from different positions get mixed together in subsequent rounds.</p><h3>Constants: The Magic Numbers</h3><p>You might wonder where these constants come from:</p><pre><code><code>
static const uint32_t kPhilox10A = 0x9E3779B9; // Weyl sequence
static const uint32_t kPhilox10B = 0xBB67AE85; // Weyl sequence
static const uint32_t kPhiloxSA = 0xD2511F53; // Multiplier
static const uint32_t kPhiloxSB = 0xCD9E8D57; // Multiplier
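
// Illustrative sketch, not part of the PyTorch source: bump_key_sketch is a
// hypothetical helper showing how the two Weyl constants above advance the
// 64-bit key (held as two 32-bit words) between rounds.
inline void bump_key_sketch(uint32_t key[2]) {
  key[0] += kPhilox10A; // unsigned addition wraps modulo 2^32
  key[1] += kPhilox10B;
}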

</code></code></pre><p><strong>Weyl sequence constants</strong> (<code>kPhilox10A</code> and <code>kPhilox10B</code>): Each is derived from an irrational number, the golden ratio &#966; for <code>kPhilox10A</code> and &#8730;3 &#8722; 1 for <code>kPhilox10B</code>. The constants are:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\n\\text{kPhilox10A} &amp;= \\lfloor 2^{32} / \\phi \\rfloor = \\text{0x9E3779B9} \\\\\n\n\\text{kPhilox10B} &amp;= \\lfloor 2^{32} \\cdot (\\sqrt{3} - 1) \\rfloor = \\text{0xBB67AE85}\n\n\\end{align}\n\n&quot;,&quot;id&quot;:&quot;ZZGYZVPQKU&quot;}" data-component-name="LatexBlockToDOM"></div><p>Successive multiples of an irrational fraction like these, taken modulo 2^32, spread evenly over the 32-bit range, which makes them useful for distributing values uniformly. These constants are added to the key after each round to ensure different key material is used.</p><p><strong>Multiplier constants</strong> (<code>kPhiloxSA</code> and <code>kPhiloxSB</code>): These were carefully chosen through empirical testing to maximize statistical quality. They need to have good bit-mixing properties when multiplied with typical counter values.</p><h3>Running Multiple Rounds</h3><p>The <code>rand</code> function orchestrates running all rounds:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4l43!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7760085-67fc-4b51-ace0-ba4dde21f671_1207x220.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4l43!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7760085-67fc-4b51-ace0-ba4dde21f671_1207x220.png 424w, https://substackcdn.com/image/fetch/$s_!4l43!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7760085-67fc-4b51-ace0-ba4dde21f671_1207x220.png 848w, 
https://substackcdn.com/image/fetch/$s_!4l43!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7760085-67fc-4b51-ace0-ba4dde21f671_1207x220.png 1272w, https://substackcdn.com/image/fetch/$s_!4l43!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7760085-67fc-4b51-ace0-ba4dde21f671_1207x220.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4l43!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7760085-67fc-4b51-ace0-ba4dde21f671_1207x220.png" width="1207" height="220" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f7760085-67fc-4b51-ace0-ba4dde21f671_1207x220.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:220,&quot;width&quot;:1207,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:44022,&quot;alt&quot;:&quot;Definition of the rand function that applies multiple rounds of Philox to produce random numbers&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/181426624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7760085-67fc-4b51-ace0-ba4dde21f671_1207x220.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Definition of the rand function that applies multiple rounds of Philox to produce random numbers" title="Definition of the rand function that applies multiple rounds of Philox to produce random numbers" 
srcset="https://substackcdn.com/image/fetch/$s_!4l43!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7760085-67fc-4b51-ace0-ba4dde21f671_1207x220.png 424w, https://substackcdn.com/image/fetch/$s_!4l43!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7760085-67fc-4b51-ace0-ba4dde21f671_1207x220.png 848w, https://substackcdn.com/image/fetch/$s_!4l43!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7760085-67fc-4b51-ace0-ba4dde21f671_1207x220.png 1272w, https://substackcdn.com/image/fetch/$s_!4l43!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7760085-67fc-4b51-ace0-ba4dde21f671_1207x220.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Definition of the rand function that applies multiple rounds of Philox to produce random numbers</figcaption></figure></div><p>This is straightforward:</p><ol><li><p>Run <code>n_rounds - 1</code> iterations where we:</p><ol><li><p>Apply <code>single_round</code> to transform the counter</p></li><li><p>Update the key by adding the Weyl constants</p></li></ol></li><li><p>Apply one final round without updating the key</p></li></ol><p>By default, PyTorch uses 10 rounds (<code>n_rounds = 10</code>), which provides a good balance between performance and statistical quality.</p><h3>Generating Random Numbers: The Operator</h3><p>The operator <code>()</code> is what users call to get random numbers:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!Q0Tp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9847528b-a753-4441-9f78-af680fa5b649_1141x395.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q0Tp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9847528b-a753-4441-9f78-af680fa5b649_1141x395.png 424w, https://substackcdn.com/image/fetch/$s_!Q0Tp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9847528b-a753-4441-9f78-af680fa5b649_1141x395.png 848w, https://substackcdn.com/image/fetch/$s_!Q0Tp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9847528b-a753-4441-9f78-af680fa5b649_1141x395.png 1272w, https://substackcdn.com/image/fetch/$s_!Q0Tp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9847528b-a753-4441-9f78-af680fa5b649_1141x395.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q0Tp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9847528b-a753-4441-9f78-af680fa5b649_1141x395.png" width="1141" height="395" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9847528b-a753-4441-9f78-af680fa5b649_1141x395.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:395,&quot;width&quot;:1141,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:66418,&quot;alt&quot;:&quot;Definition of the operator() that is called by users to generate random 
numbers&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/181426624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9847528b-a753-4441-9f78-af680fa5b649_1141x395.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Definition of the operator() that is called by users to generate random numbers" title="Definition of the operator() that is called by users to generate random numbers" srcset="https://substackcdn.com/image/fetch/$s_!Q0Tp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9847528b-a753-4441-9f78-af680fa5b649_1141x395.png 424w, https://substackcdn.com/image/fetch/$s_!Q0Tp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9847528b-a753-4441-9f78-af680fa5b649_1141x395.png 848w, https://substackcdn.com/image/fetch/$s_!Q0Tp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9847528b-a753-4441-9f78-af680fa5b649_1141x395.png 1272w, https://substackcdn.com/image/fetch/$s_!Q0Tp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9847528b-a753-4441-9f78-af680fa5b649_1141x395.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Definition of the operator() that is called by users to generate random numbers</figcaption></figure></div><p>This function is clever in its efficiency:</p><p><strong>Check if we need new random numbers</strong>: <code>if(STATE == 0)</code> checks if we&#8217;ve exhausted the previous batch. 
Remember, <code>STATE</code> cycles through 0, 1, 2, 3.</p><p><strong>Generate a batch</strong>: When needed, it:</p><ul><li><p>Runs the full Philox algorithm via <code>rand(counter, key, n_rounds)</code></p></li><li><p>Stores the result in <code>output_</code> (four 32-bit random numbers)</p></li><li><p>Increments the counter for next time via <code>incr()</code></p></li></ul><p><strong>Return next value</strong>: Grab the current position from <code>output_</code>, then advance <code>STATE</code>.</p><p>The line <code>STATE = (STATE + 1) &amp; 3</code> is a bit trick equivalent to <code>STATE = (STATE + 1) % 4</code>, using bitwise AND since 3 is binary <code>11</code>.</p><p>This batching strategy is a significant performance optimization. Instead of running Philox for every random number, we run it once per four random numbers.</p><h3>Counter Increment Logic</h3><p>The counter increment operations deserve special attention because they handle the 128-bit arithmetic correctly. Let&#8217;s start with the simple case:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bN-H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416d677c-6f56-4432-a083-b5eede24fa16_679x345.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bN-H!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416d677c-6f56-4432-a083-b5eede24fa16_679x345.png 424w, https://substackcdn.com/image/fetch/$s_!bN-H!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416d677c-6f56-4432-a083-b5eede24fa16_679x345.png 848w, 
https://substackcdn.com/image/fetch/$s_!bN-H!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416d677c-6f56-4432-a083-b5eede24fa16_679x345.png 1272w, https://substackcdn.com/image/fetch/$s_!bN-H!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416d677c-6f56-4432-a083-b5eede24fa16_679x345.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bN-H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416d677c-6f56-4432-a083-b5eede24fa16_679x345.png" width="679" height="345" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/416d677c-6f56-4432-a083-b5eede24fa16_679x345.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:345,&quot;width&quot;:679,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43264,&quot;alt&quot;:&quot;Definition of the incr function that increments the counter&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/181426624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416d677c-6f56-4432-a083-b5eede24fa16_679x345.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Definition of the incr function that increments the counter" title="Definition of the incr function that increments the counter" srcset="https://substackcdn.com/image/fetch/$s_!bN-H!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416d677c-6f56-4432-a083-b5eede24fa16_679x345.png 424w, 
https://substackcdn.com/image/fetch/$s_!bN-H!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416d677c-6f56-4432-a083-b5eede24fa16_679x345.png 848w, https://substackcdn.com/image/fetch/$s_!bN-H!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416d677c-6f56-4432-a083-b5eede24fa16_679x345.png 1272w, https://substackcdn.com/image/fetch/$s_!bN-H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416d677c-6f56-4432-a083-b5eede24fa16_679x345.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Definition of the incr function that increments the counter</figcaption></figure></div><p>This increments the 128-bit counter by 1. The logic is:</p><ol><li><p>Increment <code>counter_[0]</code> (least significant 32 bits)</p></li><li><p>If it&#8217;s non-zero after increment, we&#8217;re done (no overflow)</p></li><li><p>If it overflowed to zero, carry to <code>counter_[1]</code></p></li><li><p>Continue propagating carries until we find a non-zero result</p></li></ol><p>The more complex function is <code>incr_n</code>, which increments by an arbitrary 64-bit value:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aoqZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff086c0b6-5824-4427-8780-d6cf28397830_690x845.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aoqZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff086c0b6-5824-4427-8780-d6cf28397830_690x845.png 424w, https://substackcdn.com/image/fetch/$s_!aoqZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff086c0b6-5824-4427-8780-d6cf28397830_690x845.png 848w, https://substackcdn.com/image/fetch/$s_!aoqZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff086c0b6-5824-4427-8780-d6cf28397830_690x845.png 1272w, 
https://substackcdn.com/image/fetch/$s_!aoqZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff086c0b6-5824-4427-8780-d6cf28397830_690x845.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aoqZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff086c0b6-5824-4427-8780-d6cf28397830_690x845.png" width="690" height="845" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f086c0b6-5824-4427-8780-d6cf28397830_690x845.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:845,&quot;width&quot;:690,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:125482,&quot;alt&quot;:&quot;Definition of incr_n function that increments the counter by an arbitrary 64-bit value&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/181426624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff086c0b6-5824-4427-8780-d6cf28397830_690x845.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Definition of incr_n function that increments the counter by an arbitrary 64-bit value" title="Definition of incr_n function that increments the counter by an arbitrary 64-bit value" srcset="https://substackcdn.com/image/fetch/$s_!aoqZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff086c0b6-5824-4427-8780-d6cf28397830_690x845.png 424w, 
https://substackcdn.com/image/fetch/$s_!aoqZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff086c0b6-5824-4427-8780-d6cf28397830_690x845.png 848w, https://substackcdn.com/image/fetch/$s_!aoqZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff086c0b6-5824-4427-8780-d6cf28397830_690x845.png 1272w, https://substackcdn.com/image/fetch/$s_!aoqZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff086c0b6-5824-4427-8780-d6cf28397830_690x845.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Definition of incr_n function that increments the counter by an arbitrary 64-bit value</figcaption></figure></div><p>This function is more intricate because it needs to:</p><ol><li><p>Split the 64-bit increment <code>n</code> into <code>nlo</code> and <code>nhi</code></p></li><li><p>Add <code>nlo</code> to <code>counter_[0]</code></p></li><li><p>Detect overflow by checking if <code>counter_[0] &lt; nlo</code> (if the result is less than what we added, overflow occurred)</p></li><li><p>If overflow, increment <code>nhi</code> to carry over</p></li><li><p>Add <code>nhi</code> to <code>counter_[1]</code> and check for overflow again</p></li><li><p>If still overflowing, propagate to the upper 64 bits</p></li></ol><p>The overflow detection <code>counter_[0] &lt; nlo</code> is a standard technique in multi-precision arithmetic. After adding, if the result is less than one of the operands, an overflow must have occurred since we&#8217;re working with unsigned integers.</p><h3>Converting to Floating Point</h3><p>For machine learning applications, we often need floating-point random numbers in the range [0, 1), while Philox gives us integers. 
So, PyTorch applies a conversion function:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wsvE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23aebf6-cf3c-446c-bf27-0947d92c1221_877x145.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wsvE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23aebf6-cf3c-446c-bf27-0947d92c1221_877x145.png 424w, https://substackcdn.com/image/fetch/$s_!wsvE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23aebf6-cf3c-446c-bf27-0947d92c1221_877x145.png 848w, https://substackcdn.com/image/fetch/$s_!wsvE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23aebf6-cf3c-446c-bf27-0947d92c1221_877x145.png 1272w, https://substackcdn.com/image/fetch/$s_!wsvE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23aebf6-cf3c-446c-bf27-0947d92c1221_877x145.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wsvE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23aebf6-cf3c-446c-bf27-0947d92c1221_877x145.png" width="877" height="145" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f23aebf6-cf3c-446c-bf27-0947d92c1221_877x145.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:145,&quot;width&quot;:877,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:39753,&quot;alt&quot;:&quot;Definition 
of the uint32_to_uniform_float function that converts a 32-bit integer to a float value in the range [0,1)&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/181426624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23aebf6-cf3c-446c-bf27-0947d92c1221_877x145.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Definition of the uint32_to_uniform_float function that converts a 32-bit integer to a float value in the range [0,1)" title="Definition of the uint32_to_uniform_float function that converts a 32-bit integer to a float value in the range [0,1)" srcset="https://substackcdn.com/image/fetch/$s_!wsvE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23aebf6-cf3c-446c-bf27-0947d92c1221_877x145.png 424w, https://substackcdn.com/image/fetch/$s_!wsvE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23aebf6-cf3c-446c-bf27-0947d92c1221_877x145.png 848w, https://substackcdn.com/image/fetch/$s_!wsvE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23aebf6-cf3c-446c-bf27-0947d92c1221_877x145.png 1272w, https://substackcdn.com/image/fetch/$s_!wsvE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23aebf6-cf3c-446c-bf27-0947d92c1221_877x145.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Definition of the uint32_to_uniform_float function that converts a 32-bit integer to a float value in the range [0,1)</figcaption></figure></div><p>This function is 
carefully designed:</p><p><strong>Mask off sign bit</strong>: <code>value &amp; 0x7FFFFFFF</code> clears the highest bit, giving us values from 0 to 2^31&#8722;1</p><p><strong>Scale down</strong>: Multiplying by <code>scale = 4.6566127342e-10</code> maps these integers to floats in [0, 1).</p><p>The scale factor is calculated as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{scale} = \\frac{1}{2^{31}} \\approx 4.6566127342 \\times 10^{-10}&quot;,&quot;id&quot;:&quot;CXCXOBPBDG&quot;}" data-component-name="LatexBlockToDOM"></div><p>Why use only 31 bits instead of all 32? Because:</p><ol><li><p>We want only positive values (for [0, 1) range)</p></li><li><p>The highest representable float less than 1.0 needs careful handling</p></li><li><p>Using 31 bits avoids potential rounding issues near 1.0</p></li></ol><h3>Normal Distribution Generation</h3><p>The <code>randn</code> function generates normally distributed random numbers using the Box-Muller transform:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vzbk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2e2b8d-6d46-4c4f-a199-14bb538d230f_1009x419.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vzbk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2e2b8d-6d46-4c4f-a199-14bb538d230f_1009x419.png 424w, https://substackcdn.com/image/fetch/$s_!vzbk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2e2b8d-6d46-4c4f-a199-14bb538d230f_1009x419.png 848w, 
https://substackcdn.com/image/fetch/$s_!vzbk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2e2b8d-6d46-4c4f-a199-14bb538d230f_1009x419.png 1272w, https://substackcdn.com/image/fetch/$s_!vzbk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2e2b8d-6d46-4c4f-a199-14bb538d230f_1009x419.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vzbk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2e2b8d-6d46-4c4f-a199-14bb538d230f_1009x419.png" width="1009" height="419" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ef2e2b8d-6d46-4c4f-a199-14bb538d230f_1009x419.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:419,&quot;width&quot;:1009,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:84602,&quot;alt&quot;:&quot;Definition of the randn function that generates random numbers from a normal distribution&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/181426624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2e2b8d-6d46-4c4f-a199-14bb538d230f_1009x419.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Definition of the randn function that generates random numbers from a normal distribution" title="Definition of the randn function that generates random numbers from a normal distribution" 
srcset="https://substackcdn.com/image/fetch/$s_!vzbk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2e2b8d-6d46-4c4f-a199-14bb538d230f_1009x419.png 424w, https://substackcdn.com/image/fetch/$s_!vzbk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2e2b8d-6d46-4c4f-a199-14bb538d230f_1009x419.png 848w, https://substackcdn.com/image/fetch/$s_!vzbk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2e2b8d-6d46-4c4f-a199-14bb538d230f_1009x419.png 1272w, https://substackcdn.com/image/fetch/$s_!vzbk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2e2b8d-6d46-4c4f-a199-14bb538d230f_1009x419.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Definition of the randn function that generates random numbers from a normal distribution</figcaption></figure></div><p>The <a href="https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform">Box-Muller transform</a> converts two uniform random variables <em>U</em><sub>1</sub>, <em>U</em><sub>2</sub> &#8764; Uniform(0, 1) into a normal random variable <em>Z</em> &#8764; N(0, 1):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Z = \\sqrt{-2 \\ln U_1} \\cos(2\\pi U_2)&quot;,&quot;id&quot;:&quot;KHHARVVYDI&quot;}" data-component-name="LatexBlockToDOM"></div><h3>Memory Layout and Efficiency</h3><p>One of the beauties of this implementation is how compact the state is. Each <code>philox_engine</code> instance requires:</p><pre><code><code>
counter_: 4 &#215; 4 bytes = 16 bytes

output_: 4 &#215; 4 bytes = 16 bytes

key_: 2 &#215; 4 bytes = 8 bytes

STATE: 4 bytes = 4 bytes

Total = 44 bytes</code></code></pre><p>This is tiny! On a GPU, you could have millions of these generators running in parallel, each consuming only 44 bytes. In comparison, traditional RNGs can take kilobytes of state per instance; the Mersenne Twister (MT19937), for example, keeps about 2.5 KB.</p><div><hr></div><h2>Summary</h2><p>In this article, we explored Philox, a counter-based PRNG designed for parallel computing environments. We learned:</p><ol><li><p><strong>Why traditional PRNGs don&#8217;t parallelize well</strong>: Sequential state dependencies create bottlenecks on parallel hardware like GPUs.</p></li><li><p><strong>How Philox works</strong>: By treating random number generation as a function <code>f(counter, key)</code>, Philox allows direct computation of any random number without computing predecessors.</p></li><li><p><strong>The algorithm&#8217;s core operations</strong>: Multiplication with carefully chosen constants, high-low splitting, XOR with key material, and permutation, repeated for 10 rounds to ensure statistical quality.</p></li><li><p><strong>Parallelization through counter partitioning</strong>: The 128-bit counter space is split into subsequences (upper 64 bits) and offsets (lower 64 bits), allowing up to 2^64 parallel threads each generating 2^64 random numbers.</p></li><li><p><strong>PyTorch&#8217;s implementation</strong>: A compact 44-byte state per engine instance, efficient batching of 4 numbers at a time, and careful handling of counter arithmetic for both CPU and GPU execution.</p></li></ol>]]></content:encoded></item><item><title><![CDATA[x86 Addressing Modes, Part 1 — Immediate and Direct Access]]></title><description><![CDATA[The foundations of memory access: static allocation, addressing modes, and the first steps toward low-level thinking.]]></description><link>https://blog.codingconfessions.com/p/x86-addressing-modes-part-1-immediate</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/x86-addressing-modes-part-1-immediate</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Wed, 12 Nov 2025 16:15:23 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/352d073f-214a-42e8-8c75-795aef67a908_1536x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome back to our series on x86 assembly programming. 
If you are new, you can check out the series overview.</em></p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;903595d2-e4a6-487d-8e40-76f7e8dede5c&quot;,&quot;caption&quot;:&quot;Welcome to my ongoing series on x86-64 assembly programming, designed for programmers who want to peel back the abstraction and understand how code really runs at the machine level.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;A Programmer&#8217;s Guide to x86-64 Assembly (Series Overview)&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-07-16T05:14:34.997Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!pFGm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a94b5af-aec2-4b17-b011-5c128c67be8d_1536x1024.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/a-programmers-guide-to-x86-64-assembly&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:168445561,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:10,&quot;comment_count&quot;:2,&quot;publication_id&quot;:1611829,&quot;publication_name&quot;:&quot;Confessions of a Code 
Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!lstI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><p>So far, we have learned the fundamentals of instructions and registers in x86 assembly. But writing real-world programs requires memory access, so we must learn how to deal with memory. If you can master this topic, you level up as a programmer.</p><p>There are two kinds of memory where we can keep our program&#8217;s data: registers and main memory. We have already learned about <a href="https://blog.codingconfessions.com/p/x86-registers">using registers</a>; they are the fastest storage available in the hardware. But they are very limited in number, while real-world code needs much more memory than that.</p><p>Apart from that, registers can only hold primitive type values. The integer registers (the 16 general-purpose ones we learned about) handle integers, while separate floating-point registers exist in x86 for floating-point operations. However, we need a way to store and access composite types, such as arrays and structs, which is only possible using main memory.</p><p>Accessing memory in assembly is a big topic: there are several memory addressing modes, and learning to use each of them effectively is crucial for reading and writing assembly code. So, I am going to split this topic into a multipart series that covers each addressing mode step by step. 
In this first part, we will cover the following topics:</p><ul><li><p>Regions of memory in a process&#8217;s address space: stack, heap, and data</p></li><li><p>Immediate addressing mode</p></li><li><p>Direct addressing mode</p></li></ul><p>In future parts, we will cover the following:</p><ul><li><p>Indirect addressing mode</p></li><li><p>Offset-based addressing mode</p></li><li><p>Indexed addressing mode</p></li></ul><p>We&#8217;ll start by understanding how data is organized in memory before we explore addressing modes. Now, let&#8217;s dive in!</p><p><em>I&#8217;m also publishing this in the form of an ebook (PDF). If you don&#8217;t wish to upgrade to a subscription, you can purchase the PDF using the following link. If you are a paid subscriber, you can get it at a discount (monthly subs: 20% and annual subs: 50%). Please email me for the discounted link.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://codingconfessions.gumroad.com/l/ychdk&quot;,&quot;text&quot;:&quot;Purchase PDF&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://codingconfessions.gumroad.com/l/ychdk"><span>Purchase PDF</span></a></p><div><hr></div><h2>Regions of Memory in Process Address Space</h2><p>When programming in high-level languages, you have likely learned about the concept of scope, or the lifetime of a variable. For example, a global variable lives for the duration of the program, while local variables are automatically destroyed when the function returns. You can also dynamically allocate memory on the heap, which lives until it is freed.</p><p>When programming in assembly, we need similar scopes. However, there is no compiler to help us out, so we must do it ourselves. We can achieve these scopes by storing data in different regions of the process&#8217;s address space. 
So, we must start there.</p><p>There are three main regions in the process&#8217;s address space where you can decide to store your program&#8217;s data, as shown in the following diagram.  </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!D0tD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe352c8d9-efc1-4ae2-b388-1bc4dcc8a8b4_1062x586.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D0tD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe352c8d9-efc1-4ae2-b388-1bc4dcc8a8b4_1062x586.png 424w, https://substackcdn.com/image/fetch/$s_!D0tD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe352c8d9-efc1-4ae2-b388-1bc4dcc8a8b4_1062x586.png 848w, https://substackcdn.com/image/fetch/$s_!D0tD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe352c8d9-efc1-4ae2-b388-1bc4dcc8a8b4_1062x586.png 1272w, https://substackcdn.com/image/fetch/$s_!D0tD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe352c8d9-efc1-4ae2-b388-1bc4dcc8a8b4_1062x586.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!D0tD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe352c8d9-efc1-4ae2-b388-1bc4dcc8a8b4_1062x586.png" width="1062" height="586" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e352c8d9-efc1-4ae2-b388-1bc4dcc8a8b4_1062x586.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:586,&quot;width&quot;:1062,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:55007,&quot;alt&quot;:&quot;Key regions in the address space of a process: stack, heap, and data&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/161941599?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe352c8d9-efc1-4ae2-b388-1bc4dcc8a8b4_1062x586.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Key regions in the address space of a process: stack, heap, and data" title="Key regions in the address space of a process: stack, heap, and data" srcset="https://substackcdn.com/image/fetch/$s_!D0tD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe352c8d9-efc1-4ae2-b388-1bc4dcc8a8b4_1062x586.png 424w, https://substackcdn.com/image/fetch/$s_!D0tD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe352c8d9-efc1-4ae2-b388-1bc4dcc8a8b4_1062x586.png 848w, https://substackcdn.com/image/fetch/$s_!D0tD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe352c8d9-efc1-4ae2-b388-1bc4dcc8a8b4_1062x586.png 1272w, https://substackcdn.com/image/fetch/$s_!D0tD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe352c8d9-efc1-4ae2-b388-1bc4dcc8a8b4_1062x586.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption">Key regions in the address space of a process: stack, heap, and data</figcaption></figure></div><ul><li><p><strong>Stack segment:</strong> The <code>stack</code> segment is primarily used to implement function calls and to store function local data, such as variables and arguments. We will learn to use the stack when we talk about functions in assembly.</p></li><li><p><strong>Data Segment:</strong> The <code>data</code> segment is used to store static data. For example, whenever you create global variables or constants in your programs, the compiler may put them in the data segment. 
The advantage of the data segment is that it is embedded in the program binary and loaded at startup. As a result, there is no memory allocation overhead at runtime.</p></li><li><p><strong>Heap Segment:</strong> The <code>heap</code> segment is used for dynamic memory allocation at runtime. For example, we use it when growing an array or creating nodes for a tree or a linked list.</p></li></ul><p>In this article, we will mostly use the data segment, and we&#8217;ll cover heap and stack in future articles on dynamic memory allocation and function calls.</p><p>But, before jumping to memory access modes, we should spend a few minutes learning how to do static memory allocation in the <code>.data</code> section, as we will be using static memory throughout the rest of this article.</p><h3>Static Memory Allocation in the .data section</h3><p>The data segment in the process&#8217;s address space is populated based on the contents of the <code>.data</code> section of the executable binary. When we want to create static data in our program, such as global variables or constants, we can put them in the <code>.data</code> section of our program.</p><p>To create a static value in the <code>.data</code> section, we need to do three things:</p><ul><li><p><strong>Create a label</strong>: At the time of writing assembly, we don&#8217;t know the exact memory address of the values or instructions, so we must use labels. At linking time, the linker replaces labels with the final addresses in the executable that it generates. So, creating a label for the value gives us a way to refer to its address.</p></li><li><p><strong>Declare the size</strong>: We need to tell the assembler the size of the value, so that it can reserve that much space in the <code>.data</code> section. 
If you read the <a href="https://blog.codingconfessions.com/p/x86-registers">article on registers</a>, you may recall that we have the following size directives:</p><ul><li><p><code>.quad</code>: For 8-byte values</p></li><li><p><code>.long</code>: For 4-byte values</p></li><li><p><code>.word</code>: For 2-byte values</p></li><li><p><code>.byte</code>: For single-byte values</p></li><li><p>Apart from these, we also have the <code>.asciz</code> directive to create a NUL-terminated ASCII string.</p></li></ul></li><li><p><strong>Declare the value</strong>: Finally, provide the value.</p></li></ul><p>The following example shows how we can create an 8-byte integer value in the <code>.data</code> section with the label <code>ANSWER_TO_LIFE</code>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wKRW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbac9321c-34f5-455b-899a-1e242d3a76ab_633x245.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wKRW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbac9321c-34f5-455b-899a-1e242d3a76ab_633x245.png 424w, https://substackcdn.com/image/fetch/$s_!wKRW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbac9321c-34f5-455b-899a-1e242d3a76ab_633x245.png 848w, https://substackcdn.com/image/fetch/$s_!wKRW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbac9321c-34f5-455b-899a-1e242d3a76ab_633x245.png 1272w, 
https://substackcdn.com/image/fetch/$s_!wKRW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbac9321c-34f5-455b-899a-1e242d3a76ab_633x245.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wKRW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbac9321c-34f5-455b-899a-1e242d3a76ab_633x245.png" width="633" height="245" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bac9321c-34f5-455b-899a-1e242d3a76ab_633x245.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:245,&quot;width&quot;:633,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43536,&quot;alt&quot;:&quot;Syntax for allocating data in the .data section&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/161941599?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbac9321c-34f5-455b-899a-1e242d3a76ab_633x245.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Syntax for allocating data in the .data section" title="Syntax for allocating data in the .data section" srcset="https://substackcdn.com/image/fetch/$s_!wKRW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbac9321c-34f5-455b-899a-1e242d3a76ab_633x245.png 424w, https://substackcdn.com/image/fetch/$s_!wKRW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbac9321c-34f5-455b-899a-1e242d3a76ab_633x245.png 848w, 
https://substackcdn.com/image/fetch/$s_!wKRW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbac9321c-34f5-455b-899a-1e242d3a76ab_633x245.png 1272w, https://substackcdn.com/image/fetch/$s_!wKRW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbac9321c-34f5-455b-899a-1e242d3a76ab_633x245.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Syntax for allocating data in the .data section</figcaption></figure></div><p>This example allocates a single 64-bit 
value, but it is also possible to create more complex structures. For instance, we can create a struct-like object as shown in the example below:</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.codingconfessions.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"><em>This series is exclusively for the paid subscribers. Their support keeps this publication sustainable. To access this series and other exclusive content, please consider upgrading to a paid subscription</em></p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>
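<p>A struct-like layout of this kind can be sketched as follows. This is only an illustrative sketch in GNU assembler (AT&amp;T) syntax, not the author&#8217;s original example: the labels <code>ORIGIN_POINT</code> and <code>ORIGIN_NAME</code>, the field meanings, and the values are all hypothetical.</p><pre><code>        .data
# Hypothetical label and field names, chosen only for illustration.
# Fields are emitted back to back in declaration order; the assembler
# adds no padding between them unless we ask for it (e.g. with .balign).
ORIGIN_POINT:
        .quad 0             # offset 0:  x coordinate (8 bytes)
        .quad 0             # offset 8:  y coordinate (8 bytes)
        .long 42            # offset 16: id (4 bytes)
        .word 1             # offset 20: flags (2 bytes)
        .byte 1             # offset 22: visible (1 byte)
ORIGIN_NAME:
        .asciz "origin"     # 7 bytes: six characters plus the terminating NUL
</code></pre><p>Each field&#8217;s address is the label&#8217;s address plus the field&#8217;s offset, so in this sketch the <code>id</code> field lives at <code>ORIGIN_POINT + 16</code>.</p>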
      <p>
          <a href="https://blog.codingconfessions.com/p/x86-addressing-modes-part-1-immediate">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[A Systems Engineer’s Guide to Benchmarking with RDTSC]]></title><description><![CDATA[A deep dive into rdtsc, instruction stream serialization, and memory fences for precise cycle-level performance measurement.]]></description><link>https://blog.codingconfessions.com/p/rdtsc</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/rdtsc</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Thu, 23 Oct 2025 11:31:05 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!4Ex0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d1cd0-ef2a-4076-a0d1-cefae8e1206d_1536x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4Ex0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d1cd0-ef2a-4076-a0d1-cefae8e1206d_1536x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4Ex0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d1cd0-ef2a-4076-a0d1-cefae8e1206d_1536x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!4Ex0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d1cd0-ef2a-4076-a0d1-cefae8e1206d_1536x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!4Ex0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d1cd0-ef2a-4076-a0d1-cefae8e1206d_1536x1024.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!4Ex0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d1cd0-ef2a-4076-a0d1-cefae8e1206d_1536x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4Ex0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d1cd0-ef2a-4076-a0d1-cefae8e1206d_1536x1024.jpeg" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c47d1cd0-ef2a-4076-a0d1-cefae8e1206d_1536x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:245669,&quot;alt&quot;:&quot;Cover image: Depicting rdtsc, lfence and rdtscp instruction with a backdrop of a clock&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/173003537?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d1cd0-ef2a-4076-a0d1-cefae8e1206d_1536x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Cover image: Depicting rdtsc, lfence and rdtscp instruction with a backdrop of a clock" title="Cover image: Depicting rdtsc, lfence and rdtscp instruction with a backdrop of a clock" srcset="https://substackcdn.com/image/fetch/$s_!4Ex0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d1cd0-ef2a-4076-a0d1-cefae8e1206d_1536x1024.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!4Ex0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d1cd0-ef2a-4076-a0d1-cefae8e1206d_1536x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!4Ex0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d1cd0-ef2a-4076-a0d1-cefae8e1206d_1536x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!4Ex0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d1cd0-ef2a-4076-a0d1-cefae8e1206d_1536x1024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Cover image: Depicting rdtsc, lfence and rdtscp instruction with a backdrop of a clock</figcaption></figure></div><p>Performance is critical for systems programmers, and accurate benchmarking is the foundation of meaningful optimization. To truly understand where your code spends time, you need precise and low-overhead measurements, especially when a piece of code may execute in just a few hundred CPU cycles.</p><p>Most developers reach for familiar high-level timers, such as Python&#8217;s <a href="https://docs.python.org/3/library/time.html#time.perf_counter">time.perf_counter()</a> or Java&#8217;s <a href="https://docs.oracle.com/javase/8/docs/api/java/lang/System.html#currentTimeMillis--">System.currentTimeMillis()</a>. These are convenient but rely on system calls like <a href="https://man7.org/linux/man-pages/man3/clock_gettime.3.html">clock_gettime</a> which introduce hundreds of cycles of overhead. In certain situations, this overhead can be too much. And when profiling production systems, you want the overheads to be as minimal as possible.</p><p>We need a way to read time directly from the hardware, without leaving the user space. On x86 systems, that mechanism is the <code>rdtsc</code> instruction. It gives us near-zero-overhead access to the CPU&#8217;s internal timestamp counter, but using it correctly requires an understanding of how modern processors execute instructions.</p><p>In this article, we&#8217;ll learn how to use <code>rdtsc</code> to do benchmarking. 
Specifically we will cover the following topics in detail:</p><ul><li><p><strong>What </strong><code>rdtsc</code><strong> does:</strong> How it reads the CPU&#8217;s internal timestamp counter and why it provides near-zero-overhead timing.</p></li><li><p><strong>Understanding CPU behavior:</strong> How out-of-order execution can distort timing results and why instruction ordering matters.</p></li><li><p><strong>Instruction stream serialization:</strong> What it means, how the CPU reorders instructions, and how serializing instructions (like <code>cpuid</code>) enforce strict ordering.</p></li><li><p><strong>Memory fences:</strong> How <code>lfence</code>, <code>sfence</code>, and <code>mfence</code> provide lighter-weight ordering guarantees that help isolate measurement code.</p></li><li><p><strong>Combining it all:</strong> Practical example of using these mechanisms together to obtain stable and reproducible timing measurements.</p></li></ul><p>By the end, you&#8217;ll know not only how to use <code>rdtsc</code> safely and accurately but also <em>why</em> these extra steps are essential for meaningful microbenchmarking.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.codingconfessions.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"><em>Writing these deep dives takes 100+ hours of work. 
If you find this valuable and insightful, please consider upgrading to a paid subscription to keep this work alive.</em></p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>Understanding The Timestamp Counter in the CPU</h2><p>In the x86 architecture, every CPU comes with a special 64-bit counter, called the <em>timestamp counter</em> (TSC), that gets incremented at a fixed frequency. If you read the value of the counter before and after the execution of a block of code, the difference tells you how many TSC ticks elapsed while that code executed.</p><p>When the counter overflows, it resets to 0. However, because it is a 64-bit counter, it will take an extremely long time for it to overflow. For instance, if the counter increments at a frequency of 1 GHz, it will take about 585 years to overflow (2<sup>64</sup> ticks at 10<sup>9</sup> ticks per second is about 1.8 &#215; 10<sup>10</sup> seconds).</p><p>The frequency at which the timestamp counter increments is not the same as the real CPU frequency. In the past, it used to be tied to the CPU frequency, but as CPUs started to have dynamic frequency scaling, the timestamp counter was made to tick at a fixed, constant frequency to get stable measurements. For example, some of the cores on my laptop have a frequency range of 800 MHz to 4800 MHz, but the TSC ticks at 2.3 GHz.</p><p>So, how do we read the TSC? The x86 instruction set provides two instructions for doing this: <code>rdtsc</code> and <code>rdtscp</code>. But measuring the timing of a block of code with these is not as simple as slapping <code>rdtsc</code> before and after the code block. In practice, it looks like the following code snippet:</p><pre><code>#include &lt;x86intrin.h&gt;

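#include &lt;stdint.h&gt;

uint32_t cpuid;                   // filled by rdtscp with IA32_TSC_AUX (not used here)
_mm_lfence();                     // let earlier instructions finish before reading the TSC
uint64_t start = __rdtsc();

for (int i = 0; i &lt; ITERS; i++) {
  // expensive loop body
}

uint64_t end = __rdtscp(&amp;cpuid);  // waits for the loop to finish, then reads the TSC
_mm_lfence();                     // keep later instructions from starting before the read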
uint64_t ncycles = end - start;</code></pre><p>In this snippet, I have used the GCC compiler intrinsics <code>__rdtsc</code> and <code>__rdtscp</code> for invoking the <code>rdtsc</code> and <code>rdtscp</code> instructions respectively. But you may ask, what is the significance of using <code>_mm_lfence()</code> before and after the measurement? You may also question why we used <code>rdtsc</code> for reading the starting value of the TSC and <code>rdtscp</code> for the ending measurement. To answer these questions, we have to go deeper and think about how the processor executes instructions.</p><h2>Out of Order Execution and Serializing Instructions</h2><p>Let&#8217;s step back a bit and talk about how the CPU executes instructions.</p><p>Modern x86 CPUs do out-of-order execution of the instruction stream to execute multiple instructions in parallel. They do this by looking at a window of instructions in the instruction stream, identifying independent instructions and executing them in parallel. As a result, an instruction that appears later in the program order may execute much earlier than its predecessors.</p><p>For example, imagine an instruction stream like the one in the snippet below. Here, we are interested in measuring the time taken to execute instructions <code>I4</code> to <code>I6</code>, so we have inserted an <code>rdtsc</code> instruction after <code>I3</code> and <code>I6</code>.</p><pre><code>I0, I1, I2, I3, rdtsc, I4, I5, I6, rdtsc,...</code></pre><p>Due to the out-of-order nature of instruction execution, we cannot guarantee that the <code>rdtsc</code> instructions will execute exactly where they appear in the program order. It is possible that the CPU executes the first <code>rdtsc</code> right after <code>I1</code>. 
In that case, our measurement will include the timing of <code>I2</code> and <code>I3</code> as well, which is not what we want.</p><p>We need a way to force the CPU to not execute <code>rdtsc</code> out of order and also ensure that all the previous instructions have finished executing when it executes <code>rdtsc</code>. This can be achieved by forcing serialization of the instruction stream right before <code>rdtsc</code>. Let&#8217;s understand what that means.</p><h3>Serializing the Instruction Stream</h3><p>There are certain instructions in the x86 architecture that force serialization of the instruction stream. Basically, the serializing instruction acts like a barrier. The CPU cannot execute it until all the instructions appearing before it in the program have finished. Also, it cannot begin executing any instruction appearing after the serializing instruction until the serializing instruction has finished.</p><blockquote><p><em>To be precise, a serializing instruction also requires that all the flags, registers and memory modifications must finish before it executes and all the CPU buffers must be drained.</em></p></blockquote><p>So, if we insert such a serializing instruction before <code>rdtsc</code>, then we can guarantee that the <code>rdtsc</code> instruction will <em>not</em> be executed by the processor out of its actual order.</p><p>There are a few such serializing instructions available in the x86 architecture, such as:</p><ul><li><p><strong>serialize</strong>: serializes the instruction stream</p></li><li><p><strong>cpuid</strong>: used to identify the CPU model and features</p></li><li><p><strong>iret</strong>: returns control from an interrupt handler back to the interrupted application</p></li><li><p><strong>rsm</strong>: resumes from system management mode</p></li></ul><p>Out of these, <code>iret</code> and <code>rsm</code> are control flow modifying instructions, so you cannot use them solely for the purpose of 
serializing the instruction stream. In the past, <code>cpuid</code> was the recommended instruction for use in combination with <code>rdtsc</code>, and it is still an option today. However, it adds a slight overhead because the CPU has to do extra work to execute it apart from serializing the instruction stream. A much more lightweight alternative is the <code>lfence</code> instruction that we saw in the snippet above. <code>lfence</code> is not a proper serializing instruction, but a memory ordering instruction. However, it serves the purpose. Let&#8217;s understand what it does.</p><blockquote><p>We didn&#8217;t consider the <code>serialize</code> instruction because it is only available on Intel processors and missing on AMD. The instruction exists purely for serializing the instruction stream, so it is a good option. Alas, it is not portable.</p></blockquote><h3>The <code>lfence</code> instruction</h3><p>An alternative to using serializing instructions with <code>rdtsc</code> is using memory ordering instructions, such as <code>lfence</code>, <code>sfence</code>, or <code>mfence</code>. These instructions add less overhead than pure serializing instructions, such as <code>cpuid</code>. Let&#8217;s understand how.</p>
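<p>Before going further, here is a minimal sketch that wraps the fenced reads from the earlier snippet into two helpers and times a small workload. The names <code>tsc_begin</code>, <code>tsc_end</code>, <code>sum_to</code>, and <code>measure_sum_to</code> are hypothetical and exist only for illustration. The sketch assumes GCC or Clang on an x86-64 machine, and it mirrors the fence placement used earlier rather than prescribing the only valid one.</p>

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, __rdtscp, _mm_lfence (GCC/Clang, x86 only) */

/* Hypothetical helper: timestamp at the start of the measured region.
   Same order as the earlier snippet: lfence first, then rdtsc, so the
   read does not execute until earlier instructions have completed. */
static inline uint64_t tsc_begin(void) {
    _mm_lfence();
    return __rdtsc();
}

/* Hypothetical helper: timestamp at the end of the measured region.
   rdtscp waits for earlier instructions to execute before reading the
   counter; the trailing lfence keeps later instructions from starting
   before that read has happened. */
static inline uint64_t tsc_end(void) {
    unsigned int aux;              /* receives IA32_TSC_AUX; ignored here */
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}

/* Hypothetical workload: sums 1..n. The volatile accumulator stops the
   compiler from folding the loop into a constant. */
static uint64_t sum_to(uint64_t n) {
    volatile uint64_t acc = 0;
    for (uint64_t i = 1; i <= n; i++) {
        acc += i;
    }
    return acc;
}

/* Returns the elapsed TSC ticks and stores the workload result in *result. */
static uint64_t measure_sum_to(uint64_t n, uint64_t *result) {
    uint64_t start = tsc_begin();
    *result = sum_to(n);
    uint64_t end = tsc_end();
    return end - start;
}
```

<p>Calling <code>measure_sum_to(1000, &amp;result)</code> returns the number of TSC ticks that elapsed and leaves the sum in <code>result</code>. A single reading is usually noisy, so a real benchmark would likely repeat the measurement many times and look at the minimum or the median.</p>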
      <p>
          <a href="https://blog.codingconfessions.com/p/rdtsc">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[My Top 5 Favourite Features in Python 3.14]]></title><description><![CDATA[Exploring the concurrency, debugging, and performance upgrades that make Python 3.14 special.]]></description><link>https://blog.codingconfessions.com/p/python-3-14-whats-new</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/python-3-14-whats-new</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Sat, 11 Oct 2025 08:45:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!YLWy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45241f2c-9f42-468b-be7a-cdebb6e79969_1536x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YLWy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45241f2c-9f42-468b-be7a-cdebb6e79969_1536x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YLWy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45241f2c-9f42-468b-be7a-cdebb6e79969_1536x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!YLWy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45241f2c-9f42-468b-be7a-cdebb6e79969_1536x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!YLWy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45241f2c-9f42-468b-be7a-cdebb6e79969_1536x1024.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!YLWy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45241f2c-9f42-468b-be7a-cdebb6e79969_1536x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YLWy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45241f2c-9f42-468b-be7a-cdebb6e79969_1536x1024.jpeg" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/45241f2c-9f42-468b-be7a-cdebb6e79969_1536x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:161192,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/175782380?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45241f2c-9f42-468b-be7a-cdebb6e79969_1536x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YLWy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45241f2c-9f42-468b-be7a-cdebb6e79969_1536x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!YLWy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45241f2c-9f42-468b-be7a-cdebb6e79969_1536x1024.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!YLWy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45241f2c-9f42-468b-be7a-cdebb6e79969_1536x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!YLWy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45241f2c-9f42-468b-be7a-cdebb6e79969_1536x1024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The Pi release of Python (so named because it is version 3.14, matching the digits of &#960;) is finally here. 
You can go through the list of new features and major changes yourself in the <a href="https://docs.python.org/3.14/whatsnew/3.14.html">release notes</a>. In this post, I want to go through my top 5 favorite features of this release that I find exciting as a Python programmer and also as an engineer who loves studying system internals.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.codingconfessions.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"><em>I usually write long, in-depth explainers, but today&#8217;s piece is a shorter look at what&#8217;s new in Python 3.14. If you enjoy this mix of quick takes and deep dives, you can support my work by upgrading to a paid plan.</em></p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>The Free Threading Python</h2><p>In practical terms, the free&#8209;threaded build allows Python programs to take advantage of multiple CPU cores concurrently, enabling true parallel execution of threads for compute&#8209;intensive workloads.</p><p>Until Python 3.13, it was not possible to run multiple threads in parallel in Python due to the global interpreter lock (GIL), which is a global mutex inside the Python interpreter. A thread needs to acquire this lock before it can run on the CPU. This meant that even if you had a large multicore machine, your Python process was still only using a single core. 
Solutions like <a href="https://docs.python.org/3/library/multiprocessing.html">multiprocessing</a> were created as a workaround for this limitation.</p><p>Prior to the Python 3.13 release, <a href="https://peps.python.org/pep-0703/">PEP-703</a> was proposed to make the GIL optional. The PEP laid out a plan to introduce changes so that it would be possible to build a version of Python without the GIL by specifying a build flag.</p><p>The free-threaded build shipped as an experimental option in 3.13 and is officially supported as of 3.14. As a result, this release of Python comes in two versions: one with the GIL and one without it. If you use <a href="https://docs.astral.sh/uv/">uv</a>, you can install the two versions using these commands:</p><pre><code>uv python install 3.14   # with the GIL
uv python install 3.14t  # without the GIL (free-threaded build)</code></pre><blockquote><p><em>Note that the free-threaded build of Python breaks the ABI, and all the third-party packages that use the C API of CPython need to be recompiled, so not all the scientific computing packages may be immediately available for use with it.</em></p></blockquote><h3>Reference Reading</h3><p><a href="https://peps.python.org/pep-0703/">PEP-703</a>, which describes the work behind removing the GIL, is a great read to understand the challenges involved and how this work has been done.</p><div><hr></div><h2>Concurrent Interpreters</h2><p>A very exciting new feature in the 3.14 release is the introduction of the <a href="https://docs.python.org/3/library/concurrent.interpreters.html">concurrent.interpreters</a> module in the standard library. It allows you to run multiple Python interpreters in parallel within the same Python process. Because each interpreter has its own GIL, it enables yet another kind of parallelism in Python despite the GIL.</p><p>The actual implementation details behind this are tricky to explain, so I will cover them in another post. If you have read my article on <a href="https://blog.codingconfessions.com/p/cpython-runtime-internals">CPython runtime bootstrapping</a>, you might be able to put the pieces together. But here is the executive summary.</p><p>By default, the Python process has one main interpreter and one main thread. But now, you have the ability to create multiple interpreters on demand at runtime using the <code>concurrent.interpreters</code> module. These other interpreters created at runtime are also referred to as <em>subinterpreters</em>. Creating a subinterpreter is as easy as calling the <code>create()</code> function of <code>concurrent.interpreters</code>.</p><pre><code>import concurrent.interpreters
interp1 = concurrent.interpreters.create()</code></pre><p>After the above call, the Python process has two interpreters inside it. Internally, the runtime tracks these using a linked list of <em>interpreter state</em> objects. An interpreter state represents the internal execution state of an interpreter. By providing each interpreter its own interpreter state, the runtime isolates them at Python code execution level.</p><p>To execute code on this new interpreter, we can invoke its <code>call()</code> method. For example:</p><pre><code>&gt;&gt;&gt; def sum(a,b):
...     return a + b
...
&gt;&gt;&gt; interp1.call(sum, 10, 2)
12</code></pre><p>However, this isn&#8217;t parallel execution because there is only one thread running in the Python process. So, the runtime simply switches the thread from executing the code inside the main interpreter to executing code inside the subinterpreter.</p><p>To execute code on the interpreter in its own thread, we can use the <code>call_in_thread()</code> method. Internally, this creates a new thread that executes the code in its own context. This is a non-blocking call and we cannot get the result back. So, to communicate data between interpreters, we have to create a queue using <code>concurrent.interpreters.create_queue()</code> method. Here is an example that puts all of this together.</p><pre><code>&gt;&gt;&gt; def add(q, a, b):
...   q.put(a+b)
...
... interp1 = concurrent.interpreters.create()
... queue = concurrent.interpreters.create_queue()
... t = interp1.call_in_thread(add, queue, 10, 2)
... result = queue.get()
... print(result)
...
12
</code></pre><p>Here, we have created a queue, and passed it to the <code>add</code> function. The <code>add</code> function puts the result in the queue. In the main interpreter, we poll the queue for the result using its <code>get()</code> method, which blocks until there is some data in the queue.</p><p>If you are curious about how all of this works under the hood, let me know and we can cover the internals in a future post.</p><h3>Reference Reading</h3><p>If you want to learn more about the runtime data structures behind this, I recommend the following article:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;665679b3-abeb-4056-a0dd-a53e4b39f7ac&quot;,&quot;caption&quot;:&quot;While this article is freely available to read online, I am also making a PDF of this article available. If you enjoy reading in that format, you can purchase it at the below link. If you are a paid subscriber you can find a 100% discount code in the header of the email, or just reach out to me via email or DM and I will give you the PDF.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;CPython Runtime Internals: Key Data Structures &amp; Runtime Bootstrapping&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. 
I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2024-04-26T15:08:37.790Z&quot;,&quot;cover_image&quot;:&quot;https://images.unsplash.com/photo-1599837565318-67429bde7162?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNXx8cHl0aG9uJTIwcHJvY2Vzc3xlbnwwfHx8fDE3MTQxMzk0Nzl8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/cpython-runtime-internals&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:143895035,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:36,&quot;comment_count&quot;:0,&quot;publication_id&quot;:1611829,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!lstI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><h2>Remote Debugging Support</h2><p>Beyond concurrency, Python 3.14 also introduces major improvements in tooling.</p><p>Debugging running Python processes has always been a pain. In order to debug it using a debugger, such as pdb, you need to manually add breakpoints in the code, then restart the process and wait for them to be hit again. 
In production systems, this can be infeasible.</p><p>The motivation for the new feature is to simplify this experience: with Python 3.14, you can attach to a running process using <code>python -m pdb -p &lt;pid&gt;</code>, eliminating the need to restart it.</p><p>Technically, the CPython interpreter already had provisions to allow remote processes to connect to it and navigate its runtime state. This is how remote profilers, such as <a href="https://github.com/plasma-umass/scalene">scalene</a>, <a href="https://github.com/benfred/py-spy">py-spy</a> and others work. As part of <a href="https://peps.python.org/pep-0768/">PEP-768</a>, this framework has been extended to allow debuggers to connect and debug the Python interpreter.</p><p>A debugger can now attach to a Python process and update specific fields in its runtime data structures to signal that it wants to begin debugging. When the interpreter detects this, it provides a debug prompt where you can set breakpoints and debug as usual.</p><p>While pdb has already been updated to support remote debugging, this framework also exposes an API, <a href="https://docs.python.org/3.14/library/sys.html#sys.remote_exec">sys.remote_exec</a>, so external debuggers can leverage this functionality without needing low-level C integration.</p><h3>Reference Video</h3><p>In a past live session, I talked about how remote profilers work, which is also how the remote debugger support has been implemented. So, if you are curious, give it a watch.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;fdee18cf-9df8-4cd2-a9b7-ee2d7698e2a8&quot;,&quot;caption&quot;:&quot;Yesterday, we did the live session on the internals of remote sampling profilers. We learned the internals that are required to build such tools. 
Building these tools is probably one of the most interesting systems programming projects that you can do to not only learn the internals of a programming language, but also learn the ELF file format.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Recording: CPython and ELF Essentials for Building a Basic Remote Profiler&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2024-06-03T05:23:22.416Z&quot;,&quot;cover_image&quot;:&quot;https://substack-video.s3.amazonaws.com/video_upload/post/145244402/e6506f7e-23b9-4ec7-b8de-4dc5fd85571f/transcoded-00001.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/recording-cpython-and-elf-essentials&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:&quot;e6506f7e-23b9-4ec7-b8de-4dc5fd85571f&quot;,&quot;id&quot;:145244402,&quot;type&quot;:&quot;podcast&quot;,&quot;reaction_count&quot;:18,&quot;comment_count&quot;:0,&quot;publication_id&quot;:1611829,&quot;publication_name&quot;:&quot;Confessions of a Code 
Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!lstI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><h2>Incremental Garbage Collection</h2><p>Complementing the concurrency and debugging improvements discussed earlier, this feature enhances runtime stability and responsiveness by addressing garbage collection performance.</p><p>In a past article, I explained in detail the <a href="https://blog.codingconfessions.com/p/connecting-cpythons-gc-internals">cost of a full heap scan by the garbage collector in CPython</a>. A full scan is expensive, and it also introduces unpredictable latency in your APIs, because while the GC is running, the interpreter does not execute any Python code. Incremental garbage collection makes the GC overhead predictable, resulting in smoother performance for latency-sensitive workloads.</p><p>Let&#8217;s first understand how the GC used to work before this change. There were three collectable generations: the young generation, the first old generation, and the second old generation (the oldest). There were configurable thresholds for each generation that defined when the GC would scan it. For example, the young generation would be scanned once the number of objects in it exceeded a threshold such as 10,000.</p><p>Any object that survives a scan of the young generation gets promoted to the first old generation. The first old generation gets scanned when the young generation has been scanned a configured number of times, such as 10 times. When that happens, the GC scans both the young generation and the first old generation. 
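</p><p>As a rough illustration, the per-generation thresholds described above can be inspected from Python through the standard <code>gc</code> module. This is a minimal sketch: the values printed depend on the CPython version you run it on, and the <code>Node</code> class and <code>make_cycle</code> helper are hypothetical names used only for the demonstration.</p><pre><code>import gc

# Thresholds that control how often each generation is scanned.
# The concrete numbers vary between CPython versions.
thresholds = gc.get_threshold()
print("thresholds:", thresholds)

# Counters that the collector compares against those thresholds.
print("counts:", gc.get_count())


class Node:
    """Hypothetical container type that can take part in a reference cycle."""

    def __init__(self):
        self.other = None


def make_cycle():
    # Two objects pointing at each other: unreachable once the function
    # returns, but reference counting alone cannot free them.
    a, b = Node(), Node()
    a.other, b.other = b, a


gc.disable()           # keep automatic collections out of the way
gc.collect()           # start from a clean state
make_cycle()
found = gc.collect(0)  # request a scan of only the youngest generation
gc.enable()
print("unreachable objects found:", found)
</code></pre><p>Passing <code>0</code> to <code>gc.collect</code> requests a collection of only the youngest generation, which is the cheap and frequent kind of scan described above.</p><p>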
Any object that survives a scan of the first old generation gets promoted to the second old generation (also known as the oldest generation).</p><p>The oldest generation is scanned when the first old generation has been scanned a configured number of times. When that threshold is reached, the GC performs a full heap scan, i.e. it scans all three generations. Naturally, this gets expensive.</p><p>Incremental garbage collection improves this. It reduces the number of GC generations to just two: young and old. On each GC cycle, the collector scans the young generation and a fraction of the old generation. This way, the amount of work that the GC does on each cycle becomes consistent, which eliminates the long pauses and latency spikes caused by full heap scans.</p><h3>Reference Reading</h3><p>If you want to read more about CPython&#8217;s garbage collector, I recommend the following articles:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;92aaed4c-771c-4103-99b7-e88cfcf7a469&quot;,&quot;caption&quot;:&quot;We&#8217;ve been talking about CPython internals and in the last article I went quite deep in CPython&#8217;s runtime. One of the crucial services that the runtime provides is that of managing a program&#8217;s memory during execution.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;CPython Garbage Collection: The Internal Mechanics and Algorithms&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. 
I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2024-06-11T13:23:25.751Z&quot;,&quot;cover_image&quot;:&quot;https://images.unsplash.com/photo-1503596476-1c12a8ba09a9?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHxnYXJiYWdlJTIwY29sbGVjdGlvbnxlbnwwfHx8fDE3MTgxMDU0OTh8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/cpython-garbage-collection-internals&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:144615668,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:42,&quot;comment_count&quot;:1,&quot;publication_id&quot;:1611829,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!lstI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;a218af0c-0055-485e-b72e-28dfaab77d99&quot;,&quot;caption&quot;:&quot;A while back I published a detailed code walkthrough of CPython's GC implementation, but there was a need for a higher level explanation of the overall memory management mechanism of CPython without discussing the code. This article fills that gap. 
It provides a detailed overview of the overall memory management mechanism in CPython. The main focus is o&#8230;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;CPython's Garbage Collector and its Impact on Application Performance&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2024-10-02T15:58:15.592Z&quot;,&quot;cover_image&quot;:&quot;https://images.unsplash.com/photo-1620043823875-ccda6ea05e78?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw4NXx8Z2FyYmFnZXxlbnwwfHx8fDE3Mjc4NzY3NjJ8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/connecting-cpythons-gc-internals&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:149651253,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:22,&quot;comment_count&quot;:0,&quot;publication_id&quot;:1611829,&quot;publication_name&quot;:&quot;Confessions of a Code 
Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!lstI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><h2>Tail Calling Interpreter</h2><p>Finally, my favorite change as part of this release is the tail calling interpreter. It is a rewrite of the bytecode dispatch loop in the CPython virtual machine and improves performance of Python code execution by ~5%. </p><p>The bytecode dispatch loop is the heart of the interpreter where the bytecode instructions of your compiled Python program are evaluated. The faster this loop runs, the faster your Python program executes, so performance improvements in this area are always very exciting to understand. I have already written a very detailed article on <a href="https://blog.codingconfessions.com/p/cpython-vm-internals">the design and implementation of the dispatch loop in CPython</a>, and I have another article in progress to explain the tail calling interpreter. So, I will be brief here.</p><p>Your Python program gets compiled to a sequence of bytecode instructions. For example, the following snippet shows the bytecode instructions for a single line of code: <code>a + b</code>. The bytecode dispatch loop iterates over these instructions one by one and executes them. </p><pre><code>&gt;&gt;&gt; import dis
&gt;&gt;&gt; dis.dis(&quot;a + b&quot;)
  0           0 RESUME                   0

  1           2 LOAD_NAME                0 (a)
              4 LOAD_NAME                1 (b)
              6 BINARY_OP                0 (+)
             10 RETURN_VALUE
</code></pre><p>The most obvious way of writing this loop is using a switch statement. The problem is that Python has hundreds of bytecode instructions, so this switch statement is huge. Optimizing such large functions is hard for compilers. For example, the compiler cannot allocate registers optimally, and some of the key variables can get spilled onto the stack, resulting in poor performance.</p><blockquote><p><em>CPython also has a computed goto based implementation of the dispatch loop, but it suffers from the same problem. If you are not familiar with computed goto based dispatch loops, read my article on <a href="https://blog.codingconfessions.com/p/cpython-vm-internals">the design and implementation of the CPython dispatch loop</a>.</em></p></blockquote><p>The tail calling interpreter solves this by separating the implementation of each bytecode instruction into an individual function. For example, there is one function for handling LOAD_NAME, another for BINARY_OP, and so on.</p><p>This implementation is called a tail calling interpreter because of the way these functions are written. At their end, instead of returning, these functions call the function for the next bytecode instruction. They do this by looking up a function pointer table using the next bytecode instruction as an index. The signature and return type of each of these functions are identical, and because these calls occur at the end of the function, they are tail calls. The compiler can optimize these tail calls and convert them into jumps, which avoids the overhead of function calls.</p><p>This implementation improves performance for one fundamental reason:</p><div class="pullquote"><p><em>It results in small functions for handling each bytecode instruction, which the compiler can optimize much better and for which it can do optimal register allocation.</em></p></div><p>Overall, this has shown improvement over the previous switch-based and computed goto based implementations. 
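</p><p>The idea can be sketched in Python, with the caveat that Python itself performs no tail-call elimination, so a small driver loop (a trampoline) stands in for the jumps a C compiler would emit. The opcodes, the handler names, and the <code>run</code> function below are hypothetical and are not CPython&#8217;s actual implementation.</p><pre><code># Hypothetical three-instruction VM used only to illustrate the dispatch style.
OP_PUSH, OP_ADD, OP_RETURN = range(3)


def dispatch(code, pc):
    # Look up the handler for the instruction at pc in the handler table.
    return HANDLERS[code[pc][0]], pc


def op_push(code, pc, stack):
    stack.append(code[pc][1])
    # The last action of the handler is to hand over to the next handler.
    return dispatch(code, pc + 1)


def op_add(code, pc, stack):
    right = stack.pop()
    left = stack.pop()
    stack.append(left + right)
    return dispatch(code, pc + 1)


def op_return(code, pc, stack):
    # No next handler: signal the driver loop to stop.
    return None, pc


# One small function per opcode, indexed by the opcode number.
HANDLERS = [op_push, op_add, op_return]


def run(code):
    stack = []
    handler, pc = dispatch(code, 0)
    # Trampoline: stands in for the jump that tail-call optimization gives in C.
    while handler is not None:
        handler, pc = handler(code, pc, stack)
    return stack.pop()


program = [(OP_PUSH, 2), (OP_PUSH, 3), (OP_ADD, None), (OP_RETURN, None)]
result = run(program)
print(result)
</code></pre><p>In the real interpreter no driver loop is needed, because the C compiler turns the final call in each handler into a direct jump to the next handler.</p><p>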
However, it requires compiler support for performing tail call optimization which is not present in all compilers. As a result, right now the feature is opt-in only and you need to build CPython from source using a supported compiler, such as clang 19. </p><h3>Reference Reading</h3><p>If you want to understand the internals of the CPython bytecode interpreter and the dispatch loop, read the following article:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;afa0fbf7-86d6-4fa9-8b22-b1e387565694&quot;,&quot;caption&quot;:&quot;For every bytecode compiled language, the most interesting part of its implementation is its virtual machine (also referred to as the bytecode interpreter) where the bytecode execution takes place. Because this is such a crucial part of the language machinery, its implementation has to be highly performant. Even if you are not a compiler engineer, learning about such internal implementation can give you new performance tricks and insights that you may be able to use in other places of your job. And, if you are a compiler engineer then you should always look around how other languages are implemented to pickup implementation details that you may not be aware of.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The Design &amp; Implementation of the CPython Virtual Machine&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. 
I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2024-08-31T14:35:14.115Z&quot;,&quot;cover_image&quot;:&quot;https://images.unsplash.com/photo-1504639725590-34d0984388bd?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxOXx8dmlydHVhbCUyMG1hY2hpbmV8ZW58MHx8fHwxNzI1MDI0MzE1fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/cpython-vm-internals&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:143567425,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:46,&quot;comment_count&quot;:0,&quot;publication_id&quot;:1611829,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!lstI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><h2>Wrapping Up</h2><p>Although there are many other new features and improvements in this release of Python, I picked these because of my interest in Python internals and performance. Apart from that, changes such as remote debugger and GIL removal are also very exciting to understand from an engineering point of view. Studying these can give you insights that can help you improve as an engineer. 
</p><p>I have plans to write about some of these in future posts. But if you would like me to cover something specific, let me know.</p>]]></content:encoded></item><item><title><![CDATA[Understanding Weak References in Python]]></title><description><![CDATA[Understanding Python&#8217;s memory management with weak references]]></description><link>https://blog.codingconfessions.com/p/a-strong-reference-to-weak-references</link><guid 
isPermaLink="false">https://blog.codingconfessions.com/p/a-strong-reference-to-weak-references</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Tue, 30 Sep 2025 15:01:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!_lZX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe8ea5e-8fd1-494e-aeae-ba0d0c961b3f_1536x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_lZX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe8ea5e-8fd1-494e-aeae-ba0d0c961b3f_1536x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_lZX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe8ea5e-8fd1-494e-aeae-ba0d0c961b3f_1536x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!_lZX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe8ea5e-8fd1-494e-aeae-ba0d0c961b3f_1536x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!_lZX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe8ea5e-8fd1-494e-aeae-ba0d0c961b3f_1536x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!_lZX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe8ea5e-8fd1-494e-aeae-ba0d0c961b3f_1536x1024.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!_lZX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe8ea5e-8fd1-494e-aeae-ba0d0c961b3f_1536x1024.jpeg" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cbe8ea5e-8fd1-494e-aeae-ba0d0c961b3f_1536x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:120448,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/172320360?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe8ea5e-8fd1-494e-aeae-ba0d0c961b3f_1536x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_lZX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe8ea5e-8fd1-494e-aeae-ba0d0c961b3f_1536x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!_lZX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe8ea5e-8fd1-494e-aeae-ba0d0c961b3f_1536x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!_lZX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe8ea5e-8fd1-494e-aeae-ba0d0c961b3f_1536x1024.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!_lZX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe8ea5e-8fd1-494e-aeae-ba0d0c961b3f_1536x1024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Cover: Strong reference vs Weak Reference</figcaption></figure></div><p>When working with Python (and many other languages), you often rely on the runtime to manage memory for you. 
Most of the time this works invisibly, but certain patterns such as objects that reference each other in cycles, long-lived caches, or subscriber lists can create memory leaks if not handled carefully.</p><p>This happens because every ordinary reference in Python is a strong reference, which means an object is kept alive as long as at least one such reference to it exists in the program. When used in cyclic data structures or in caches, these strong references can unnecessarily delay the deallocation of the objects they point to.</p><p>Weak references provide a way to refer to objects without preventing them from being garbage collected. They let you build caches that automatically empty, subscriber lists that clean themselves up, and other data structures that will not accidentally extend object lifetimes.</p><p>In this article we will explore what weak references are, why they matter, and how to use them in Python. We will start with a review of reference counting, look at its limitations, and then dive into weak references and their practical uses.</p><div><hr></div><h2>A Review of Reference Counting</h2><p>Many languages either use reference counting as a mechanism to manage runtime memory or provide first-class primitives for it. </p><p>In this scheme, every object has an associated reference count, which is the number of places it is being used. For example, when you create an object and assign it to a variable, it will have a reference count of 1. When you assign it to another variable or pass it to a function, its reference count goes up by 1.</p><p>Similarly, when a variable goes out of scope or a function call returns, the reference count of the object it referred to gets decremented. 
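</p><p>These rules can be observed with <code>sys.getrefcount</code> from the standard library. This is only a small sketch: the <code>Payload</code> class is a hypothetical placeholder, and the absolute numbers are an implementation detail (the call itself temporarily holds a reference to its argument), so only the differences between the readings matter.</p><pre><code>import sys


class Payload:
    """Hypothetical placeholder object whose reference count we observe."""


obj = Payload()
base = sys.getrefcount(obj)         # baseline reading for a single name

alias = obj                         # a second name bound to the same object
after_alias = sys.getrefcount(obj)

del alias                           # removing the name drops that reference
after_del = sys.getrefcount(obj)

print("increase after aliasing:", after_alias - base)
print("back to baseline after del:", after_del == base)
</code></pre><p>Binding a second name raises the count by one, and deleting that name brings it back to the baseline.</p><p>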
If the reference count of the object reaches 0, it gets deallocated or garbage collected.</p><p>CPython uses reference counting to manage the memory of its runtime, and other languages offer it as well. For example, reference-counted smart pointers such as <code>std::shared_ptr</code> in C++ or <code>Rc</code> in Rust maintain a reference count under the hood: the generated code increments and decrements the count as the pointers are copied and destroyed.</p><p><em>If you want to understand how CPython implements reference counting internally, you can check out my article on that topic:</em></p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;d6b010ff-833b-4c87-b257-d6c3dd451b8d&quot;,&quot;caption&quot;:&quot;This week we are diverting from AI and machine learning to discuss a more intense CS topic &#8212; memory management in Python. Memory management refers to the techniques used by the programming language runtime to allocate and free memory as programs execute. Understanding how memory management functions in a language is crucial to writing efficient and high&#8230;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How CPython Implements Reference Counting: Dissecting CPython Internals &quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. 
I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2023-08-16T17:57:20.848Z&quot;,&quot;cover_image&quot;:&quot;https://images.unsplash.com/photo-1624953587687-daf255b6b80a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwyfHxweXRob258ZW58MHx8fHwxNjkyMjU0ODk0fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/cpython-reference-counting-internals&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:135935087,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:35,&quot;comment_count&quot;:12,&quot;publication_id&quot;:1611829,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!lstI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><h2>Limitations of Reference Counting</h2><p>Reference counting works well for most cases, but it is not a complete solution. Its simplicity comes with trade&#8209;offs, and understanding these limitations helps motivate why Python also offers weak references.</p><p>One of those limitations is cyclic references. Cyclic references exist when objects hold references to each other in a cycle, e.g. in a graph data structure. 
But you can also end up creating cyclic references accidentally in complex systems. In such cases, the objects that are part of the cycle will never get freed until the cycle is broken. This is why CPython also implements a cycle-breaking garbage collector (GC) that runs periodically, scans the objects for cycles, and if it detects cycles that are no longer referenced from anywhere else, breaks them so that those objects can be freed.</p><p>Cyclic references can be problematic for performance because memory usage remains high until the GC runs, and the GC scan itself can be expensive (depending on the number of objects it needs to scan).</p><p>We can understand this with the help of an example. Consider the following code:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Sw9R!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa61bff9c-0723-4dee-b874-8a2d0f66dfc4_1373x720.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Sw9R!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa61bff9c-0723-4dee-b874-8a2d0f66dfc4_1373x720.png 424w, https://substackcdn.com/image/fetch/$s_!Sw9R!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa61bff9c-0723-4dee-b874-8a2d0f66dfc4_1373x720.png 848w, https://substackcdn.com/image/fetch/$s_!Sw9R!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa61bff9c-0723-4dee-b874-8a2d0f66dfc4_1373x720.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Sw9R!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa61bff9c-0723-4dee-b874-8a2d0f66dfc4_1373x720.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Sw9R!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa61bff9c-0723-4dee-b874-8a2d0f66dfc4_1373x720.png" width="1200" height="629.278951201748" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a61bff9c-0723-4dee-b874-8a2d0f66dfc4_1373x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:720,&quot;width&quot;:1373,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:160107,&quot;alt&quot;:&quot;import gc import sys   class MyNode:     def __init__(self, name: str):         self.name: str = name         self.next = None      def __del__(self):         print(f\&quot;{self.name} is being deleted\&quot;)  def print_node_objects():     obj_count = 0     for o in gc.get_objects():         if type(o) is MyNode:             obj_count += 1             print(f\&quot;{o.name} exists with referrers:                 {[n.name for n in gc.get_referrers(o) if type(n) is MyNode]}\&quot;)     if obj_count == 0:         print(\&quot;No MyNode objects found\&quot;)  def test1():     n1 = MyNode(\&quot;n1\&quot;)     n2 = MyNode(\&quot;n2\&quot;)     print(f\&quot;n1 refcount: {sys.getrefcount(n1)}\&quot;)     print(f\&quot;n2 refcount: {sys.getrefcount(n2)}\&quot;)   if __name__ == '__main__':     test1()     
print_node_objects()&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/172320360?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa61bff9c-0723-4dee-b874-8a2d0f66dfc4_1373x720.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="import gc import sys   class MyNode:     def __init__(self, name: str):         self.name: str = name         self.next = None      def __del__(self):         print(f&quot;{self.name} is being deleted&quot;)  def print_node_objects():     obj_count = 0     for o in gc.get_objects():         if type(o) is MyNode:             obj_count += 1             print(f&quot;{o.name} exists with referrers:                 {[n.name for n in gc.get_referrers(o) if type(n) is MyNode]}&quot;)     if obj_count == 0:         print(&quot;No MyNode objects found&quot;)  def test1():     n1 = MyNode(&quot;n1&quot;)     n2 = MyNode(&quot;n2&quot;)     print(f&quot;n1 refcount: {sys.getrefcount(n1)}&quot;)     print(f&quot;n2 refcount: {sys.getrefcount(n2)}&quot;)   if __name__ == '__main__':     test1()     print_node_objects()" title="import gc import sys   class MyNode:     def __init__(self, name: str):         self.name: str = name         self.next = None      def __del__(self):         print(f&quot;{self.name} is being deleted&quot;)  def print_node_objects():     obj_count = 0     for o in gc.get_objects():         if type(o) is MyNode:             obj_count += 1             print(f&quot;{o.name} exists with referrers:                 {[n.name for n in gc.get_referrers(o) if type(n) is MyNode]}&quot;)     if obj_count == 0:         print(&quot;No MyNode objects found&quot;)  def test1():     n1 = MyNode(&quot;n1&quot;)     n2 = MyNode(&quot;n2&quot;)     print(f&quot;n1 refcount: 
{sys.getrefcount(n1)}&quot;)     print(f&quot;n2 refcount: {sys.getrefcount(n2)}&quot;)   if __name__ == '__main__':     test1()     print_node_objects()" srcset="https://substackcdn.com/image/fetch/$s_!Sw9R!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa61bff9c-0723-4dee-b874-8a2d0f66dfc4_1373x720.png 424w, https://substackcdn.com/image/fetch/$s_!Sw9R!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa61bff9c-0723-4dee-b874-8a2d0f66dfc4_1373x720.png 848w, https://substackcdn.com/image/fetch/$s_!Sw9R!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa61bff9c-0723-4dee-b874-8a2d0f66dfc4_1373x720.png 1272w, https://substackcdn.com/image/fetch/$s_!Sw9R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa61bff9c-0723-4dee-b874-8a2d0f66dfc4_1373x720.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Demonstration of how reference counting works in Python</figcaption></figure></div><p></p><p>Let&#8217;s break it down:</p><ul><li><p>The <code>MyNode</code> class implements a linked list node with a next field.</p></li><li><p><code>print_node_objects</code> is a utility function. It finds all the <code>MyNode</code> objects that are currently alive and then prints their referrers, i.e., who is holding a reference to them.</p><ul><li><p>It uses <code>gc.get_objects()</code> to get the list of all the currently alive objects in the Python interpreter and filters it down by checking for their type and selecting only <code>MyNode</code> type objects.</p></li><li><p>It finds the referrers to an object by using the <code>gc.get_referrers()</code> method which returns a list of referrer objects. We are filtering this list by type because during the call, the gc module itself becomes a referrer and we want to filter it away.</p></li></ul></li><li><p>In the main function we call the <code>test1()</code> function that creates two <code>MyNode</code> objects, prints their reference counts and returns. After returning from <code>test1</code>, we call <code>print_node_objects()</code> to see if there are any <code>MyNode</code> type objects that are still alive.</p></li></ul><p>If you run this program, you should see an output like the following:</p><pre><code>&#10140; uv run --python 3.13 --  cycles.py
n1 refcount: 2
n2 refcount: 2
n1 is being deleted
n2 is being deleted
No MyNode objects found
</code></pre><p>This is pretty much the expected output, but let&#8217;s spend a moment to ensure we don&#8217;t miss anything.</p><ul><li><p>We see that the reference count for both <code>n1</code> and <code>n2</code> is 2. You might expect it to be 1, but it is 2 because passing the object to <code>sys.getrefcount</code> creates a temporary reference to it for the duration of the call.</p></li><li><p>We see that the <code>__del__</code> method of both objects gets called and prints a message. This happens because <code>n1</code> and <code>n2</code> are local variables inside <code>test1()</code>, and when it returns, its stack frame gets destroyed, which results in the reference counts of all of its local objects (parameters and locally created variables) being decremented. In this case, because <code>n1</code> and <code>n2</code> reached reference count 0, they were deallocated and their <code>__del__</code> methods were called.</p></li><li><p>Finally, back in the main block, when <code>print_node_objects()</code> is called, we see that it does not find any <code>MyNode</code> objects on the heap that are still alive.</p></li></ul><p>Next, we can do another test that creates a cycle between <code>n1</code> and <code>n2</code> and see that the objects stay alive after the return from the test function. 
The following figure shows the updated code, where I&#8217;ve added a new function <code>test2()</code> and call it from the main block.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8ocV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04a29c40-1783-4914-8226-a15496232c18_1081x920.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8ocV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04a29c40-1783-4914-8226-a15496232c18_1081x920.png 424w, https://substackcdn.com/image/fetch/$s_!8ocV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04a29c40-1783-4914-8226-a15496232c18_1081x920.png 848w, https://substackcdn.com/image/fetch/$s_!8ocV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04a29c40-1783-4914-8226-a15496232c18_1081x920.png 1272w, https://substackcdn.com/image/fetch/$s_!8ocV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04a29c40-1783-4914-8226-a15496232c18_1081x920.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8ocV!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04a29c40-1783-4914-8226-a15496232c18_1081x920.png" width="1200" height="1021.2765957446809" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/04a29c40-1783-4914-8226-a15496232c18_1081x920.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:920,&quot;width&quot;:1081,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:135513,&quot;alt&quot;:&quot;#!/usr/bin/env python  import gc import sys  class MyNode:     def __init__(self, name: str):         self.name: str = name         self.next = None      def __del__(self):         print(f\&quot;{self.name} is being deleted\&quot;)  def print_node_objects():     obj_count = 0     for o in gc.get_objects():         if type(o) is MyNode:             obj_count += 1             print(f\&quot;{o.name} exists with referrers: {[n.name for n in gc.get_referrers(o) if type(n) is MyNode]}\&quot;)     if obj_count == 0:         print(\&quot;No MyNode objects found\&quot;)  def test2():     n1 = MyNode(\&quot;n1\&quot;)     n2 = MyNode(\&quot;n2\&quot;)      n1.next = n2     n2.next = n1     print(f\&quot;n1 refcount: {sys.getrefcount(n1)}\&quot;)     print(f\&quot;n2 refcount: {sys.getrefcount(n2)}\&quot;)  def test1():     n1 = MyNode(\&quot;n1\&quot;)     n2 = MyNode(\&quot;n2\&quot;)     print(f\&quot;n1 refcount: {sys.getrefcount(n1)}\&quot;)     print(f\&quot;n2 refcount: {sys.getrefcount(n2)}\&quot;)   if __name__ == '__main__':     test1()     print_node_objects()     print(\&quot;---------------------\&quot;)     test2()     print_node_objects()&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/172320360?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04a29c40-1783-4914-8226-a15496232c18_1081x920.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" 
class="sizing-large" alt="#!/usr/bin/env python  import gc import sys  class MyNode:     def __init__(self, name: str):         self.name: str = name         self.next = None      def __del__(self):         print(f&quot;{self.name} is being deleted&quot;)  def print_node_objects():     obj_count = 0     for o in gc.get_objects():         if type(o) is MyNode:             obj_count += 1             print(f&quot;{o.name} exists with referrers: {[n.name for n in gc.get_referrers(o) if type(n) is MyNode]}&quot;)     if obj_count == 0:         print(&quot;No MyNode objects found&quot;)  def test2():     n1 = MyNode(&quot;n1&quot;)     n2 = MyNode(&quot;n2&quot;)      n1.next = n2     n2.next = n1     print(f&quot;n1 refcount: {sys.getrefcount(n1)}&quot;)     print(f&quot;n2 refcount: {sys.getrefcount(n2)}&quot;)  def test1():     n1 = MyNode(&quot;n1&quot;)     n2 = MyNode(&quot;n2&quot;)     print(f&quot;n1 refcount: {sys.getrefcount(n1)}&quot;)     print(f&quot;n2 refcount: {sys.getrefcount(n2)}&quot;)   if __name__ == '__main__':     test1()     print_node_objects()     print(&quot;---------------------&quot;)     test2()     print_node_objects()" title="#!/usr/bin/env python  import gc import sys  class MyNode:     def __init__(self, name: str):         self.name: str = name         self.next = None      def __del__(self):         print(f&quot;{self.name} is being deleted&quot;)  def print_node_objects():     obj_count = 0     for o in gc.get_objects():         if type(o) is MyNode:             obj_count += 1             print(f&quot;{o.name} exists with referrers: {[n.name for n in gc.get_referrers(o) if type(n) is MyNode]}&quot;)     if obj_count == 0:         print(&quot;No MyNode objects found&quot;)  def test2():     n1 = MyNode(&quot;n1&quot;)     n2 = MyNode(&quot;n2&quot;)      n1.next = n2     n2.next = n1     print(f&quot;n1 refcount: {sys.getrefcount(n1)}&quot;)     print(f&quot;n2 refcount: {sys.getrefcount(n2)}&quot;)  def test1():     n1 = 
MyNode(&quot;n1&quot;)     n2 = MyNode(&quot;n2&quot;)     print(f&quot;n1 refcount: {sys.getrefcount(n1)}&quot;)     print(f&quot;n2 refcount: {sys.getrefcount(n2)}&quot;)   if __name__ == '__main__':     test1()     print_node_objects()     print(&quot;---------------------&quot;)     test2()     print_node_objects()" srcset="https://substackcdn.com/image/fetch/$s_!8ocV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04a29c40-1783-4914-8226-a15496232c18_1081x920.png 424w, https://substackcdn.com/image/fetch/$s_!8ocV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04a29c40-1783-4914-8226-a15496232c18_1081x920.png 848w, https://substackcdn.com/image/fetch/$s_!8ocV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04a29c40-1783-4914-8226-a15496232c18_1081x920.png 1272w, https://substackcdn.com/image/fetch/$s_!8ocV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04a29c40-1783-4914-8226-a15496232c18_1081x920.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 
11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Demonstration of cyclic references. In the test2() function, we create a cycle between n1 and n2 and see that they are still alive even after test2() returns.</figcaption></figure></div><p>If we run this program, we should see the following output:</p><pre><code>&#10140; uv run --python 3.13 --  cycles.py
n1 refcount: 2
n2 refcount: 2
n1 is being deleted
n2 is being deleted
No MyNode objects found
---------------------
n1 refcount: 3
n2 refcount: 3
n1 exists with referrers: ['n2']
n2 exists with referrers: ['n1']
n1 is being deleted
n2 is being deleted
</code></pre><p>Let&#8217;s focus on the output after the call to <code>test2()</code>.</p><ul><li><p>We see that in <code>test2()</code>, the reference count for <code>n1</code> and <code>n2</code> is 3, one higher than what it was in <code>test1()</code>. This is due to <code>n1.next</code> creating a reference to <code>n2</code> and <code>n2.next</code> creating a reference to <code>n1</code>.</p></li><li><p>We also see that when <code>test2()</code> returns, the <code>__del__</code> method of <code>n1</code> and <code>n2</code> is not called, which means that those objects are not deallocated and are still alive. This happens because, during the return, the interpreter decrements their reference counts, but this time the counts do not reach 0: each node is still referenced by the other node&#8217;s <code>next</code> field.</p></li><li><p>After returning from <code>test2()</code>, when we call <code>print_node_objects()</code>, it tells us that the <code>MyNode</code> objects we created for <code>n1</code> and <code>n2</code> are still alive. We can also see that they are alive because they hold cyclic references to each other.</p></li><li><p><code>n1</code> and <code>n2</code> finally get destroyed as the program ends because the CPython interpreter runs the GC before shutting down.</p></li></ul><p>To prevent such cyclic references from leaking memory, CPython includes a garbage collector that runs periodically, detects cycles that are no longer referenced from anywhere else, and breaks them so that the objects that are part of the cycle can get deallocated. 
You can verify it yourself by inserting a <code>gc.collect()</code> call after the call to <code>test2()</code> in the above program.</p><p><em>If you want to understand how the CPython garbage collector detects and breaks cycles, read my article on its internals:</em></p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;1fbc716c-7adb-4138-b552-ca1b125dab36&quot;,&quot;caption&quot;:&quot;We&#8217;ve been talking about CPython internals and in the last article I went quite deep in CPython&#8217;s runtime. One of the crucial services that the runtime provides is that of managing a program&#8217;s memory during execution.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;CPython Garbage Collection: The Internal Mechanics and Algorithms&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. 
I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2024-06-11T13:23:25.751Z&quot;,&quot;cover_image&quot;:&quot;https://images.unsplash.com/photo-1503596476-1c12a8ba09a9?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHxnYXJiYWdlJTIwY29sbGVjdGlvbnxlbnwwfHx8fDE3MTgxMDU0OTh8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/cpython-garbage-collection-internals&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:144615668,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:41,&quot;comment_count&quot;:1,&quot;publication_id&quot;:1611829,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!lstI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p></p><p>However, there are other ways to avoid such pitfalls of reference counting, and weak references are one of them. Let&#8217;s understand what they are and how they work.</p><div><hr></div><h2>Understanding Weak References</h2><p>Weak references sit at the opposite end of the spectrum from strong references. 
A weak reference does not increase the reference count of the underlying object, so it lets you use an object without prolonging its lifetime.</p><p>When the object&#8217;s reference count goes to 0, it can get deallocated even if there are weak references to it that are still being used. Naturally, this means that whenever we use a weak reference, we must first check whether the underlying object is still alive.</p><p>In Python, to create weak references, we need to use the <code>weakref.ref()</code> function from the <a href="https://docs.python.org/3/library/weakref.html">weakref module</a> and pass the object for which we want to create a weak reference. For example:</p><pre><code>n1_weakref = weakref.ref(n1)</code></pre><p><code>weakref.ref()</code> creates a weak reference to the given object and returns a callable. To access the underlying object, we need to invoke this callable every time. If the object is still alive, the call returns the object; otherwise it returns <code>None</code>. It is safest to call it once and keep the result in a local variable, because that temporary strong reference keeps the object alive while we use it. For example:</p><pre><code>node = n1_weakref()
if node is not None:
  print(f"name: {node.name}")
else:
  print("n1 no longer exists")</code></pre><p>The following figure shows a full example of creating a weak reference and accessing it in our running linked list example.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CW5b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb53de04e-14c5-4829-bfd0-96b427ce2113_1412x1109.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CW5b!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb53de04e-14c5-4829-bfd0-96b427ce2113_1412x1109.png 424w, https://substackcdn.com/image/fetch/$s_!CW5b!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb53de04e-14c5-4829-bfd0-96b427ce2113_1412x1109.png 848w, https://substackcdn.com/image/fetch/$s_!CW5b!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb53de04e-14c5-4829-bfd0-96b427ce2113_1412x1109.png 1272w, https://substackcdn.com/image/fetch/$s_!CW5b!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb53de04e-14c5-4829-bfd0-96b427ce2113_1412x1109.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CW5b!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb53de04e-14c5-4829-bfd0-96b427ce2113_1412x1109.png" width="1200" height="942.4929178470255" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b53de04e-14c5-4829-bfd0-96b427ce2113_1412x1109.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1109,&quot;width&quot;:1412,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:209868,&quot;alt&quot;:&quot;#!/usr/bin/env python  import gc import sys import weakref  class MyNode:     def __init__(self, name: str):         self.name: str = name         self.next = None      def __del__(self):         print(f\&quot;{self.name} is being deleted\&quot;)  def print_node_objects():     obj_count = 0     for o in gc.get_objects():         if type(o) is MyNode:             obj_count += 1             print(f\&quot;{o.name} exists with referrers: {[n.name for n in gc.get_referrers(o) if type(n) is MyNode]}\&quot;)     if obj_count == 0:         print(\&quot;No MyNode objects found\&quot;)  def weakref_demo():     n1 = MyNode(\&quot;n1\&quot;)     print(f\&quot;n1 refcount: {sys.getrefcount(n1)}\&quot;)     # use the ref function to create a weak reference to n1     # it gives us a callable that when called will try     # to access the underying object     n1_weakref = weakref.ref(n1)       # notice how n1's reference count remains unchanged     print(f\&quot;n1 refcount: {sys.getrefcount(n1)}\&quot;)      # to access n1 using weakref we need to call it     if n1_weakref():         print(f\&quot;n1's name: {n1_weakref().name}\&quot;)      # let's delete n1 and see if weakref still works     del n1     if n1_weakref():         print(f\&quot;n1's name: {n1_weakref()}\&quot;)     else:         print(\&quot;n1 no longer exists\&quot;)     if __name__ == '__main__':     weakref_demo()     print(\&quot;---------------------\&quot;)     
print_node_objects()&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/172320360?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb53de04e-14c5-4829-bfd0-96b427ce2113_1412x1109.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="#!/usr/bin/env python  import gc import sys import weakref  class MyNode:     def __init__(self, name: str):         self.name: str = name         self.next = None      def __del__(self):         print(f&quot;{self.name} is being deleted&quot;)  def print_node_objects():     obj_count = 0     for o in gc.get_objects():         if type(o) is MyNode:             obj_count += 1             print(f&quot;{o.name} exists with referrers: {[n.name for n in gc.get_referrers(o) if type(n) is MyNode]}&quot;)     if obj_count == 0:         print(&quot;No MyNode objects found&quot;)  def weakref_demo():     n1 = MyNode(&quot;n1&quot;)     print(f&quot;n1 refcount: {sys.getrefcount(n1)}&quot;)     # use the ref function to create a weak reference to n1     # it gives us a callable that when called will try     # to access the underying object     n1_weakref = weakref.ref(n1)       # notice how n1's reference count remains unchanged     print(f&quot;n1 refcount: {sys.getrefcount(n1)}&quot;)      # to access n1 using weakref we need to call it     if n1_weakref():         print(f&quot;n1's name: {n1_weakref().name}&quot;)      # let's delete n1 and see if weakref still works     del n1     if n1_weakref():         print(f&quot;n1's name: {n1_weakref()}&quot;)     else:         print(&quot;n1 no longer exists&quot;)     if __name__ == '__main__':     weakref_demo()     print(&quot;---------------------&quot;)     print_node_objects()" title="#!/usr/bin/env python  import gc import sys 
import weakref  class MyNode:     def __init__(self, name: str):         self.name: str = name         self.next = None      def __del__(self):         print(f&quot;{self.name} is being deleted&quot;)  def print_node_objects():     obj_count = 0     for o in gc.get_objects():         if type(o) is MyNode:             obj_count += 1             print(f&quot;{o.name} exists with referrers: {[n.name for n in gc.get_referrers(o) if type(n) is MyNode]}&quot;)     if obj_count == 0:         print(&quot;No MyNode objects found&quot;)  def weakref_demo():     n1 = MyNode(&quot;n1&quot;)     print(f&quot;n1 refcount: {sys.getrefcount(n1)}&quot;)     # use the ref function to create a weak reference to n1     # it gives us a callable that when called will try     # to access the underying object     n1_weakref = weakref.ref(n1)       # notice how n1's reference count remains unchanged     print(f&quot;n1 refcount: {sys.getrefcount(n1)}&quot;)      # to access n1 using weakref we need to call it     if n1_weakref():         print(f&quot;n1's name: {n1_weakref().name}&quot;)      # let's delete n1 and see if weakref still works     del n1     if n1_weakref():         print(f&quot;n1's name: {n1_weakref()}&quot;)     else:         print(&quot;n1 no longer exists&quot;)     if __name__ == '__main__':     weakref_demo()     print(&quot;---------------------&quot;)     print_node_objects()" srcset="https://substackcdn.com/image/fetch/$s_!CW5b!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb53de04e-14c5-4829-bfd0-96b427ce2113_1412x1109.png 424w, https://substackcdn.com/image/fetch/$s_!CW5b!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb53de04e-14c5-4829-bfd0-96b427ce2113_1412x1109.png 848w, 
https://substackcdn.com/image/fetch/$s_!CW5b!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb53de04e-14c5-4829-bfd0-96b427ce2113_1412x1109.png 1272w, https://substackcdn.com/image/fetch/$s_!CW5b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb53de04e-14c5-4829-bfd0-96b427ce2113_1412x1109.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Demonstration of creating weak reference and using 
it</figcaption></figure></div><p><strong>Output:</strong></p><pre><code>&#10140; uv run --python 3.13 --  weakref_cycles.py
n1 refcount: 2
n1 refcount: 2
n1's name: n1
n1 is being deleted
n1 no longer exists
---------------------
No MyNode objects found
</code></pre><p>From the output we can confirm a few things:</p><ul><li><p>Creating a weak reference does not increase the object&#8217;s reference count.</p></li><li><p>A weak reference does not prevent the object from being deallocated when its reference count drops to 0 (in the example, we deleted <code>n1</code>, and after that we could no longer access it through the weak reference).</p></li></ul><p>I leave the problem of fixing the cyclic reference that we created in <code>test2()</code> as an exercise for you.</p><div><hr></div><h2>Other Use Cases of Weak References</h2><p>So far we&#8217;ve seen weak references as a tool for avoiding cycles, but their utility goes well beyond that. The <code>weakref</code> module also provides ready-made containers built on top of weak references. These containers, <code>WeakValueDictionary</code> and <code>WeakSet</code>, help you manage auxiliary data structures that should not extend the lifetimes of their contents. 
They solve practical problems such as caching, registries, and subscriber lists, where automatic cleanup is not just convenient but essential for avoiding leaks.</p><h3>WeakValueDictionary</h3><p>The <code>weakref</code> module provides <code>WeakValueDictionary</code>, which looks and behaves like a normal dictionary but with an important twist: its values are held only through weak references. If a value is no longer strongly referenced anywhere else, the dictionary entry disappears automatically.</p><p>This makes <code>WeakValueDictionary</code> a natural fit for <strong>caching and memoization</strong>. Imagine you compute expensive results or load large data structures and want to reuse them if they are still in memory. At the same time, you don&#8217;t want the cache itself to keep them alive forever. A <code>WeakValueDictionary</code> strikes that balance: it holds onto results <em>only as long as the rest of the program does</em>.</p><p>Another classic application is <strong>object interning</strong> or registries. For example, you may want to ensure there is only one canonical object representing a resource (like a symbol table entry, database connection, or parsed schema). By using a <code>WeakValueDictionary</code>, you avoid artificially extending the lifetimes of those resources.</p><p>Here&#8217;s a simple illustration:</p><pre><code>import weakref

class Data:
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return f"Data({self.name})"

cache = weakref.WeakValueDictionary()
obj = Data("expensive_result")
cache["key"] = obj

print("Before deletion:", dict(cache))

# Drop the strong reference
obj = None

print("After deletion:", dict(cache))</code></pre><p><strong>Output:</strong></p><pre><code>Before deletion: {'key': Data(expensive_result)}
After deletion: {}</code></pre><p>Notice how the cache entry vanishes automatically once the last strong reference goes away. There is no need for manual cleanup. Under the hood, this is implemented with weakref callbacks&#8212;the same mechanism we&#8217;ll see in the callback section.</p><h3>WeakSet</h3><p>Another container provided by the <code>weakref</code> module is <code>WeakSet</code>. This is similar to a regular <code>set</code>, except that it holds weak references to its elements. If an object is garbage collected, it will automatically vanish from the set.</p><p>One scenario where this is very handy is when you want to keep track of <em>subscribers</em>, <em>observers</em>, or <em>listeners</em>. These are objects that register interest in events produced by another object (often called the <em>publisher</em>). For instance:</p><ul><li><p><strong>GUI frameworks</strong>: widgets listen to events such as theme changes or window resizes.</p></li><li><p><strong>Event buses</strong>: services subscribe to log events, metrics, or domain events.</p></li><li><p><strong>Plugin systems</strong>: plugins register callbacks at load time to respond to hooks.</p></li><li><p><strong>Background services</strong>: transient sessions (e.g., WebSocket connections) listen for updates from a long&#8209;lived manager.</p></li></ul><p>In all these cases, subscribers are often short&#8209;lived, while the publisher lives much longer. Using a regular <code>set</code> to hold them risks memory leaks, because a strong reference in the set will keep the subscriber alive even when the rest of the program has forgotten it. With a <code>WeakSet</code>, the garbage collector automatically removes subscribers that are no longer strongly referenced anywhere else, so you don&#8217;t need explicit unsubscribe logic in every shutdown path.</p><p>Here&#8217;s a simple example:</p><pre><code>import weakref

class Listener:
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return f"Listener({self.name})"

listeners = weakref.WeakSet()

l1 = Listener("A")
l2 = Listener("B")
listeners.add(l1)
listeners.add(l2)

print("Before deletion:", list(listeners))

# Remove one listener
l1 = None
import gc; gc.collect()

print("After deletion:", list(listeners))</code></pre><p><strong>Output:</strong></p><pre><code>Before deletion: [Listener(A), Listener(B)]
After deletion: [Listener(B)]</code></pre><p>This pattern is often extended into a publisher&#8211;subscriber model:</p><pre><code>class Publisher:
    def __init__(self):
        self._subs = weakref.WeakSet()
    def subscribe(self, sub):
        self._subs.add(sub)
    def notify(self, payload):
        for s in list(self._subs):
            s.handle(payload)

class Subscriber:
    def __init__(self, name):
        self.name = name
    def handle(self, payload):
        print(self.name, "got:", payload)
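
# Hedged aside (not part of the original example): a subscriber class that
# declares __slots__ must list "__weakref__" explicitly; otherwise its
# instances cannot be weakly referenced, and adding them to a WeakSet
# raises TypeError. SlottedSubscriber is a hypothetical name used only
# for illustration.
class SlottedSubscriber:
    __slots__ = ("name", "__weakref__")
    def __init__(self, name):
        self.name = name
    def handle(self, payload):
        print(self.name, "got:", payload)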

pub = Publisher()
sub = Subscriber("one")
pub.subscribe(sub)

pub.notify({"event": 1})  # delivered
sub = None                  # drop last strong ref
import gc; gc.collect()

pub.notify({"event": 2})  # nothing printed; WeakSet cleaned itself</code></pre><p>Using <code>WeakSet</code> here avoids leaks and simplifies lifecycle management. A caveat is that only weak&#8209;referenceable objects can be added: instances of user&#8209;defined classes qualify by default, but instances of built&#8209;in types such as <code>int</code> or <code>tuple</code> do not support weak references. If your class uses <code>__slots__</code>, include <code>__weakref__</code> in the slots to allow weak references.</p><h3>Callbacks on Weak References</h3><p>Another useful feature of <code>weakref.ref</code> is the ability to attach a <strong>callback</strong>. A callback is a function that gets invoked automatically when the referent object is about to be finalized. This can be handy if you want to clean up auxiliary data structures or release resources when an object goes away.</p><pre><code>import weakref

class Resource:
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return f"Resource({self.name})"

def on_finalize(wr):
    print("Resource has been garbage collected:", wr)

obj = Resource("temp")
wr = weakref.ref(obj, on_finalize)

print("Created weak reference:", wr)

# Drop strong reference
obj = None

# Force GC for demo purposes
import gc; gc.collect()</code></pre><p><strong>Output:</strong></p><pre><code>Created weak reference: &lt;weakref at 0x75f6773870b0; to 'Resource' at 0x75f677c4ee40&gt;
Resource has been garbage collected: &lt;weakref at 0x75f6773870b0; dead&gt;
</code></pre><p>Here, the <code>on_finalize</code> callback is called once the <code>Resource</code> instance is about to be collected. As the output shows, the weak reference is already dead by the time the callback receives it, so the callback cannot use it to reach the referent. This pattern is useful when you want to implement custom cleanup logic tied to an object&#8217;s lifecycle.</p><p>It&#8217;s also worth noting that containers like <code>WeakValueDictionary</code> and <code>WeakSet</code> use this same mechanism internally: they attach callbacks to their weak references so that entries are automatically removed when the referent objects are finalized.</p><h2>Conclusion</h2><p>Weak references are not a tool you&#8217;ll reach for every day, but when you need them they solve very real problems. At the lowest level, <code>weakref.ref</code> lets you point to an object without affecting its lifetime, and you can even attach a callback to run cleanup code at the moment it is collected. Building on that primitive, Python&#8217;s <code>WeakValueDictionary</code> and <code>WeakSet</code> give you higher-level containers for caches, registries, and subscriber lists that automatically clean themselves up when their contents go away.</p><p>To summarize the differences:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jaFp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fda1c6b-43e6-4308-bd64-95fb9f503848_1367x212.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jaFp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fda1c6b-43e6-4308-bd64-95fb9f503848_1367x212.png 424w, 
https://substackcdn.com/image/fetch/$s_!jaFp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fda1c6b-43e6-4308-bd64-95fb9f503848_1367x212.png 848w, https://substackcdn.com/image/fetch/$s_!jaFp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fda1c6b-43e6-4308-bd64-95fb9f503848_1367x212.png 1272w, https://substackcdn.com/image/fetch/$s_!jaFp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fda1c6b-43e6-4308-bd64-95fb9f503848_1367x212.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jaFp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fda1c6b-43e6-4308-bd64-95fb9f503848_1367x212.png" width="1367" height="212" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8fda1c6b-43e6-4308-bd64-95fb9f503848_1367x212.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:212,&quot;width&quot;:1367,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43498,&quot;alt&quot;:&quot;A summary of the key APIs from the weakref module in Python and in which situation to use them&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/172320360?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fda1c6b-43e6-4308-bd64-95fb9f503848_1367x212.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A summary of the key APIs from the weakref module in Python and in which situation to use them" title="A summary of the 
key APIs from the weakref module in Python and in which situation to use them" srcset="https://substackcdn.com/image/fetch/$s_!jaFp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fda1c6b-43e6-4308-bd64-95fb9f503848_1367x212.png 424w, https://substackcdn.com/image/fetch/$s_!jaFp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fda1c6b-43e6-4308-bd64-95fb9f503848_1367x212.png 848w, https://substackcdn.com/image/fetch/$s_!jaFp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fda1c6b-43e6-4308-bd64-95fb9f503848_1367x212.png 1272w, https://substackcdn.com/image/fetch/$s_!jaFp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fda1c6b-43e6-4308-bd64-95fb9f503848_1367x212.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">A summary of the key APIs from the weakref module in Python and in which situation to use them</figcaption></figure></div><p>Together, these features make it possible to build memory&#8209;friendly systems that avoid leaks, reduce bookkeeping, and respect the natural lifetimes of your objects. 
Understanding weak references and knowing when to apply them will help you write code that is both safer and more efficient.</p><h2>Further Reading</h2><ul><li><p><a href="https://docs.python.org/3/library/weakref.html">Python documentation on </a><code>weakref</code></p></li><li><p><a href="https://docs.python.org/3/library/gc.html">Python garbage collector documentation</a></p></li><li><p>&#8220;Fluent Python&#8221; by Luciano Ramalho &#8211; includes in-depth coverage of weak references and how to use them</p></li></ul><div><hr></div>]]></content:encoded></item><item><title><![CDATA[Compiling Python to Run Anywhere]]></title><description><![CDATA[A guest post on building a Python compiler that generates optimized kernels while preserving the language&#8217;s simplicity.]]></description><link>https://blog.codingconfessions.com/p/compiling-python-to-run-anywhere</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/compiling-python-to-run-anywhere</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Tue, 23 Sep 2025 17:29:38 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8897f5b3-cb0b-488f-865d-33b682e1c282_1536x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Foreword</h2><p>A recurring theme of this newsletter is going under the hood: how interpreters, compilers, and runtimes actually work, and what performance trade&#8209;offs they force on us. 
Python is a perfect case study: it&#8217;s beloved for its simplicity, but that same simplicity often means poor performance when the workloads get serious.</p><p>That&#8217;s why I&#8217;m really excited to share this guest post by <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Yusuf Olokoba&quot;,&quot;id&quot;:103669569,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!8ju6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24cbc42d-ea2e-4535-af53-ade33bca9fbb_1024x1024.png&quot;,&quot;uuid&quot;:&quot;ff103787-1689-481f-9f11-4180909a0eff&quot;}" data-component-name="MentionToDOM">Yusuf Olokoba</span>, founder of <a href="https://muna.ai">Muna</a>. In this piece, he looks at how Python could be pushed beyond its usual limits of speed and portability, laying out a compiler that turns ordinary code into fast, portable executables.</p><p>Instead of building another JIT or rewriting everything in C++, his approach generates optimized kernels while keeping the Python source unchanged. This ties directly to themes I&#8217;ve written about before, such as CPython internals and performance engineering. Those pieces showed why understanding systems at the lowest level matters. Yusuf&#8217;s work demonstrates the payoff of that mindset: the ability to design and build new systems on top of that knowledge.</p><div><hr></div><h2>Introduction</h2><p>I first met Abhinav at the start of 2024, while trying to learn more about how the Python interpreter worked under the hood. I had reached out to Abhinav with a singular goal in mind: to build something that could compile pristine Python code into cross-platform machine code.</p><p>This idea has been attempted in many forms before: runtimes (<a href="https://www.jython.org/">Jython</a>, <a href="https://github.com/RustPython/RustPython">RustPython</a>), DSLs (<a href="https://numba.pydata.org/">Numba</a>, <a href="https://pytorch.org/">PyTorch</a>), and even entirely new programming languages (<a href="https://www.modular.com/mojo">Mojo</a>). But for reasons we will explore later in this article, we needed something that could:</p><ol><li><p>Compile Python entirely ahead-of-time, with no modifications.</p></li><li><p>Run without a Python interpreter, or any other interpreter.</p></li><li><p>Run with minimal overhead compared to a raw C or C++ program.</p></li><li><p>Most importantly, run anywhere&#8212;server, desktop, mobile, and web.</p></li></ol><p>In this article, I will walk through how this seemingly crazy idea came about, how we began building a solution, how AI happened to be the missing piece, and how we&#8217;ve grown to serve thousands of unique devices each month with these compiled Python functions.</p><div><hr></div><h2><strong>Containers Are the Wrong Way to Distribute AI</strong></h2><p>I got my start in AI research around 2018, back when we called it &#8220;deep learning&#8221;. 
I had taken a year off from college and was coming off my first startup experience as co-founder of a venture-backed proptech startup that would later get acquired. One very interesting problem I had encountered in this journey was image editing for residential listings. Each month, a real estate photographer would outsource thousands of photos of homes to be hand-edited in Photoshop and Lightroom, before being posted on the regional MLS or on Zillow.</p><p>I teamed up with an old friend and we set out to build a fully automated image editor, using a new class of vision AI models called Generative Adversarial Networks (GANs). We would train our custom model architectures on our datasets, then test rigorously to ensure that the models worked correctly. But when it came time to get these AI models into the hands of our design partners, we simply got stuck. I spent the majority of my time trying to get our models into something we could distribute very easily. But after months of wrangling with Dockerfiles and third-party services, it became crystal clear to me: <strong>containers are the wrong unit of distribution for AI workloads</strong>.</p><p>To understand why, we need to look into the container. <a href="https://www.redhat.com/en/topics/containers/whats-a-linux-container">Containers</a> are simply self-contained Linux filesystems with runtime isolation and resource management. 
So when deploying our AI model as a container, we would package up the inference code, the model weights, all the Python package dependencies, the Python interpreter itself, and other required software into what was effectively a snapshot of a full Linux operating system.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GnTf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb880164d-5b82-42d1-a43f-28f091c21275_3937x2835.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GnTf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb880164d-5b82-42d1-a43f-28f091c21275_3937x2835.png 424w, https://substackcdn.com/image/fetch/$s_!GnTf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb880164d-5b82-42d1-a43f-28f091c21275_3937x2835.png 848w, https://substackcdn.com/image/fetch/$s_!GnTf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb880164d-5b82-42d1-a43f-28f091c21275_3937x2835.png 1272w, https://substackcdn.com/image/fetch/$s_!GnTf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb880164d-5b82-42d1-a43f-28f091c21275_3937x2835.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GnTf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb880164d-5b82-42d1-a43f-28f091c21275_3937x2835.png" width="1456" height="1048" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b880164d-5b82-42d1-a43f-28f091c21275_3937x2835.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4272357,&quot;alt&quot;:&quot;Distributing AI inference as self-contained executables as opposed to doing so as a container.&quot;,&quot;title&quot;:&quot;Distributing AI inference as self-contained executables as opposed to doing so as a container.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.muna.ai/i/171980795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb880164d-5b82-42d1-a43f-28f091c21275_3937x2835.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Distributing AI inference as self-contained executables as opposed to doing so as a container." title="Distributing AI inference as self-contained executables as opposed to doing so as a container." 
srcset="https://substackcdn.com/image/fetch/$s_!GnTf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb880164d-5b82-42d1-a43f-28f091c21275_3937x2835.png 424w, https://substackcdn.com/image/fetch/$s_!GnTf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb880164d-5b82-42d1-a43f-28f091c21275_3937x2835.png 848w, https://substackcdn.com/image/fetch/$s_!GnTf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb880164d-5b82-42d1-a43f-28f091c21275_3937x2835.png 1272w, https://substackcdn.com/image/fetch/$s_!GnTf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb880164d-5b82-42d1-a43f-28f091c21275_3937x2835.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">AI is better distributed in self-contained executables as opposed to containers.</figcaption></figure></div><p>But what if instead of making a self-contained operating system, we made a self-contained executable that ran only our AI model and nothing else? The benefits here would be significant: We could ship much smaller containers that started up much faster, because we wouldn&#8217;t have to include unnecessary Python packages, the Python interpreter itself, or any of the other unnecessary cruft that gets bundled into the container. But even more importantly, not only could we run these executables on our Linux servers&#8212;<em>we could run them anywhere</em>.</p><div><hr></div><h2><strong>Arm64, Apple, and Unity: How It All Began</strong></h2><p>I started programming at the age of eleven, thanks to my dad who vehemently refused to buy me a PlayStation 2 out of fear that my grades would drop. Out of an extreme stubbornness, inherited from him and my mom, I had decided that if he was not going to buy me a game console, then I would simply build the games myself<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. I was lucky enough to find a game engine that was intuitive, allowed developers to build once and deploy everywhere, and most importantly, was free to use: Unity Engine.</p><p>In late 2013 Apple debuted the iPhone 5S, its first device featuring the relatively new <code>armv8-a</code> instruction set architecture. Unlike prior devices, this was a 64-bit architecture running on ARM. 
With it, apps could address much more memory and benefit from a myriad of performance gains. As such, Apple quickly mandated that all new apps be compiled for <code>arm64</code>.</p><p>Unity, with its massive developer ecosystem, was thrown into a tailspin. To understand why, we need some context on how Unity works: because Unity is a game engine, objects within the game can be scripted to have custom behaviors. C# was Unity&#8217;s chosen scripting language for these behaviors. But C# compiles to intermediate bytecode rather than native object code, so it needs a virtual machine to execute at runtime (sound familiar?). Unity used <a href="https://www.mono-project.com/">Mono</a> for this purpose, but Mono did not support <code>arm64</code>.</p><p>Unity embarked on a journey to build what I still consider to be its greatest engineering feat: <a href="https://unity.com/blog/engine-platform/an-introduction-to-ilcpp-internals">IL2CPP</a>. As its name implies, the IL2CPP compiler would take in Common Intermediate Language bytecode (i.e., the intermediate representation generated by the C# compiler) and then emit equivalent C++ source code. 
Once you had C++ source code, you could compile that code to run just about anywhere: from Nvidia GPUs and WebAssembly to Apple Silicon and everything in between.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-aNF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc09e1ee0-a580-44ee-bf21-00caae643741_800x300.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-aNF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc09e1ee0-a580-44ee-bf21-00caae643741_800x300.png 424w, https://substackcdn.com/image/fetch/$s_!-aNF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc09e1ee0-a580-44ee-bf21-00caae643741_800x300.png 848w, https://substackcdn.com/image/fetch/$s_!-aNF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc09e1ee0-a580-44ee-bf21-00caae643741_800x300.png 1272w, https://substackcdn.com/image/fetch/$s_!-aNF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc09e1ee0-a580-44ee-bf21-00caae643741_800x300.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-aNF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc09e1ee0-a580-44ee-bf21-00caae643741_800x300.png" width="800" height="300" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c09e1ee0-a580-44ee-bf21-00caae643741_800x300.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:300,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Unity's IL2CPP compiler converts C# code into C++ code. The C++ code can then be compiled to run anywhere.&quot;,&quot;title&quot;:&quot;Unity's IL2CPP compiler converts C# code into C++ code. The C++ code can then be compiled to run anywhere.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Unity's IL2CPP compiler converts C# code into C++ code. The C++ code can then be compiled to run anywhere." title="Unity's IL2CPP compiler converts C# code into C++ code. The C++ code can then be compiled to run anywhere." 
srcset="https://substackcdn.com/image/fetch/$s_!-aNF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc09e1ee0-a580-44ee-bf21-00caae643741_800x300.png 424w, https://substackcdn.com/image/fetch/$s_!-aNF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc09e1ee0-a580-44ee-bf21-00caae643741_800x300.png 848w, https://substackcdn.com/image/fetch/$s_!-aNF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc09e1ee0-a580-44ee-bf21-00caae643741_800x300.png 1272w, https://substackcdn.com/image/fetch/$s_!-aNF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc09e1ee0-a580-44ee-bf21-00caae643741_800x300.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">How the IL2CPP compiler allows Unity to run anywhere. Source: <a href="https://unity.com/blog/engine-platform/an-introduction-to-ilcpp-internals">Unity.</a></figcaption></figure></div><p>We set out to build the same thing for Python.</p><div><hr></div><h2><strong>Sketching Out a Python Compiler</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-Plg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ad8998-2fcf-4ccd-a11a-abda50a6b3fa_1427x959.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-Plg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ad8998-2fcf-4ccd-a11a-abda50a6b3fa_1427x959.png 424w, https://substackcdn.com/image/fetch/$s_!-Plg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ad8998-2fcf-4ccd-a11a-abda50a6b3fa_1427x959.png 848w, https://substackcdn.com/image/fetch/$s_!-Plg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ad8998-2fcf-4ccd-a11a-abda50a6b3fa_1427x959.png 1272w, https://substackcdn.com/image/fetch/$s_!-Plg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ad8998-2fcf-4ccd-a11a-abda50a6b3fa_1427x959.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-Plg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ad8998-2fcf-4ccd-a11a-abda50a6b3fa_1427x959.png" width="1427" height="959" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/18ad8998-2fcf-4ccd-a11a-abda50a6b3fa_1427x959.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:959,&quot;width&quot;:1427,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:107996,&quot;alt&quot;:&quot;Python compiler in three steps: tracing, lowering, and compiling.&quot;,&quot;title&quot;:&quot;Python compiler in three steps: tracing, lowering, and compiling.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.muna.ai/i/171980795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ad8998-2fcf-4ccd-a11a-abda50a6b3fa_1427x959.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Python compiler in three steps: tracing, lowering, and compiling." title="Python compiler in three steps: tracing, lowering, and compiling." 
srcset="https://substackcdn.com/image/fetch/$s_!-Plg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ad8998-2fcf-4ccd-a11a-abda50a6b3fa_1427x959.png 424w, https://substackcdn.com/image/fetch/$s_!-Plg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ad8998-2fcf-4ccd-a11a-abda50a6b3fa_1427x959.png 848w, https://substackcdn.com/image/fetch/$s_!-Plg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ad8998-2fcf-4ccd-a11a-abda50a6b3fa_1427x959.png 1272w, https://substackcdn.com/image/fetch/$s_!-Plg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ad8998-2fcf-4ccd-a11a-abda50a6b3fa_1427x959.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Sketching out a Python compiler.</figcaption></figure></div><p>At a high level, the compiler would:</p><ol><li><p>Ingest plain Python code, with no modifications.</p></li><li><p>Trace it to generate an intermediate representation (IR) graph.</p></li><li><p>Lower the IR to C++ source code.</p></li><li><p>Compile the C++ source code to run across different platforms and architectures.</p></li></ol><p>Before jumping in, you might be wondering: why bother generating C++ first? Why not just go from IR to object code?</p><p>Going back to why we started on this journey, our main focus with Muna has been on compute-intensive applications, especially AI inference. If you&#8217;ve spent time in this space, you&#8217;re familiar with technologies like CUDA, MLX, TensorRT, and so on. But there are so many more frameworks, libraries, and even <a href="https://github.com/corsix/amx">undocumented ISAs</a> that applications can leverage to accelerate everything from matrix multiplication to computer vision.</p><p>We wanted to design a system that would allow us to leverage as many ways of performing a given computation as are available on a given piece of hardware. We&#8217;ll show you how we achieved this, and how this design gives us a novel, data-driven approach to performance optimization.</p><div><hr></div><h2><strong>Building a Symbolic Tracer for Python</strong></h2><p>The first step in building our compiler is to build a symbolic tracer. 
The tracer&#8217;s job is to take in a Python function and emit an intermediate representation (IR) graph that fully captures control flow through the function.</p><p>Our very first prototypes were built upon the <a href="https://docs.pytorch.org/docs/stable/fx.html">PyTorch FX</a> tooling, using the TorchDynamo tracer that shipped with PyTorch 2.0 to capture FX graphs. That tracer was built on <a href="https://peps.python.org/pep-0523/">PEP 523</a>, a CPython feature that lets C extensions override how the interpreter evaluates frames. I won&#8217;t go into too much detail here, as it is a marvel of engineering in its own right, but in summary, PEP 523 enabled the PyTorch team to <a href="https://dev-discuss.pytorch.org/t/supporting-dynamo-in-python-3-12/2320">register a hook</a> that could record every single function call as it was being evaluated by the interpreter:</p><div class="embedded-post-wrap" data-attrs="{&quot;id&quot;:143567425,&quot;url&quot;:&quot;https://blog.codingconfessions.com/p/cpython-vm-internals&quot;,&quot;publication_id&quot;:1611829,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!lstI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;title&quot;:&quot;The Design &amp; Implementation of the CPython Virtual Machine&quot;,&quot;truncated_body_text&quot;:&quot;For every bytecode compiled language, the most interesting part of its implementation is its virtual machine (also referred to as the bytecode interpreter) where the bytecode execution takes place. Because this is such a crucial part of the language machinery, its implementation has to be highly performant. 
Even if you are not a compiler engineer, learning about such internal implementation can give you new performance tricks and insights that you may be able to use in other places of your job. And, if you are a compiler engineer then you should always look around how other languages are implemented to pickup implementation details that you may not be aware of.&quot;,&quot;date&quot;:&quot;2024-08-31T14:35:14.115Z&quot;,&quot;like_count&quot;:46,&quot;comment_count&quot;:0,&quot;bylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;handle&quot;:&quot;abhinavupadhyay&quot;,&quot;previous_name&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;profile_set_up_at&quot;:&quot;2022-11-21T06:38:20.718Z&quot;,&quot;reader_installed_at&quot;:&quot;2023-04-13T10:28:54.373Z&quot;,&quot;publicationUsers&quot;:[{&quot;id&quot;:1583741,&quot;user_id&quot;:14520974,&quot;publication_id&quot;:1611829,&quot;role&quot;:&quot;admin&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:true,&quot;publication&quot;:{&quot;id&quot;:1611829,&quot;name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;subdomain&quot;:&quot;codeconfessions&quot;,&quot;custom_domain&quot;:&quot;blog.codingconfessions.com&quot;,&quot;custom_domain_optional&quot;:false,&quot;hero_text&quot;:&quot;Deep dives into compilers, performance optimization, Linux internals, and low-level programming. 
For engineers who love understanding systems at a fundamental level.&quot;,&quot;logo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;author_id&quot;:14520974,&quot;primary_user_id&quot;:14520974,&quot;theme_var_background_pop&quot;:&quot;#121BFA&quot;,&quot;created_at&quot;:&quot;2023-04-24T10:44:31.435Z&quot;,&quot;email_from_name&quot;:&quot;Abhinav from Coding Confessions&quot;,&quot;copyright&quot;:&quot;Abhinav Upadhyay&quot;,&quot;founding_plan_name&quot;:&quot;Founding Member&quot;,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;enabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false,&quot;homepage_type&quot;:&quot;newspaper&quot;,&quot;is_personal_mode&quot;:false}}],&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100,&quot;status&quot;:{&quot;bestsellerTier&quot;:100,&quot;subscriberTier&quot;:null,&quot;leaderboard&quot;:{&quot;ranking&quot;:&quot;paid&quot;,&quot;rank&quot;:241,&quot;publicationName&quot;:&quot;Confessions of a Code Addict&quot;,&quot;label&quot;:&quot;Technology&quot;,&quot;categoryId&quot;:4},&quot;vip&quot;:false,&quot;badge&quot;:{&quot;type&quot;:&quot;bestseller&quot;,&quot;tier&quot;:100}}}],&quot;utm_campaign&quot;:null,&quot;belowTheFold&quot;:true,&quot;type&quot;:&quot;newsletter&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="EmbeddedPostToDOM"><a class="embedded-post" native="true" href="https://blog.codingconfessions.com/p/cpython-vm-internals?utm_source=substack&amp;utm_campaign=post_embed&amp;utm_medium=web"><div class="embedded-post-header"><img class="embedded-post-publication-logo" src="https://substackcdn.com/image/fetch/$s_!lstI!,w_56,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png" loading="lazy"><span 
class="embedded-post-publication-name">Confessions of a Code Addict</span></div><div class="embedded-post-title-wrapper"><div class="embedded-post-title">The Design &amp; Implementation of the CPython Virtual Machine</div></div><div class="embedded-post-body">For every bytecode compiled language, the most interesting part of its implementation is its virtual machine (also referred to as the bytecode interpreter) where the bytecode execution takes place. Because this is such a crucial part of the language machinery, its implementation has to be highly performant. Even if you are not a compiler engineer, learning about such internal implementation can give you new performance tricks and insights that you may be able to use in other places of your job. And, if you are a compiler engineer then you should always look around how other languages are implemented to pickup implementation details that you may not be aware of&#8230;</div><div class="embedded-post-cta-wrapper"><span class="embedded-post-cta">Read more</span></div><div class="embedded-post-meta">2 years ago &#183; 46 likes &#183; Abhinav Upadhyay</div></a></div><p>Unfortunately, TorchFX had two significant drawbacks that required us to build a custom tracer. The first is that once you hook into the CPython interpreter to record your PyTorch function, <em>you have to actually run said function</em>. For PyTorch, this was not an issue because you could invoke your function with so-called &#8220;<a href="https://docs.pytorch.org/docs/stable/torch.compiler_fake_tensor.html">fake tensors</a>&#8221; that had the right data types, shapes, and devices, but allocated no memory. 
Furthermore, running a function in order to trace it was perfectly in line with how their legacy serialization APIs worked (<code>torch.jit</code> and <code>torch.onnx</code>).</p><p>Since we needed the ability to compile arbitrary Python functions, of which only a small part (or none) might be PyTorch code, we would need a similar mechanism for developers to provide us with inputs to use for tracing. But unlike PyTorch, we could not create a fake image, or fake string, or fake whatever. To us, this was a dead end.</p><p>The second challenge was that even when we created fake data as inputs to the TorchFX tracer, it could only record PyTorch operations. We would have to heavily modify and extend the tracer to support tracing through arbitrary functions across hundreds or thousands of Python libraries. As such, we settled on building a tracer that would instead capture a Python function by parsing its abstract syntax tree (AST). Take an example function:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MwPB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F107c78f0-f099-429a-8477-be2b3e6c5f92_2856x708.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MwPB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F107c78f0-f099-429a-8477-be2b3e6c5f92_2856x708.png 424w, https://substackcdn.com/image/fetch/$s_!MwPB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F107c78f0-f099-429a-8477-be2b3e6c5f92_2856x708.png 848w, 
https://substackcdn.com/image/fetch/$s_!MwPB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F107c78f0-f099-429a-8477-be2b3e6c5f92_2856x708.png 1272w, https://substackcdn.com/image/fetch/$s_!MwPB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F107c78f0-f099-429a-8477-be2b3e6c5f92_2856x708.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MwPB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F107c78f0-f099-429a-8477-be2b3e6c5f92_2856x708.png" width="1456" height="361" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/107c78f0-f099-429a-8477-be2b3e6c5f92_2856x708.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:361,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:106657,&quot;alt&quot;:&quot;Python function that computes the area of a shape.&quot;,&quot;title&quot;:&quot;Python function that computes the area of a shape.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.muna.ai/i/171980795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F107c78f0-f099-429a-8477-be2b3e6c5f92_2856x708.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Python function that computes the area of a shape." title="Python function that computes the area of a shape." 
srcset="https://substackcdn.com/image/fetch/$s_!MwPB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F107c78f0-f099-429a-8477-be2b3e6c5f92_2856x708.png 424w, https://substackcdn.com/image/fetch/$s_!MwPB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F107c78f0-f099-429a-8477-be2b3e6c5f92_2856x708.png 848w, https://substackcdn.com/image/fetch/$s_!MwPB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F107c78f0-f099-429a-8477-be2b3e6c5f92_2856x708.png 1272w, https://substackcdn.com/image/fetch/$s_!MwPB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F107c78f0-f099-429a-8477-be2b3e6c5f92_2856x708.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Simple function that computes the area of a shape.</figcaption></figure></div><p>Our tracer would first extract an AST like so:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xZMZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fede0c5ab-511d-4f36-80bb-7e2100589382_3064x3948.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xZMZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fede0c5ab-511d-4f36-80bb-7e2100589382_3064x3948.png 424w, 
https://substackcdn.com/image/fetch/$s_!xZMZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fede0c5ab-511d-4f36-80bb-7e2100589382_3064x3948.png 848w, https://substackcdn.com/image/fetch/$s_!xZMZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fede0c5ab-511d-4f36-80bb-7e2100589382_3064x3948.png 1272w, https://substackcdn.com/image/fetch/$s_!xZMZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fede0c5ab-511d-4f36-80bb-7e2100589382_3064x3948.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xZMZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fede0c5ab-511d-4f36-80bb-7e2100589382_3064x3948.png" width="1456" height="1876" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ede0c5ab-511d-4f36-80bb-7e2100589382_3064x3948.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1876,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:619243,&quot;alt&quot;:&quot;Visualized AST of the Python function that computes the area of a shape.&quot;,&quot;title&quot;:&quot;Visualized AST of the Python function that computes the area of a shape.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.muna.ai/i/171980795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fede0c5ab-511d-4f36-80bb-7e2100589382_3064x3948.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Visualized AST of the Python function that computes the area of a 
shape." title="Visualized AST of the Python function that computes the area of a shape." srcset="https://substackcdn.com/image/fetch/$s_!xZMZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fede0c5ab-511d-4f36-80bb-7e2100589382_3064x3948.png 424w, https://substackcdn.com/image/fetch/$s_!xZMZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fede0c5ab-511d-4f36-80bb-7e2100589382_3064x3948.png 848w, https://substackcdn.com/image/fetch/$s_!xZMZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fede0c5ab-511d-4f36-80bb-7e2100589382_3064x3948.png 1272w, https://substackcdn.com/image/fetch/$s_!xZMZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fede0c5ab-511d-4f36-80bb-7e2100589382_3064x3948.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Visualized AST of the <code>compute_area</code> function above.</figcaption></figure></div><p>It would then step through the tree, resolve all function calls (i.e. figure out which source library each function call belongs to), and emit a proprietary IR format. Currently, our symbolic tracer supports static analysis (via AST parsing), partial evaluation of the original Python code, live value introspection (using a <a href="https://docs.muna.ai/predictors/sandbox">sandbox</a>), and much more. But somehow, it&#8217;s the least interesting part of our compiler pipeline.</p><div><hr></div><h2><strong>Lowering to C++ via Type Propagation</strong></h2><p>This is where things get really interesting. 
Python is a dynamic language, so variables can be of any type, and those types can change easily:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4tIy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d066b-a371-4be1-bc00-b644075ea652_2744x1068.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4tIy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d066b-a371-4be1-bc00-b644075ea652_2744x1068.png 424w, https://substackcdn.com/image/fetch/$s_!4tIy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d066b-a371-4be1-bc00-b644075ea652_2744x1068.png 848w, https://substackcdn.com/image/fetch/$s_!4tIy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d066b-a371-4be1-bc00-b644075ea652_2744x1068.png 1272w, https://substackcdn.com/image/fetch/$s_!4tIy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d066b-a371-4be1-bc00-b644075ea652_2744x1068.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4tIy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d066b-a371-4be1-bc00-b644075ea652_2744x1068.png" width="1456" height="567" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d87d066b-a371-4be1-bc00-b644075ea652_2744x1068.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:567,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:177131,&quot;alt&quot;:&quot;Example code showing invoking a Python function multiple times, each with different argument types.&quot;,&quot;title&quot;:&quot;Example code showing invoking a Python function multiple times, each with different argument types.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.muna.ai/i/171980795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d066b-a371-4be1-bc00-b644075ea652_2744x1068.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Example code showing invoking a Python function multiple times, each with different argument types." title="Example code showing invoking a Python function multiple times, each with different argument types." 
srcset="https://substackcdn.com/image/fetch/$s_!4tIy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d066b-a371-4be1-bc00-b644075ea652_2744x1068.png 424w, https://substackcdn.com/image/fetch/$s_!4tIy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d066b-a371-4be1-bc00-b644075ea652_2744x1068.png 848w, https://substackcdn.com/image/fetch/$s_!4tIy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d066b-a371-4be1-bc00-b644075ea652_2744x1068.png 1272w, https://substackcdn.com/image/fetch/$s_!4tIy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87d066b-a371-4be1-bc00-b644075ea652_2744x1068.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Example demonstrating Python&#8217;s dynamic nature.</figcaption></figure></div><p>C++, on the other hand, is a statically typed language, where each variable has a single, fixed type that must be known at compile time. While bridging these two languages might seem like an intractable problem, there&#8217;s actually a key insight we can take advantage of:</p><p>When we invoke a Python function with some given inputs, we can uniquely determine the types of all intermediate variables within that function<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RmBV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe4c9d13-8598-4b2a-a3c3-69472c1eb051_2465x2198.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RmBV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe4c9d13-8598-4b2a-a3c3-69472c1eb051_2465x2198.png 424w, https://substackcdn.com/image/fetch/$s_!RmBV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe4c9d13-8598-4b2a-a3c3-69472c1eb051_2465x2198.png 848w, 
https://substackcdn.com/image/fetch/$s_!RmBV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe4c9d13-8598-4b2a-a3c3-69472c1eb051_2465x2198.png 1272w, https://substackcdn.com/image/fetch/$s_!RmBV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe4c9d13-8598-4b2a-a3c3-69472c1eb051_2465x2198.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RmBV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe4c9d13-8598-4b2a-a3c3-69472c1eb051_2465x2198.png" width="1456" height="1298" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/be4c9d13-8598-4b2a-a3c3-69472c1eb051_2465x2198.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1298,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:239055,&quot;alt&quot;:&quot;Graph showing how variable types flow through a Python function when invoked.&quot;,&quot;title&quot;:&quot;Graph showing how variable types flow through a Python function when invoked.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.muna.ai/i/171980795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe4c9d13-8598-4b2a-a3c3-69472c1eb051_2465x2198.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Graph showing how variable types flow through a Python function when invoked." title="Graph showing how variable types flow through a Python function when invoked." 
srcset="https://substackcdn.com/image/fetch/$s_!RmBV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe4c9d13-8598-4b2a-a3c3-69472c1eb051_2465x2198.png 424w, https://substackcdn.com/image/fetch/$s_!RmBV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe4c9d13-8598-4b2a-a3c3-69472c1eb051_2465x2198.png 848w, https://substackcdn.com/image/fetch/$s_!RmBV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe4c9d13-8598-4b2a-a3c3-69472c1eb051_2465x2198.png 1272w, https://substackcdn.com/image/fetch/$s_!RmBV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe4c9d13-8598-4b2a-a3c3-69472c1eb051_2465x2198.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">We can track the types of every variable when a Python function is invoked.</figcaption></figure></div><p>If we know that the inputs <code>x</code> and <code>y</code> are <code>float</code> instances, then we know that the resulting type of their multiplication (i.e., <code>tmp_1</code>) is uniquely determined by whatever the <code>operator.mul</code> function returns. But how do we define <code>operator.mul</code> and get its return type? That&#8217;s where C++<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> comes in.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9Dae!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16b47dc6-049d-4c37-9510-87643e0e4202_2624x528.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9Dae!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16b47dc6-049d-4c37-9510-87643e0e4202_2624x528.png 424w, https://substackcdn.com/image/fetch/$s_!9Dae!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16b47dc6-049d-4c37-9510-87643e0e4202_2624x528.png 848w, 
https://substackcdn.com/image/fetch/$s_!9Dae!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16b47dc6-049d-4c37-9510-87643e0e4202_2624x528.png 1272w, https://substackcdn.com/image/fetch/$s_!9Dae!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16b47dc6-049d-4c37-9510-87643e0e4202_2624x528.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9Dae!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16b47dc6-049d-4c37-9510-87643e0e4202_2624x528.png" width="1456" height="293" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/16b47dc6-049d-4c37-9510-87643e0e4202_2624x528.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:293,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:56235,&quot;alt&quot;:&quot;C++ code showing one possible implementation of Python's multiplication operator.&quot;,&quot;title&quot;:&quot;C++ code showing one possible implementation of Python's multiplication operator.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.muna.ai/i/171980795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16b47dc6-049d-4c37-9510-87643e0e4202_2624x528.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="C++ code showing one possible implementation of Python's multiplication operator." title="C++ code showing one possible implementation of Python's multiplication operator." 
srcset="https://substackcdn.com/image/fetch/$s_!9Dae!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16b47dc6-049d-4c37-9510-87643e0e4202_2624x528.png 424w, https://substackcdn.com/image/fetch/$s_!9Dae!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16b47dc6-049d-4c37-9510-87643e0e4202_2624x528.png 848w, https://substackcdn.com/image/fetch/$s_!9Dae!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16b47dc6-049d-4c37-9510-87643e0e4202_2624x528.png 1272w, https://substackcdn.com/image/fetch/$s_!9Dae!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16b47dc6-049d-4c37-9510-87643e0e4202_2624x528.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Example C++ implementation of Python&#8217;s multiplication operator.</figcaption></figure></div><p>From the above, we now know that <code>tmp_1</code> must be a <code>float</code>. We can repeat this process for the addition call (<code>tmp_1 + z</code>) to get the final result.</p><p>At this point, it&#8217;s worth taking a moment to reflect on what we have created thus far:</p><ol><li><p>We can take a Python function and generate an intermediate representation (IR) that fully captures what it does.</p></li><li><p>We can then use parameter type information, together with a C++ implementation of a Python operator (e.g. <code>operator.mul</code>), to fully determine the type of the first intermediate variable in our Python function.</p></li><li><p>We can repeat (2) for all subsequent intermediate variables in our Python function, until we have propagated types throughout the entire function.</p></li></ol><div><hr></div><h2><strong>Seeding the Type Propagation Process</strong></h2><p>One point worth expanding upon is how we get the initial parameter type information to kickstart the type propagation process. In the example above, how do we know that <code>x</code>, <code>y</code>, and <code>z</code> are all <code>float</code> instances?</p><p>After prototyping a few different approaches, we settled on <a href="https://peps.python.org/pep-0484/">PEP 484</a>, which added support for type annotations to the Python language. The Python interpreter itself does not check or enforce these annotations at runtime<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. 
And while they solved the problem of seeding type propagation, they came with two major drawbacks: first, they conflict with our most important design goal, because using them requires developers to modify their Python code a little<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!svCg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1bf97ed-75d0-417f-889e-337c3b53f80f_2624x708.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!svCg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1bf97ed-75d0-417f-889e-337c3b53f80f_2624x708.png 424w, https://substackcdn.com/image/fetch/$s_!svCg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1bf97ed-75d0-417f-889e-337c3b53f80f_2624x708.png 848w, https://substackcdn.com/image/fetch/$s_!svCg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1bf97ed-75d0-417f-889e-337c3b53f80f_2624x708.png 1272w, https://substackcdn.com/image/fetch/$s_!svCg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1bf97ed-75d0-417f-889e-337c3b53f80f_2624x708.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!svCg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1bf97ed-75d0-417f-889e-337c3b53f80f_2624x708.png" width="1456" height="393" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d1bf97ed-75d0-417f-889e-337c3b53f80f_2624x708.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:393,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:112867,&quot;alt&quot;:&quot;Python code showing the same `compute_area` function from above, but with added type annotations.&quot;,&quot;title&quot;:&quot;Python code showing the same `compute_area` function from above, but with added type annotations.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.muna.ai/i/171980795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1bf97ed-75d0-417f-889e-337c3b53f80f_2624x708.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Python code showing the same `compute_area` function from above, but with added type annotations." title="Python code showing the same `compute_area` function from above, but with added type annotations." 
srcset="https://substackcdn.com/image/fetch/$s_!svCg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1bf97ed-75d0-417f-889e-337c3b53f80f_2624x708.png 424w, https://substackcdn.com/image/fetch/$s_!svCg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1bf97ed-75d0-417f-889e-337c3b53f80f_2624x708.png 848w, https://substackcdn.com/image/fetch/$s_!svCg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1bf97ed-75d0-417f-889e-337c3b53f80f_2624x708.png 1272w, https://substackcdn.com/image/fetch/$s_!svCg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1bf97ed-75d0-417f-889e-337c3b53f80f_2624x708.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Adding type annotations to our Python function.</figcaption></figure></div><p>The code doesn&#8217;t look too different, and some argue that using type annotations makes for better Python code (<a href="https://docs.muna.ai/oss/style/python#type-annotations">we mandate them at Muna</a>). The second drawback was that in order to design a simple and modular interface for consuming the compiled functions, we would have to constrain the <a href="https://docs.muna.ai/predictors/requirements#function-signature">number of distinct input types</a> that could be used by developers<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>. Ultimately, we decided this was a reasonable compromise with nice ergonomics.</p><div><hr></div><h2><strong>Building a Library of C++ Operators</strong></h2><p>At this point, you might have realized a glaring issue in our design: we need to write C++ implementations for potentially tens or hundreds of thousands of Python functions across different libraries. Thankfully, this is a lot less complicated than you might think. 
Consider the function below:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vhSC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52de065b-b734-42fe-96f4-b3ba241efd4a_2624x888.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vhSC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52de065b-b734-42fe-96f4-b3ba241efd4a_2624x888.png 424w, https://substackcdn.com/image/fetch/$s_!vhSC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52de065b-b734-42fe-96f4-b3ba241efd4a_2624x888.png 848w, https://substackcdn.com/image/fetch/$s_!vhSC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52de065b-b734-42fe-96f4-b3ba241efd4a_2624x888.png 1272w, https://substackcdn.com/image/fetch/$s_!vhSC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52de065b-b734-42fe-96f4-b3ba241efd4a_2624x888.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vhSC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52de065b-b734-42fe-96f4-b3ba241efd4a_2624x888.png" width="1456" height="493" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/52de065b-b734-42fe-96f4-b3ba241efd4a_2624x888.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:493,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:120400,&quot;alt&quot;:&quot;Image showing a Python function that invokes two other functions, one of which is defined while the other is imported.&quot;,&quot;title&quot;:&quot;Image showing a Python function that invokes two other functions, one of which is defined while the other is imported.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.muna.ai/i/171980795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52de065b-b734-42fe-96f4-b3ba241efd4a_2624x888.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image showing a Python function that invokes two other functions, one of which is defined while the other is imported." title="Image showing a Python function that invokes two other functions, one of which is defined while the other is imported." 
srcset="https://substackcdn.com/image/fetch/$s_!vhSC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52de065b-b734-42fe-96f4-b3ba241efd4a_2624x888.png 424w, https://substackcdn.com/image/fetch/$s_!vhSC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52de065b-b734-42fe-96f4-b3ba241efd4a_2624x888.png 848w, https://substackcdn.com/image/fetch/$s_!vhSC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52de065b-b734-42fe-96f4-b3ba241efd4a_2624x888.png 1272w, https://substackcdn.com/image/fetch/$s_!vhSC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52de065b-b734-42fe-96f4-b3ba241efd4a_2624x888.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">We only need C++ to cover functions whose definitions are not available.</figcaption></figure></div><p>When compiling <code>cosecant</code>, we see that there are function calls to <code>sin</code> and <code>reciprocal</code>. Our compiler first checks if it can trace through each function call. In the case of <code>sin</code>, we don&#8217;t have a function definition for it (only an <code>import</code>), so we cannot trace through it<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>. This forms a leaf node that we must implement manually in C++. We can, however, trace through the call to <code>reciprocal</code>, which gives us an IR graph for it. This graph can then be lowered and reused at other call sites.</p><p>The key insight above is that most Python functions our compiler will encounter are composed of a smaller set of elementary functions. What accounts for the large variety of code in the wild is not the number of unique elementary functions that make them up; rather, it&#8217;s the different arrangements of these elementary functions.</p><p>Still, you could argue that there are potentially thousands of these elementary functions across different libraries that we would have to cover, and you would be 100% correct. Thankfully, we now have an amazing tool that makes this an easy problem to solve: AI-powered code generation.</p><p>Today&#8217;s LLMs are capable of writing verifiably-correct, high-performance code across a wide variety of programming languages. 
As such, we&#8217;ve been building infrastructure to constrain the code they generate, test the code to ensure correctness, and handle ancillary logic for things like dependency management and conditional compilation. So far, we have used AI to generate implementations of hundreds of Python functions <a href="https://docs.muna.ai/predictors/requirements#supported-libraries">across popular libraries</a> like NumPy, OpenCV, and PyTorch.</p><div><hr></div><h2><strong>Performance Optimization via Exhaustive Search</strong></h2><p>The final topic worth discussing is performance optimization. Most popular approaches here involve hand-writing low-level code (e.g. assembly or PTX); using heterogeneous accelerators (e.g. GPU, NPU); doing heuristic-based algorithm selection at runtime (e.g. convolution algorithm search in <a href="https://artificial-intelligence.sites.arm.com/computelibrary/v52.4.0/conv2d_heuristic.xhtml">ArmCL</a> and <a href="https://docs.nvidia.com/deeplearning/cudnn/backend/latest/api/cudnn-cnn-library.html#cudnnfindconvolutionforwardalgorithm">cuDNN</a>); or some combination thereof.</p><p>From our past experience building extremely low-latency computer vision pipelines for embedded systems, we have learned a very bitter lesson<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>: effective performance optimization <strong>is always empirical</strong>. The latency of a given operation on a given piece of hardware depends on so many factors that the only way to know <em>for sure</em> is to simply test every approach you have. The only reason engineering teams don&#8217;t do this is that it is impractical: you would have to rewrite your code tens or hundreds of times, then test each variant&#8230;but wait!</p><p>Earlier, we went over how we propagated types through a Python function with the help of a C++ operator. 
What I didn&#8217;t mention was that we don&#8217;t just use one C++ operator; we use as many as we can write (*ahem* generate). So instead of this:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!leiG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c96b5b-48e0-46ec-8aaa-4a7f9bd75371_1711x388.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!leiG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c96b5b-48e0-46ec-8aaa-4a7f9bd75371_1711x388.png 424w, https://substackcdn.com/image/fetch/$s_!leiG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c96b5b-48e0-46ec-8aaa-4a7f9bd75371_1711x388.png 848w, https://substackcdn.com/image/fetch/$s_!leiG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c96b5b-48e0-46ec-8aaa-4a7f9bd75371_1711x388.png 1272w, https://substackcdn.com/image/fetch/$s_!leiG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c96b5b-48e0-46ec-8aaa-4a7f9bd75371_1711x388.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!leiG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c96b5b-48e0-46ec-8aaa-4a7f9bd75371_1711x388.png" width="1456" height="330" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a1c96b5b-48e0-46ec-8aaa-4a7f9bd75371_1711x388.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:330,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:54214,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.muna.ai/i/171980795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c96b5b-48e0-46ec-8aaa-4a7f9bd75371_1711x388.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!leiG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c96b5b-48e0-46ec-8aaa-4a7f9bd75371_1711x388.png 424w, https://substackcdn.com/image/fetch/$s_!leiG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c96b5b-48e0-46ec-8aaa-4a7f9bd75371_1711x388.png 848w, https://substackcdn.com/image/fetch/$s_!leiG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c96b5b-48e0-46ec-8aaa-4a7f9bd75371_1711x388.png 1272w, https://substackcdn.com/image/fetch/$s_!leiG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c96b5b-48e0-46ec-8aaa-4a7f9bd75371_1711x388.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">We don&#8217;t just generate one C++ program from a Python function.</figcaption></figure></div><p>What really happens is this:</p><div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DPWW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7da62f0-4f60-40fc-bc41-b05fcbe074de_2475x2787.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DPWW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7da62f0-4f60-40fc-bc41-b05fcbe074de_2475x2787.png 424w, https://substackcdn.com/image/fetch/$s_!DPWW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7da62f0-4f60-40fc-bc41-b05fcbe074de_2475x2787.png 848w, https://substackcdn.com/image/fetch/$s_!DPWW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7da62f0-4f60-40fc-bc41-b05fcbe074de_2475x2787.png 1272w, https://substackcdn.com/image/fetch/$s_!DPWW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7da62f0-4f60-40fc-bc41-b05fcbe074de_2475x2787.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DPWW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7da62f0-4f60-40fc-bc41-b05fcbe074de_2475x2787.png" width="1456" height="1640" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b7da62f0-4f60-40fc-bc41-b05fcbe074de_2475x2787.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1640,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:471061,&quot;alt&quot;:&quot;Graph of many C++ programs 
we generate from a single Python function.&quot;,&quot;title&quot;:&quot;Graph of many C++ programs we generate from a single Python function.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.muna.ai/i/171980795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7da62f0-4f60-40fc-bc41-b05fcbe074de_2475x2787.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Graph of many C++ programs we generate from a single Python function." title="Graph of many C++ programs we generate from a single Python function." srcset="https://substackcdn.com/image/fetch/$s_!DPWW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7da62f0-4f60-40fc-bc41-b05fcbe074de_2475x2787.png 424w, https://substackcdn.com/image/fetch/$s_!DPWW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7da62f0-4f60-40fc-bc41-b05fcbe074de_2475x2787.png 848w, https://substackcdn.com/image/fetch/$s_!DPWW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7da62f0-4f60-40fc-bc41-b05fcbe074de_2475x2787.png 1272w, https://substackcdn.com/image/fetch/$s_!DPWW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7da62f0-4f60-40fc-bc41-b05fcbe074de_2475x2787.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">We generate as many C++ programs as we can from a single Python function.</figcaption></figure></div><p>Each path from <code>start</code> to <code>result</code> is a unique program, guaranteed to be correct with respect to the original Python function. But each C++ operator (colored rectangles) could be powered by different algorithms, libraries, and even hardware accelerators. 
Let&#8217;s walk through a concrete example:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PE_E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea5db84-da19-4929-9444-6f99ab5d4665_2832x796.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PE_E!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea5db84-da19-4929-9444-6f99ab5d4665_2832x796.png 424w, https://substackcdn.com/image/fetch/$s_!PE_E!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea5db84-da19-4929-9444-6f99ab5d4665_2832x796.png 848w, https://substackcdn.com/image/fetch/$s_!PE_E!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea5db84-da19-4929-9444-6f99ab5d4665_2832x796.png 1272w, https://substackcdn.com/image/fetch/$s_!PE_E!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea5db84-da19-4929-9444-6f99ab5d4665_2832x796.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PE_E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea5db84-da19-4929-9444-6f99ab5d4665_2832x796.png" width="1456" height="409" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ea5db84-da19-4929-9444-6f99ab5d4665_2832x796.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:409,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:138697,&quot;alt&quot;:&quot;Python function that resizes an image to 64x64.&quot;,&quot;title&quot;:&quot;Python function that resizes an image to 64x64.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.muna.ai/i/171980795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea5db84-da19-4929-9444-6f99ab5d4665_2832x796.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Python function that resizes an image to 64x64." title="Python function that resizes an image to 64x64." 
srcset="https://substackcdn.com/image/fetch/$s_!PE_E!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea5db84-da19-4929-9444-6f99ab5d4665_2832x796.png 424w, https://substackcdn.com/image/fetch/$s_!PE_E!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea5db84-da19-4929-9444-6f99ab5d4665_2832x796.png 848w, https://substackcdn.com/image/fetch/$s_!PE_E!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea5db84-da19-4929-9444-6f99ab5d4665_2832x796.png 1272w, https://substackcdn.com/image/fetch/$s_!PE_E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea5db84-da19-4929-9444-6f99ab5d4665_2832x796.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">A simple Python function to resize an image.</figcaption></figure></div><p>The function above resizes an input image to <code>64x64</code> with bilinear resampling using the <code>torchvision</code> library. When compiling this function for Apple Silicon (macOS, iOS, or visionOS), we have a range of approaches and libraries to choose from, including:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_gxd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64eb40c2-6dde-4e93-8d65-8a1d40042afa_4342x4042.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_gxd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64eb40c2-6dde-4e93-8d65-8a1d40042afa_4342x4042.png 424w, https://substackcdn.com/image/fetch/$s_!_gxd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64eb40c2-6dde-4e93-8d65-8a1d40042afa_4342x4042.png 848w, https://substackcdn.com/image/fetch/$s_!_gxd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64eb40c2-6dde-4e93-8d65-8a1d40042afa_4342x4042.png 1272w, 
https://substackcdn.com/image/fetch/$s_!_gxd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64eb40c2-6dde-4e93-8d65-8a1d40042afa_4342x4042.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_gxd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64eb40c2-6dde-4e93-8d65-8a1d40042afa_4342x4042.png" width="1456" height="1355" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64eb40c2-6dde-4e93-8d65-8a1d40042afa_4342x4042.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1355,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2025724,&quot;alt&quot;:&quot;Image showing how we generate C++ programs that use different approaches to resize an image.&quot;,&quot;title&quot;:&quot;Image showing how we generate C++ programs that use different approaches to resize an image.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.muna.ai/i/171980795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64eb40c2-6dde-4e93-8d65-8a1d40042afa_4342x4042.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image showing how we generate C++ programs that use different approaches to resize an image." title="Image showing how we generate C++ programs that use different approaches to resize an image." 
srcset="https://substackcdn.com/image/fetch/$s_!_gxd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64eb40c2-6dde-4e93-8d65-8a1d40042afa_4342x4042.png 424w, https://substackcdn.com/image/fetch/$s_!_gxd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64eb40c2-6dde-4e93-8d65-8a1d40042afa_4342x4042.png 848w, https://substackcdn.com/image/fetch/$s_!_gxd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64eb40c2-6dde-4e93-8d65-8a1d40042afa_4342x4042.png 1272w, https://substackcdn.com/image/fetch/$s_!_gxd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64eb40c2-6dde-4e93-8d65-8a1d40042afa_4342x4042.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Each implementation uses a different approach to resizing the image.</figcaption></figure></div><p>The above is just a small selection: there are many more ways to implement a bilinear resize on Apple Silicon (e.g. using the GPU or the Neural Engine). The key here is that we can generate as many of these implementations as possible (thanks to LLM-powered codegen), then emit a compiled program for each one, with no limit on how many we produce. So in the example above, the user&#8217;s Python function would be emitted as four unique programs for Apple Silicon alone. In our real-world testing, we have seen a single Python function emitted as almost 200 unique programs across 9 compile targets.</p><p>From here, we can test each compiled function to discover which one runs the fastest on given hardware. We gather fine-grained telemetry data, containing latency information for each operation, and use this data to build statistical models that predict which variant runs the fastest. There are two significant benefits to this design:</p><ol><li><p>We can optimize code purely empirically. We don&#8217;t make any assumptions about which code might perform best, and we don&#8217;t need a separate performance tuning step after generating code. We simply ship out every compiled binary we have, gather telemetry data, and use it to discover which one is the fastest.</p></li><li><p>We benefit from network effects. 
Because the C++ operators are shared among thousands of compiled functions, and because we ship these compiled functions to hundreds of thousands of unique devices across all of our users, we have tons of data that we can use to optimize every piece of code we generate.</p></li></ol><p>For our users, it will feel as though their compiled Python functions simply get faster over time, entirely on autopilot.</p><div><hr></div><h2><strong>Designing a User Interface for the Compiler</strong></h2><p>Now, we have to wrap up everything we&#8217;ve covered above into a user interface. Our most important guiding principle was to design something with near-zero cognitive load. Specifically, we didn&#8217;t want developers to have to learn anything new to use the compiler. We decided to go with decorators (<a href="https://peps.python.org/pep-0318/">PEP 318</a>):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-CJO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea421690-0de0-4c1e-aaed-6d07ec93b38d_2512x976.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-CJO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea421690-0de0-4c1e-aaed-6d07ec93b38d_2512x976.png 424w, https://substackcdn.com/image/fetch/$s_!-CJO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea421690-0de0-4c1e-aaed-6d07ec93b38d_2512x976.png 848w, https://substackcdn.com/image/fetch/$s_!-CJO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea421690-0de0-4c1e-aaed-6d07ec93b38d_2512x976.png 1272w, 
https://substackcdn.com/image/fetch/$s_!-CJO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea421690-0de0-4c1e-aaed-6d07ec93b38d_2512x976.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-CJO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea421690-0de0-4c1e-aaed-6d07ec93b38d_2512x976.png" width="1456" height="566" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ea421690-0de0-4c1e-aaed-6d07ec93b38d_2512x976.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:566,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:152463,&quot;alt&quot;:&quot;Image showing how developers use our compiler by adding an `@compile` decorator to their function.&quot;,&quot;title&quot;:&quot;Image showing how developers use our compiler by adding an `@compile` decorator to their function.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.muna.ai/i/171980795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea421690-0de0-4c1e-aaed-6d07ec93b38d_2512x976.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image showing how developers use our compiler by adding an `@compile` decorator to their function." title="Image showing how developers use our compiler by adding an `@compile` decorator to their function." 
srcset="https://substackcdn.com/image/fetch/$s_!-CJO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea421690-0de0-4c1e-aaed-6d07ec93b38d_2512x976.png 424w, https://substackcdn.com/image/fetch/$s_!-CJO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea421690-0de0-4c1e-aaed-6d07ec93b38d_2512x976.png 848w, https://substackcdn.com/image/fetch/$s_!-CJO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea421690-0de0-4c1e-aaed-6d07ec93b38d_2512x976.png 1272w, https://substackcdn.com/image/fetch/$s_!-CJO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea421690-0de0-4c1e-aaed-6d07ec93b38d_2512x976.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Developers simply have to <code>@compile</code> their Python function.</figcaption></figure></div><p>Developers could simply decorate their Python function with <code>@compile</code> to specify the compilation entrypoint. Then, they would compile the function<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a> and all its dependencies using the CLI:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IFp_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa11e5651-c5e6-4628-8b3d-a5cc74bc34e7_2512x436.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IFp_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa11e5651-c5e6-4628-8b3d-a5cc74bc34e7_2512x436.png 424w, https://substackcdn.com/image/fetch/$s_!IFp_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa11e5651-c5e6-4628-8b3d-a5cc74bc34e7_2512x436.png 848w, https://substackcdn.com/image/fetch/$s_!IFp_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa11e5651-c5e6-4628-8b3d-a5cc74bc34e7_2512x436.png 1272w, 
https://substackcdn.com/image/fetch/$s_!IFp_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa11e5651-c5e6-4628-8b3d-a5cc74bc34e7_2512x436.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IFp_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa11e5651-c5e6-4628-8b3d-a5cc74bc34e7_2512x436.png" width="1456" height="253" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a11e5651-c5e6-4628-8b3d-a5cc74bc34e7_2512x436.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:253,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:65221,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.muna.ai/i/171980795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa11e5651-c5e6-4628-8b3d-a5cc74bc34e7_2512x436.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!IFp_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa11e5651-c5e6-4628-8b3d-a5cc74bc34e7_2512x436.png 424w, https://substackcdn.com/image/fetch/$s_!IFp_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa11e5651-c5e6-4628-8b3d-a5cc74bc34e7_2512x436.png 848w, 
https://substackcdn.com/image/fetch/$s_!IFp_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa11e5651-c5e6-4628-8b3d-a5cc74bc34e7_2512x436.png 1272w, https://substackcdn.com/image/fetch/$s_!IFp_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa11e5651-c5e6-4628-8b3d-a5cc74bc34e7_2512x436.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jqDd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d262d1a-bc43-449b-944e-ca52da798f26_2012x1080.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jqDd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d262d1a-bc43-449b-944e-ca52da798f26_2012x1080.gif 424w, https://substackcdn.com/image/fetch/$s_!jqDd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d262d1a-bc43-449b-944e-ca52da798f26_2012x1080.gif 848w, https://substackcdn.com/image/fetch/$s_!jqDd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d262d1a-bc43-449b-944e-ca52da798f26_2012x1080.gif 1272w, https://substackcdn.com/image/fetch/$s_!jqDd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d262d1a-bc43-449b-944e-ca52da798f26_2012x1080.gif 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!jqDd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d262d1a-bc43-449b-944e-ca52da798f26_2012x1080.gif" width="724.703125" height="388.73155262706047" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2d262d1a-bc43-449b-944e-ca52da798f26_2012x1080.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:781,&quot;width&quot;:1456,&quot;resizeWidth&quot;:724.703125,&quot;bytes&quot;:1886978,&quot;alt&quot;:&quot;Animated image showing compilation of a Python function with the Muna command line interface.&quot;,&quot;title&quot;:&quot;Animated image showing compilation of a Python function with the Muna command line interface.&quot;,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.muna.ai/i/171980795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d262d1a-bc43-449b-944e-ca52da798f26_2012x1080.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="Animated image showing compilation of a Python function with the Muna command line interface." title="Animated image showing compilation of a Python function with the Muna command line interface." 
srcset="https://substackcdn.com/image/fetch/$s_!jqDd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d262d1a-bc43-449b-944e-ca52da798f26_2012x1080.gif 424w, https://substackcdn.com/image/fetch/$s_!jqDd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d262d1a-bc43-449b-944e-ca52da798f26_2012x1080.gif 848w, https://substackcdn.com/image/fetch/$s_!jqDd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d262d1a-bc43-449b-944e-ca52da798f26_2012x1080.gif 1272w, https://substackcdn.com/image/fetch/$s_!jqDd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d262d1a-bc43-449b-944e-ca52da798f26_2012x1080.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Compiling a Python function with the Muna command line interface.</figcaption></figure></div><p>We fell in love with the decorator paradigm after seeing how strongly developers preferred expressing complex infrastructure as code<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a>. Furthermore, it was a familiar form factor within the Python ecosystem, as evidenced by its use in Numba and PyTorch. With the decorator, our CLI could find the compilation entrypoint function and use it as a springboard to crawl through all other dependency code (both first-party code provided by the developer and third-party packages installed via <code>pip</code> or <code>uv</code>).</p><p>The <code>@compile</code> decorator would also serve as the primary customization point for developers compiling their function. Beyond the required <code>tag</code> (which uniquely identifies the function on our platform) and <code>description</code>, developers could provide a sandbox description to recreate their local development environment (e.g. installing Python packages, uploading files), along with <code>metadata</code> to assist the compiler during codegen (e.g. 
<a href="https://docs.muna.ai/predictors/ai#inference-backends">running PyTorch AI inference</a> with ONNXRuntime, TensorRT, CoreML, IREE, QNN, and more).</p><p>Once compiled, anyone can run the compiled function anywhere<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HbPW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d9d2422-c320-4dfd-9b56-732e1a334b6f_2756x1608.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HbPW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d9d2422-c320-4dfd-9b56-732e1a334b6f_2756x1608.png 424w, https://substackcdn.com/image/fetch/$s_!HbPW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d9d2422-c320-4dfd-9b56-732e1a334b6f_2756x1608.png 848w, https://substackcdn.com/image/fetch/$s_!HbPW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d9d2422-c320-4dfd-9b56-732e1a334b6f_2756x1608.png 1272w, https://substackcdn.com/image/fetch/$s_!HbPW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d9d2422-c320-4dfd-9b56-732e1a334b6f_2756x1608.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HbPW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d9d2422-c320-4dfd-9b56-732e1a334b6f_2756x1608.png" width="1456" height="850" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0d9d2422-c320-4dfd-9b56-732e1a334b6f_2756x1608.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:850,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:293267,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.muna.ai/i/171980795?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d9d2422-c320-4dfd-9b56-732e1a334b6f_2756x1608.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!HbPW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d9d2422-c320-4dfd-9b56-732e1a334b6f_2756x1608.png 424w, https://substackcdn.com/image/fetch/$s_!HbPW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d9d2422-c320-4dfd-9b56-732e1a334b6f_2756x1608.png 848w, https://substackcdn.com/image/fetch/$s_!HbPW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d9d2422-c320-4dfd-9b56-732e1a334b6f_2756x1608.png 1272w, https://substackcdn.com/image/fetch/$s_!HbPW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d9d2422-c320-4dfd-9b56-732e1a334b6f_2756x1608.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Invoking the compiled function with the Muna command line interface.</figcaption></figure></div><div><hr></div><h2><strong>Closing Thoughts</strong></h2><p>In all candor, we still have a high level of disbelief that any of this <em>actually</em> works. That said, many standard Python features are still only partially supported or missing in the compiler: exceptions, lambda expressions, recursive functions, and classes. The through-line connecting these missing features is our type propagation system. While type propagation works for simple functions with unitary parameter and return types, it requires additional consideration for composite types (e.g. unions) and higher-order types (e.g. classes, lambda expressions).</p><p>The other significant item we are still figuring out is the debugging experience. 
The good news for us is that we guarantee that developers&#8217; Python code will work as expected once compiled, absolving them of any responsibility to debug the code at runtime. This is similar to how developers who use Docker or other containerization technologies simply expect everything to work&#8212;almost nobody debugs their Docker image layers. The bad news is that because we enable developers to run their Python code anywhere, we have to figure out how to write extremely safe code, and how to gather fine-grained, symbolicated trace data when some function raises an exception. This is further complicated by the fact that, because we have to deliver the smallest and fastest compiled binaries possible, we compile generated code with full optimizations, inevitably stripping out valuable debug information.</p><p>It has not all been difficult though, especially because the evolving C++ standard has been a major boon for us. Muna would not exist without C++20, because our code generation relies extensively on <code>std::span</code>, concepts, and most importantly, coroutines. And we&#8217;re dying for broad C++23 support, because we use <code>std::generator</code> to <a href="https://docs.muna.ai/predictions/stream">support streaming</a>, <code>&lt;stdfloat&gt;</code> to support <code>float16_t</code> and <code>bfloat16_t</code>, and <code>&lt;stacktrace&gt;</code> to support Python exceptions.</p><p>On a final note, if you are currently deploying embedding models or object detection models in your organization, or if you find any of this work interesting, <a href="https://muna.ai/slack">we would love to chat with you</a>. 
We&#8217;d love for more developers to use the compiler on problems and programs we haven&#8217;t yet run ourselves, and we love to meet developers who enjoy the mundane, low-level worlds of Python, C++, and everything in-between.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://muna.ai/slack&quot;,&quot;text&quot;:&quot;Come Chat with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://muna.ai/slack"><span>Come Chat with Us</span></a></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p><a href="https://youtu.be/IHHXYqdCV_M?t=21">Rumble Racing</a> and <a href="https://youtu.be/dz9hN_dfLz0">Sly Cooper</a>: Not the most well-known titles on the PS2, but games that carry an incredible amount of sentimental value from my upbringing.</p><p></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>One way to think about bridging Python and C++ during lowering is in terms of how implicit template instantiation works in C++. 
The Python function defines a template function, and the input types are used to instantiate a concrete function from it.</p><p></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Technically, our compiler doesn&#8217;t just compile to C++. We use C++ as the primary language for code generation, but we also emit Objective-C and Rust in some cases. Furthermore, we are actively exploring emitting Mojo.</p><p></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>The main exception here consists of data validation libraries like <a href="https://docs.pydantic.dev/latest/">Pydantic</a>, which use type hints to build schemas for validating and serializing data.</p><p></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Only the compiler entrypoint function (i.e. the function which is decorated with <code>@compile</code>) requires type annotations. All other functions can be duck typed as normal.</p><p></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Only the compiler entrypoint function (i.e. the function which is decorated with <code>@compile</code>) is subject to this constraint. 
All other functions can accept arbitrary input types, and return arbitrary output types.</p><p></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Our <code>@compile</code> decorator supports providing a list of <code>trace_modules</code> which opt entire modules into tracing. Functions that are not provided as part of a developer&#8217;s original Python code must explicitly be opted into tracing.</p><p></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p><a href="https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf">The Bitter Lesson</a> by Rich Sutton.</p><p></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>By default, we currently compile for Android, iOS, Linux, macOS, WebAssembly, and Windows.</p><p></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>We took inspiration from projects like Pulumi and Modal.</p><p></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>We provide client libraries for <a href="https://github.com/muna-ai/muna-py">Python</a>, <a href="https://github.com/muna-ai/muna-js">JavaScript</a> (browser and Node.js), <a href="https://github.com/muna-ai/muna-swift">Swift</a> (iOS), <a 
href="https://central.sonatype.com/artifact/ai.muna/muna">Kotlin</a> (Android), and <a href="https://github.com/muna-ai/muna-unity">Unity Engine</a>. And our React Native client is coming soon.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[What Makes System Calls Expensive: A Linux Internals Deep Dive]]></title><description><![CDATA[An explanation of how Linux handles system calls on x86-64 and why they show up as expensive operations in performance profiles]]></description><link>https://blog.codingconfessions.com/p/what-makes-system-calls-expensive</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/what-makes-system-calls-expensive</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Tue, 16 Sep 2025 18:03:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!eEc8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa686ab6a-05dc-4e23-ab86-75fce9a66356_1536x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eEc8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa686ab6a-05dc-4e23-ab86-75fce9a66356_1536x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eEc8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa686ab6a-05dc-4e23-ab86-75fce9a66356_1536x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!eEc8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa686ab6a-05dc-4e23-ab86-75fce9a66356_1536x1024.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!eEc8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa686ab6a-05dc-4e23-ab86-75fce9a66356_1536x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!eEc8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa686ab6a-05dc-4e23-ab86-75fce9a66356_1536x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eEc8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa686ab6a-05dc-4e23-ab86-75fce9a66356_1536x1024.jpeg" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a686ab6a-05dc-4e23-ab86-75fce9a66356_1536x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:120562,&quot;alt&quot;:&quot;Cover: A Flamegraph highlighting performance overhead due to system calls&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/172480326?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa686ab6a-05dc-4e23-ab86-75fce9a66356_1536x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Cover: A Flamegraph highlighting performance overhead due to system calls" title="Cover: A Flamegraph highlighting performance overhead due to system calls" 
srcset="https://substackcdn.com/image/fetch/$s_!eEc8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa686ab6a-05dc-4e23-ab86-75fce9a66356_1536x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!eEc8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa686ab6a-05dc-4e23-ab86-75fce9a66356_1536x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!eEc8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa686ab6a-05dc-4e23-ab86-75fce9a66356_1536x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!eEc8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa686ab6a-05dc-4e23-ab86-75fce9a66356_1536x1024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Cover: A Flamegraph highlighting performance overhead due to system calls</figcaption></figure></div><p>System calls are how user programs talk to the operating system. They include opening files, reading the current time, creating processes, and more. They&#8217;re unavoidable, but they&#8217;re also not cheap.</p><p>If you&#8217;ve ever looked at a flame graph, you&#8217;ll notice system calls often show up as hot spots. Engineers spend a lot of effort cutting them down, and whole features such as io_uring for batching I/O or eBPF for running code inside the kernel exist just to reduce how often programs have to cross into kernel mode.</p><p>Why are they so costly? The obvious part is the small bit of kernel code that runs for each call. The bigger cost comes from what happens around it: every transition into the kernel makes the CPU drop its optimizations, flush pipelines, and reset predictor state, then rebuild them again on return. This disruption is what makes system calls much more expensive than they appear in the source code.</p><p>In this article, we&#8217;ll look at what really happens when you make a system call on Linux x86-64. 
We&#8217;ll follow the kernel entry and exit path, analyse the direct overheads, and then dig into the indirect microarchitectural side-effects that explain why minimizing system calls is such an important optimization.</p><div><hr></div><h3>CodeRabbit: Free AI Code Reviews in CLI (<em>Sponsored</em>)</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://coderabbit.link/fIVg8LI" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hJCC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png 424w, https://substackcdn.com/image/fetch/$s_!hJCC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png 848w, https://substackcdn.com/image/fetch/$s_!hJCC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png 1272w, https://substackcdn.com/image/fetch/$s_!hJCC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hJCC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png" width="1456" height="879" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:879,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;CodeRabbit CLI: A Code Review Agent to Review your AI Generated Code&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://coderabbit.link/fIVg8LI&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="CodeRabbit CLI: A Code Review Agent to Review your AI Generated Code" title="CodeRabbit CLI: A Code Review Agent to Review your AI Generated Code" srcset="https://substackcdn.com/image/fetch/$s_!hJCC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png 424w, https://substackcdn.com/image/fetch/$s_!hJCC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png 848w, https://substackcdn.com/image/fetch/$s_!hJCC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png 1272w, https://substackcdn.com/image/fetch/$s_!hJCC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39af5e24-3ccb-4f2d-808b-c3f5fc9c23dc_1600x966.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">CodeRabbit CLI: A Code Review Agent to Review your AI Generated Code</figcaption></figure></div><p>As developers increasingly turn to CLI coding agents like Claude Code for rapid development, a critical gap emerges: who reviews the AI-generated code? CodeRabbit CLI fills this void by delivering senior-level code reviews directly in your terminal, creating a seamless workflow where code generation flows directly into automated validation. Review uncommitted changes, catch AI hallucinations, and get one-click fixes - all without leaving your command line. 
It's the quality gate that makes autonomous coding truly possible, ensuring every line of AI-generated code meets production standards before it ships.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://coderabbit.link/fIVg8LI&quot;,&quot;text&quot;:&quot;Get Started Today&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://coderabbit.link/fIVg8LI"><span>Get Started Today</span></a></p><div><hr></div><h2>Background on System Calls</h2><p>Let&#8217;s start with a quick overview of system calls. These are routines inside the kernel that provide specific services to user space. They live in the kernel because they need privileged access to registers, instructions, or hardware devices. For example, reading a file from disk requires talking to the disk controller, and creating a new process requires allocating hardware resources. Both are privileged operations, which is why they are system calls.</p><p>Invoking a system call requires a special mechanism to switch execution from user space to kernel space. On x86-64 this is done using the <code>syscall</code> instruction, where you place the syscall number in <code>rax</code> and the arguments, in order, in registers (<code>rdi</code>, <code>rsi</code>, <code>rdx</code>, <code>r10</code>, <code>r8</code>, <code>r9</code>), then invoke <code>syscall</code>:</p><pre><code># set args for calling read syscall
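movq $0, %rax      # syscall number: 0 is read on x86-64 (1 would be write)
movq $0, %rdi      # arg 1: file descriptor (0 = stdin)
movq $buf, %rsi    # arg 2: address of the destination buffer
movq $size, %rdx   # arg 3: maximum number of bytes to read
syscall            # we enter the kernel here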
movq %rax, %rbx</code></pre><p>On encountering this instruction, the processor switches to kernel mode and jumps to the registered syscall entry path. The kernel completes the context switch (switching the page tables and stack) and then jumps to the specific syscall implementation.</p><p>When the syscall finishes, it places the return value in <code>rax</code> and returns. Returning requires another privilege mode switch, reversing everything done on entry: restoring the user page table, stack, and registers.</p><p>The following diagram illustrates the sequence of steps required to execute a system call (<code>read</code> in this case). </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vTBi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d138001-3c2c-4d7d-8797-9db03f0cab97_857x567.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vTBi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d138001-3c2c-4d7d-8797-9db03f0cab97_857x567.png 424w, https://substackcdn.com/image/fetch/$s_!vTBi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d138001-3c2c-4d7d-8797-9db03f0cab97_857x567.png 848w, https://substackcdn.com/image/fetch/$s_!vTBi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d138001-3c2c-4d7d-8797-9db03f0cab97_857x567.png 1272w, https://substackcdn.com/image/fetch/$s_!vTBi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d138001-3c2c-4d7d-8797-9db03f0cab97_857x567.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vTBi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d138001-3c2c-4d7d-8797-9db03f0cab97_857x567.png" width="857" height="567" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d138001-3c2c-4d7d-8797-9db03f0cab97_857x567.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/27883ed8-3363-4575-b71d-3aa1d3dfead6_857x567.png&quot;,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:567,&quot;width&quot;:857,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:72116,&quot;alt&quot;:&quot;Flow of a read system call: user space sets up arguments and invokes syscall, control transfers to the kernel entry handler, the kernel executes the system call (keys_read), and then returns control back to user space.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/172480326?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27883ed8-3363-4575-b71d-3aa1d3dfead6_857x567.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Flow of a read system call: user space sets up arguments and invokes syscall, control transfers to the kernel entry handler, the kernel executes the system call (keys_read), and then returns control back to user space." title="Flow of a read system call: user space sets up arguments and invokes syscall, control transfers to the kernel entry handler, the kernel executes the system call (keys_read), and then returns control back to user space." 
srcset="https://substackcdn.com/image/fetch/$s_!vTBi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d138001-3c2c-4d7d-8797-9db03f0cab97_857x567.png 424w, https://substackcdn.com/image/fetch/$s_!vTBi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d138001-3c2c-4d7d-8797-9db03f0cab97_857x567.png 848w, https://substackcdn.com/image/fetch/$s_!vTBi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d138001-3c2c-4d7d-8797-9db03f0cab97_857x567.png 1272w, https://substackcdn.com/image/fetch/$s_!vTBi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d138001-3c2c-4d7d-8797-9db03f0cab97_857x567.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Flow of a <code>read</code> system call: user space sets up arguments and invokes <code>syscall</code>, control transfers to the kernel entry handler, the kernel executes the system call (<code>ksys_read</code>), and then returns control to user space.</figcaption></figure></div><p>In the figure:</p><ul><li><p>User space code sets up arguments for the <code>read</code> system call.</p></li><li><p>It invokes the system call using the <code>syscall</code> instruction.</p></li><li><p>The instruction switches to kernel mode and enters the syscall entry handler, where the kernel switches to its own page table and stack.</p></li><li><p>The kernel then jumps to the implementation of the <code>read</code> system call.</p></li><li><p>After the system call returns, the kernel restores the user space page table and stack, then control resumes at the next user instruction.</p></li></ul><p>Now that we have this high-level overview, let&#8217;s look inside the Linux kernel&#8217;s syscall handler to understand each step in more detail.</p><h2>Inside the Linux Syscall Handler</h2><p>When a system call is invoked, the CPU jumps into the kernel&#8217;s designated system call handler. The following diagram shows the Linux kernel code for this handler for the x86-64 architecture from the file <a href="https://elixir.bootlin.com/linux/v6.12/source/arch/x86/entry/entry_64.S">entry_64.S</a>. In the diagram, you can see the set of steps the kernel needs to perform before it can actually execute the system call. 
Let&#8217;s briefly discuss each of these.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YXRa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21546e-2b50-4079-95a1-fb9be0d6bd05_1180x740.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YXRa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21546e-2b50-4079-95a1-fb9be0d6bd05_1180x740.png 424w, https://substackcdn.com/image/fetch/$s_!YXRa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21546e-2b50-4079-95a1-fb9be0d6bd05_1180x740.png 848w, https://substackcdn.com/image/fetch/$s_!YXRa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21546e-2b50-4079-95a1-fb9be0d6bd05_1180x740.png 1272w, https://substackcdn.com/image/fetch/$s_!YXRa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21546e-2b50-4079-95a1-fb9be0d6bd05_1180x740.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YXRa!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21546e-2b50-4079-95a1-fb9be0d6bd05_1180x740.png" width="1200" height="752.542372881356" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3b21546e-2b50-4079-95a1-fb9be0d6bd05_1180x740.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c6a0cc24-9db2-4892-b31e-88ca13559cf0_1180x740.png&quot;,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:740,&quot;width&quot;:1180,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:180885,&quot;alt&quot;:&quot;Actual x86-64 syscall entry code from Linux kernel (entry_64.S), annotated to show the steps the kernel performs before invoking the system call.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/172480326?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6a0cc24-9db2-4892-b31e-88ca13559cf0_1180x740.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="Actual x86-64 syscall entry code from Linux kernel (entry_64.S), annotated to show the steps the kernel performs before invoking the system call." title="Actual x86-64 syscall entry code from Linux kernel (entry_64.S), annotated to show the steps the kernel performs before invoking the system call." 
srcset="https://substackcdn.com/image/fetch/$s_!YXRa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21546e-2b50-4079-95a1-fb9be0d6bd05_1180x740.png 424w, https://substackcdn.com/image/fetch/$s_!YXRa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21546e-2b50-4079-95a1-fb9be0d6bd05_1180x740.png 848w, https://substackcdn.com/image/fetch/$s_!YXRa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21546e-2b50-4079-95a1-fb9be0d6bd05_1180x740.png 1272w, https://substackcdn.com/image/fetch/$s_!YXRa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21546e-2b50-4079-95a1-fb9be0d6bd05_1180x740.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Actual x86-64 syscall entry code from the Linux kernel (<code>entry_64.S</code>), annotated to show the steps the kernel performs before invoking the system call.</figcaption></figure></div><h3>Swapping the GS Register</h3><p>GS is a segment register in the x86 architecture. In user space it can be used for <a href="https://en.wikipedia.org/wiki/Thread-local_storage">thread-local storage</a> (TLS), although on x86-64 Linux glibc uses the FS register for that purpose. In kernel space GS is used for implementing per-CPU variables, such as a pointer to the currently executing task. So, the first thing the kernel does is restore the kernel-mode value of the GS register, using the <code>swapgs</code> instruction.</p><h3>Switching to Kernel Page Table and Kernel Stack</h3><p>The Linux kernel has its own page table with mappings for kernel memory pages. To access its memory, it must first switch to this page table, which it does with the <code>SWITCH_TO_KERNEL_CR3</code> macro. </p><blockquote><p><em>On x86, the CR3 control register holds the address of the root of the page table. This is why the macro for switching page tables is called </em><code>SWITCH_TO_KERNEL_CR3</code>.</p></blockquote><p>Separately, the kernel has its own fixed-size stack for executing kernel-side code. At this point the <code>rsp</code> register still points to the user space stack, so the kernel saves it in a scratch space and then loads its own stack pointer from a per-CPU variable.</p><p>When returning from the system call, the kernel restores the user page table and stack by reversing these operations. 
This code is not shown in the diagram but happens right after the &#8220;<code>call do_syscall_64</code>&#8221; step.</p><h3>Saving User Space Registers</h3><p>At this time, the CPU registers still contain the values they had while executing user space code. Kernel code would overwrite them, so the kernel first saves their values on the kernel stack. After that, it sanitizes those registers for security. All of this can be seen in boxes 3 and 4 in the diagram.</p><h3>Mitigations Against Speculative Execution Attacks</h3><p>The next three steps in the code are:</p><ul><li><p>Enabling IBRS (<a href="https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/indirect-branch-restricted-speculation.html">indirect branch restricted speculation</a>)</p></li><li><p>Untraining the return stack buffer</p></li><li><p>Clearing the branch history buffer</p></li></ul><p>These steps mitigate speculative execution attacks, such as <a href="https://en.wikipedia.org/wiki/Spectre_(security_vulnerability)">Spectre</a> (v1 and v2) and <a href="https://en.wikipedia.org/wiki/Retbleed">Retbleed</a>. Speculative execution is an optimization in modern processors where the CPU predicts the outcome of branches in the code and speculatively executes instructions along the predicted path. When the predictions are accurate, this significantly improves the performance of the code. </p><p>However, vulnerabilities have been found where a malicious user program may train the branch predictor in ways that cause the CPU to speculatively execute along attacker&#8209;chosen paths inside the kernel. While these speculative paths do not change the logical flow of kernel execution, they can leak information through microarchitectural side&#8209;channels such as the cache. </p><p>These mitigations prevent user&#8209;controlled branch predictor state from influencing speculative execution in the kernel. 
But, these also come at a great performance cost. We will revisit these in detail later, when discussing the impact of system calls on branch prediction. </p><h3>Executing the System Call and Returning Back to User Space</h3><p>After all of this setup, the kernel finally calls the function <code>do_syscall_64</code>. This is where the actual system call gets invoked. We will not look inside of this function because our focus is on performance impact rather than a walkthrough of kernel code.</p><p>Once the system call is done, the <code>do_syscall_64</code> function returns. The kernel then restores the user space state, including registers, page table, and stack, and returns control back to user space. The following diagram shows the code after the <code>do_syscall_64</code> call to highlight this part.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yEHa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea94fab-ccf9-41d2-ad43-8e7b10f9cbc0_1051x960.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yEHa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea94fab-ccf9-41d2-ad43-8e7b10f9cbc0_1051x960.png 424w, https://substackcdn.com/image/fetch/$s_!yEHa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea94fab-ccf9-41d2-ad43-8e7b10f9cbc0_1051x960.png 848w, https://substackcdn.com/image/fetch/$s_!yEHa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea94fab-ccf9-41d2-ad43-8e7b10f9cbc0_1051x960.png 1272w, 
https://substackcdn.com/image/fetch/$s_!yEHa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea94fab-ccf9-41d2-ad43-8e7b10f9cbc0_1051x960.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yEHa!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea94fab-ccf9-41d2-ad43-8e7b10f9cbc0_1051x960.png" width="1200" height="1096.0989533777356" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dea94fab-ccf9-41d2-ad43-8e7b10f9cbc0_1051x960.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/467434e2-c5c8-4796-ac36-0b196637cc65_1051x960.png&quot;,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:960,&quot;width&quot;:1051,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:181388,&quot;alt&quot;:&quot;Actual x86-64 syscall exit path code from Linux kernel (entry_64.S), showing how the kernel restores user registers, page tables, and state before returning control to user space.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/172480326?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F467434e2-c5c8-4796-ac36-0b196637cc65_1051x960.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="Actual x86-64 syscall exit path code from Linux kernel (entry_64.S), showing how the kernel restores user registers, page tables, and state before returning control to user space." 
title="Actual x86-64 syscall exit path code from Linux kernel (entry_64.S), showing how the kernel restores user registers, page tables, and state before returning control to user space." srcset="https://substackcdn.com/image/fetch/$s_!yEHa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea94fab-ccf9-41d2-ad43-8e7b10f9cbc0_1051x960.png 424w, https://substackcdn.com/image/fetch/$s_!yEHa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea94fab-ccf9-41d2-ad43-8e7b10f9cbc0_1051x960.png 848w, https://substackcdn.com/image/fetch/$s_!yEHa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea94fab-ccf9-41d2-ad43-8e7b10f9cbc0_1051x960.png 1272w, https://substackcdn.com/image/fetch/$s_!yEHa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea94fab-ccf9-41d2-ad43-8e7b10f9cbc0_1051x960.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Actual x86-64 syscall exit path code from Linux kernel (<code>entry_64.S</code>), showing how the kernel restores user registers, page tables, and state before returning control to user space.</figcaption></figure></div><p>Now that we have seen all the code the kernel executes to enter and exit a system call, we are ready to discuss the overheads introduced. There are two categories:</p><ul><li><p>Direct overhead from the code executed on entry and return.</p></li><li><p>Indirect overhead from microarchitectural side-effects (e.g. clearing the branch history buffer and return stack buffer).</p></li></ul><p>The major focus of this article is on discussing the indirect overhead induced due to system calls. But before we go any further, let&#8217;s do a quick benchmark to measure the impact of the direct overheads.</p>
<div><hr></div><h2>Direct Overhead of System Calls</h2><p>Direct overhead is largely fixed across all system calls, since each system call must perform the same entry and exit steps. We can do a rough measurement of this overhead with a simple benchmark by comparing the number of cycles taken to execute the&nbsp;<a href="https://man7.org/linux/man-pages/man3/clock_gettime.3.html">clock_gettime</a>&nbsp;system call in the kernel versus executing it in the user space.</p><p>The&nbsp;<code>clock_gettime</code>&nbsp;system call reads a system clock, such as the realtime clock (seconds since the Unix epoch) or the monotonic clock (seconds since kernel boot). It is very frequently used in software. For example, Java&#8217;s&nbsp;<code>System.currentTimeMillis()</code>&nbsp;and Python&#8217;s&nbsp;<code>time.time()</code>&nbsp;and&nbsp;<code>time.perf_counter()</code>&nbsp;use it under the hood.</p><p>Because system calls are expensive, Linux provides an optimization called&nbsp;<a href="https://en.wikipedia.org/wiki/VDSO">vDSO</a>&nbsp;(virtual dynamic shared object). 
This is a user-space shortcut for selected system calls where the kernel maps the system call's code into each process&#8217;s address space so that&nbsp;they can be executed like a normal function call, avoiding kernel entry.</p><p>So, we can create a benchmark that measures the time taken to execute <code>clock_gettime</code> in the user space using vDSO and compare it against the time taken inside the kernel using the&nbsp;<a href="https://man7.org/linux/man-pages/man2/syscall.2.html">syscall</a>&nbsp;interface. The following code shows the benchmarking program. </p><pre><code>#define _GNU_SOURCE
#include &lt;sys/syscall.h&gt;
#include &lt;unistd.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;
#include &lt;time.h&gt;
#include &lt;x86intrin.h&gt;

int main(void) {
  const int ITERS = 100000;
  uint32_t cpuid; // receives the IA32_TSC_AUX value written by rdtscp (unused)
  struct timespec ts;
  
  // Warm up both syscall and libc versions
  for (int i = 0; i &lt; 10000; i++) {
    syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &amp;ts);
    clock_gettime(CLOCK_MONOTONIC, &amp;ts);
  }

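  // Test 1: Direct syscall interface (always enters the kernel)
  // lfence: wait for earlier instructions to complete before reading the TSC
  _mm_lfence();
  uint64_t start1 = __rdtsc();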
  long sink1 = 0;
  for (int i = 0; i &lt; ITERS; i++) {
    long ret = syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &amp;ts);
    sink1 += ret + ts.tv_sec + ts.tv_nsec; // use the results to prevent optimization
  }
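  // rdtscp waits for prior instructions to execute before reading the TSC;
  // the trailing lfence keeps later instructions from starting early
  uint64_t end1 = __rdtscp(&amp;cpuid);
  _mm_lfence();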

  // Test 2: libc clock_gettime (normally served by the vDSO, without kernel entry)
  _mm_lfence();
  uint64_t start2 = __rdtsc();
  long sink2 = 0;
  for (int i = 0; i &lt; ITERS; i++) {
    int ret = clock_gettime(CLOCK_MONOTONIC, &amp;ts);
    sink2 += ret + ts.tv_sec + ts.tv_nsec; // use the results to prevent optimization
  }
  uint64_t end2 = __rdtscp(&amp;cpuid);
  _mm_lfence();

  // Prevent dead-code removal
  if (sink1 == 42 || sink2 == 42) fprintf(stderr, "x\n");

  double cycles_per_syscall = (double)(end1 - start1) / ITERS;
  double cycles_per_libc = (double)(end2 - start2) / ITERS;
  
  printf("Direct syscall cycles per call ~ %.1f\n", cycles_per_syscall);
  printf("Libc wrapper cycles per call ~ %.1f\n", cycles_per_libc);
  printf("Difference ~ %.1f cycles (%.1f%% %s)\n", 
         cycles_per_libc - cycles_per_syscall,
         100.0 * (cycles_per_libc - cycles_per_syscall) / cycles_per_syscall,
         cycles_per_libc &gt; cycles_per_syscall ? "slower" : "faster");
  
  return 0;
}
</code></pre><blockquote><p><strong>A note on rdtsc</strong>: Normally, you would use <code>clock_gettime()</code> to measure timings. But here we are benchmarking <code>clock_gettime()</code> itself, so we need something more precise. <code>rdtsc</code> is an x86 instruction that reads the value of a 64&#8209;bit timestamp counter (TSC) in the CPU. This counter ticks at a fixed frequency (e.g. 2.3 GHz on my machine). By reading it before and after an operation, we can tell how many TSC ticks the operation took. Because the counter runs at a fixed rate, these are reference cycles, which can differ from actual core clock cycles when the CPU changes frequency.</p></blockquote><p>The program produces the following output on my laptop:</p><pre><code>&#10140; ./clock_gettime_comparison 
Direct syscall cycles per call ~ 1428.8
Libc wrapper cycles per call ~ 157.0
Difference ~ -1271.9 cycles (-89.0% faster)</code></pre><p>The vDSO version is roughly 9x faster (about 157 versus 1,429 cycles per call), showing how costly the syscall entry/exit path is compared to a plain function call. </p><blockquote><p><em>We should take this estimate with a grain of salt because in the benchmark we are measuring inside a loop, and the performance of the loop itself can suffer from the indirect side&#8209;effects of entering and exiting the kernel, which is our next topic.</em></p></blockquote><p>While this benchmark gives a rough estimate of the direct overhead, real&#8209;world performance also suffers from indirect costs due to CPU microarchitectural effects. Let&#8217;s explore those next.</p><h2>Indirect Overhead of System Calls</h2><p>System calls also incur indirect costs, because the kernel&#8217;s entry path disturbs the CPU&#8217;s microarchitectural state. The loss of this state can transiently degrade the performance of the user space code after the system call returns.</p><p>At the microarchitecture level, the CPU implements several optimizations such as instruction pipelining, superscalar execution and branch prediction. These are designed to improve the instruction throughput of the program, i.e., how many instructions the CPU can execute each cycle. A higher throughput means faster program execution.</p><p>It can take a few cycles for the CPU to get to a steady state where these optimizations start to pay off, but making system calls can lead to the loss of this state and a drop in the performance of the program.</p><p>We will cover the indirect costs of system calls by discussing the different components of the microarchitecture that are impacted, starting from the instruction pipeline, followed by the branch predictor buffers.</p><h3>Effect on the Instruction Pipeline</h3><p>None of the kernel code we saw touches the instruction pipeline; this effect comes from the CPU itself. 
Before switching to kernel mode, the CPU drains the instruction pipeline to ensure that the user space code does not interfere when the kernel code executes. This impacts the performance of the user space code when the system call returns. To understand how, we need to revisit the basics of instruction pipelining.</p><p>CPUs have multiple execution resources, such as registers, execution units, and load and store buffers. To use all of these effectively, the CPU must execute multiple program instructions in parallel. This is made possible through instruction pipelining and superscalar architecture.</p><p>Instruction pipelining breaks down the execution of an instruction into several stages, like an assembly line in a factory. An instruction moves from one stage to the next in each CPU cycle, enabling the CPU to start executing one new instruction each cycle. </p><p>For example, the following diagram shows a 5-stage pipeline. You can see that it takes five cycles for the pipeline to fill completely, and for the first instruction to retire. After this point, the pipeline is in a steady state, and it can provide a throughput of one instruction per cycle. This is a very simplistic example; modern x86 processors have much deeper pipelines, e.g. 20-30 stages. 
</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gGGF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cbbeb2e-ed04-4f44-bb88-683713a2a740_607x133.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gGGF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cbbeb2e-ed04-4f44-bb88-683713a2a740_607x133.png 424w, https://substackcdn.com/image/fetch/$s_!gGGF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cbbeb2e-ed04-4f44-bb88-683713a2a740_607x133.png 848w, https://substackcdn.com/image/fetch/$s_!gGGF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cbbeb2e-ed04-4f44-bb88-683713a2a740_607x133.png 1272w, https://substackcdn.com/image/fetch/$s_!gGGF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cbbeb2e-ed04-4f44-bb88-683713a2a740_607x133.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gGGF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cbbeb2e-ed04-4f44-bb88-683713a2a740_607x133.png" width="607" height="133" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7cbbeb2e-ed04-4f44-bb88-683713a2a740_607x133.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:133,&quot;width&quot;:607,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Example of a simple 5-stage instruction pipeline 
(Fetch, Decode, Memory Read, ALU, Memory Write), showing how multiple instructions overlap in execution across cycles.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Example of a simple 5-stage instruction pipeline (Fetch, Decode, Memory Read, ALU, Memory Write), showing how multiple instructions overlap in execution across cycles." title="Example of a simple 5-stage instruction pipeline (Fetch, Decode, Memory Read, ALU, Memory Write), showing how multiple instructions overlap in execution across cycles." srcset="https://substackcdn.com/image/fetch/$s_!gGGF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cbbeb2e-ed04-4f44-bb88-683713a2a740_607x133.png 424w, https://substackcdn.com/image/fetch/$s_!gGGF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cbbeb2e-ed04-4f44-bb88-683713a2a740_607x133.png 848w, https://substackcdn.com/image/fetch/$s_!gGGF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cbbeb2e-ed04-4f44-bb88-683713a2a740_607x133.png 1272w, https://substackcdn.com/image/fetch/$s_!gGGF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cbbeb2e-ed04-4f44-bb88-683713a2a740_607x133.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Example of a simple 5-stage instruction pipeline (Fetch, Decode, Memory Read, ALU, Memory Write), showing how multiple instructions overlap in execution across cycles.</figcaption></figure></div><p>Modern processors are also superscalar. 
They have multiple such pipelines to issue and execute multiple new instructions each cycle. For example, a 4-wide processor can start executing up to 4 new instructions each cycle and it can retire up to 4 instructions each cycle. If such a CPU has a pipeline depth of 20, then it can have up to 80 instructions in flight in a steady state.</p><p>This means that the processor is normally busy executing dozens of user-space instructions in parallel. But when a system call occurs, the CPU must first ensure all pending user instructions finish before it can jump into the kernel.</p><p>So, when the system call returns to user space, you can imagine that the instruction pipeline is almost empty because the CPU did not allow the instructions following <code>syscall</code> to execute. At this point the pipeline has to start almost from scratch, and it can take a while until it reaches a steady throughput again.</p><p>Contrast this with the scenario where no system call occurs: the CPU remains in its steady state, pipelines stay full, and instruction throughput stays high. In other words, a single system call can derail the momentum of dozens of in&#8209;flight instructions.</p><blockquote><p>On x86-64, the <a href="https://www.felixcloutier.com/x86/syscall">syscall instruction</a> is used to execute a system call. 
The Intel manual has this note about it:</p><p><em>&#8220;<strong>Instruction ordering:</strong> Instructions following a SYSCALL may be fetched from memory before earlier instructions complete execution, but they will not execute (even speculatively) until all instructions prior to the SYSCALL have completed execution (the later instructions may execute before data stored by the earlier instructions have become globally visible).&#8221;</em></p><p>This confirms that the CPU drains the pipeline before transferring control to the kernel.</p></blockquote><h2>Effect on Branch Prediction</h2><p>The next major indirect impact system calls have on user space performance is through the clearing of the branch predictor buffers. It stems from the three mitigations we saw in the kernel entry code above:</p><ul><li><p>Clearing the branch history buffer</p></li><li><p>Untraining the return stack buffer</p></li><li><p>Enabling/disabling the IBRS</p></li></ul><p>The first two of these have a profound indirect impact on user code performance. Enabling and disabling IBRS does not impact user space performance; it only adds direct overhead to syscall execution. However, I discuss it here because logically it goes with the topic of branch prediction. In this section, we will first review branch prediction and then talk about each of these.</p><h3>Understanding Branch Prediction</h3><p>Instruction pipelining and superscalar execution enable CPUs to execute multiple instructions in parallel, and they execute these instructions out-of-order.</p><p>When the CPU comes across a branching instruction, such as an if condition, it may not know the result of the condition because the instructions that compute it may still be executing. 
If the CPU waits for those instructions to finish to know the branch outcome, the pipeline can be stalled for a long time, which means poor performance.</p><p>To optimize this, CPUs come with a branch predictor, which predicts the outcome and target address of these branches based on past branching patterns. This enables the CPU to speculatively execute the instructions from the predicted address and stay busy. If the prediction turns out to be correct, then the CPU saves a lot of cycles and instruction throughput remains high.</p><p>However, when the prediction is wrong, the CPU has to discard the results of these speculatively executed instructions, flush the instruction pipeline, and fetch the instructions from the right address. This can cost 20-30 cycles on modern CPUs (depending on the depth of the pipeline).</p><h3>Clearing the Branch History Buffer</h3><p>We saw in the kernel code that it invokes the macro <code>CLEAR_BRANCH_HISTORY</code> which clears the branch history buffer (BHB).</p><p>The BHB is a buffer in the branch predictor that learns the branching history patterns at a global level. This helps the branch predictor predict the outcomes of deeply nested and complex branching patterns more accurately. You can think of it as remembering the last few intersections you passed to better predict where you&#8217;ll turn next.</p><p>But it can take a while for the BHB to collect enough history for the branch predictor to generate accurate predictions. So, whenever you execute a system call in your code, if the kernel clears the BHB, you lose all that state. As a result, your user space code may experience an increased rate of branch mispredictions after returning from the system call. This can significantly degrade the performance of user space applications.</p><blockquote><p><strong>Note on recent CPUs:</strong> This clearing of the BHB was added to the kernel as a mitigation against speculative execution attacks, such as Spectre V2. 
In recent years, CPU vendors have introduced hardware mitigations which obviate the need for the kernel to clear the BHB. For example, the Intel advisory says that if your CPU comes with the &#8220;<a href="https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/speculative-execution-side-channel-mitigations.html">enhanced IBRS</a>&#8221; feature (we discuss IBRS below), then there is no need to clear the BHB. So, not all CPUs suffer degraded performance due to this.</p><p>If you want to check whether your kernel clears the BHB, you can check the <a href="https://man7.org/linux/man-pages/man1/lscpu.1.html">lscpu</a> output. If you see &#8220;<code>BHI SW loop</code>&#8221; in the vulnerability section, it means that the kernel clears the BHB during system calls.</p><p>Also, if you believe that you will never execute untrusted code, you can manually disable the mitigation through a boot-time flag.</p></blockquote><h3>Untraining the Return Stack Buffer</h3><p>Next in line is the untraining of the return stack buffer (RSB). The RSB is another buffer in the branch predictor that is used to predict the return address of function calls.</p><p>But why does it need to predict the return address? It again comes down to out-of-order execution. The CPU may want to execute the return instruction even though other instructions of the function may still be executing. At this point, the CPU does not know the return address. The return address is stored on the process&#8217;s stack memory, but accessing memory is slow. So, the CPU uses the RSB to predict the return address.</p><p>On every function call, the CPU pushes the return address into the RSB. While executing the return instruction, the CPU pops this buffer and jumps to that address. Because this buffer sits right in the CPU, it is very fast to access.</p><p>However, this also led to vulnerabilities such as <a href="https://en.wikipedia.org/wiki/Retbleed">Retbleed</a>. 
In this attack, carefully chosen user&#8209;space code could influence how the CPU predicted kernel return addresses, so that the CPU speculatively executed instructions at the wrong place inside the kernel. While this speculative execution did not change the actual kernel logic, it could leak information through side&#8209;channels. To prevent this, the kernel untrains the RSB on entering the kernel.</p><p>Untraining the RSB impacts the performance of the user space code when the system call returns because the RSB no longer holds the state it had built up. Without a trained RSB, the CPU falls back to a slower indirect branch predictor which is more likely to mispredict.</p><blockquote><p><strong>Note on CPUs Impacted</strong>: The kernel does not clear the RSB for all CPU models. The kernel applies this untraining mainly on AMD CPUs affected by Retbleed and <a href="https://docs.kernel.org/admin-guide/hw-vuln/srso.html">SRSO</a>; Retbleed also affects some older Intel CPUs, but the kernel typically handles those through other mitigations. Also, if your CPU has hardware mitigations, such as enhanced IBRS, then the kernel does not perform this (the <code>UNTRAIN_RET</code> macro becomes a no-op on such devices).</p><p>Again, the kernel allows you to disable the mitigation, but do this only when you are sure that you will never run untrusted code.</p></blockquote><h3>IBRS Entry and Exit</h3><p>Finally, let&#8217;s talk about indirect branch restricted speculation (IBRS). We saw that the kernel executes <code>IBRS_ENTER</code> on entering the syscall and <code>IBRS_EXIT</code> when returning. So, what is IBRS and what is its impact on performance?</p><p>IBRS is a hardware feature which restricts the indirect branch predictor when executing in kernel mode. Effectively, it prevents the user space training of the indirect branch predictor from having any effect on indirect branch prediction inside the kernel.</p><p>Indirect branches are those branches in code where the target address is not part of the instruction but is known only at runtime. 
A common example is calling through a function pointer in C (e.g., <code>(*fp)()</code>), where the actual target depends on which function the pointer holds at that moment. Other examples are a virtual function call in C++ and a jump table generated for a large switch statement. In all these cases, the CPU can use the indirect branch predictor to guess the likely target address based on past branching history.</p><p>When Spectre and related vulnerabilities were found, one of the attack vectors involved tricking the CPU into mispredicting indirect branch targets inside the kernel. By influencing the branch predictor state from user space, attackers could cause the CPU to speculatively execute instructions at unintended locations in the kernel. This could leak sensitive kernel data through side-channels such as the cache.</p><p>The mitigation for this attack is to restrict the indirect branch predictor when executing in kernel mode via the IBRS mechanism. Enabling and disabling IBRS itself doesn&#8217;t have any impact on the performance of the user space code, but the act of executing extra instructions to do this during each system call adds overhead.</p><p>However, recent CPUs have a feature called enhanced IBRS which automatically enables IBRS when switching to kernel mode. On such devices, the <code>IBRS_ENTER</code> and <code>IBRS_EXIT</code> macros in the kernel become no-ops.</p><div><hr></div><p>Together, these mitigations explain why the indirect cost of system calls can vary significantly across CPU generations and configurations. In practice, this means a single system call can not only drain the pipeline but also leave the branch predictor partially blind, forcing the CPU to relearn patterns and slowing down your code until it recovers. The important point is that the true cost of a system call is not just the handful of instructions executed in the kernel, but also the disruption it causes to the CPU&#8217;s optimizations. 
This makes system calls far more expensive than they look on the surface, and it is why minimizing them can be such a powerful optimization strategy. However, CPU vendors are gradually adding hardware mitigations, which are making these software-based mitigations obsolete and reducing the performance overhead.</p><div><hr></div><h2>Practical Ways to Reduce System Calls</h2><p>So what can you do as a developer? A few practical ideas:</p><ul><li><p><strong>Use vDSO</strong>: For calls like <code>clock_gettime</code>, prefer the vDSO path to avoid kernel entry.</p></li><li><p><strong>Cache cheap values</strong>: Some values obtained through system calls rarely change during a program&#8217;s lifetime. If you can safely cache them once and reuse them, you can avoid repeated system calls.</p></li><li><p><strong>Optimize I/O System Calls</strong>: There are various strategies and patterns that you can use to optimize I/O related system calls. For example:</p><ul><li><p>Prefer buffered I/O instead of raw read/write system calls</p></li><li><p>Use scatter/gather operations like <code>readv</code>/<code>writev</code> to batch multiple buffers</p></li><li><p>If your system allows, use <code>mmap</code> instead of repeated read/write calls.</p></li></ul></li><li><p><strong>Batch operations</strong>: Interfaces like <a href="https://man7.org/linux/man-pages/man7/io_uring.7.html">io_uring</a> let you submit many I/O requests to a shared queue in user space, which the kernel can then process in batches. This reduces the number of times your program needs to cross into the kernel.</p></li><li><p><strong>Push work into the kernel</strong>: With <a href="https://ebpf.io/">eBPF</a> it is increasingly possible to move parts of application logic into the kernel itself. Beyond traditional use cases like packet filtering, newer frameworks let you offload tasks such as policy enforcement, monitoring, and even parts of data processing. 
In these cases, instead of making repeated system calls, the user program loads small programs into the kernel that run directly when events occur, avoiding crossings altogether.</p></li></ul><p>None of these tricks are magic, but they all follow the same principle: fewer crossings mean less disruption. Every time you avoid a system call, you&#8217;re saving not just a function call into the kernel, but also the hidden costs of the CPU recovering its state.</p><div><hr></div><h2>Wrapping Up</h2><p>We&#8217;ve gone through a lot of detail for what looks like just a small stretch of kernel code. The point is simple: the cost of a system call goes beyond the small number of instructions that execute in the kernel. It disrupts the CPU&#8217;s rhythm by draining pipelines, resetting predictors, and forcing everything to start fresh. That&#8217;s why system calls show up as hot spots in profiles and why people try so hard to avoid them.</p><p>The strategies we looked at earlier (vDSO, caching, optimizing I/O, batching with io_uring, and pushing work into the kernel) are all ways to cut down on this disruption. They won&#8217;t remove the cost of system calls entirely, but they can make the difference between code that spends most of its time waiting on the kernel and code that keeps the CPU running at full speed.</p><p>System calls are the interface to the kernel and the hardware. They are necessary, but they come at a cost. 
Understanding and managing that cost is a key part of writing faster software.</p>]]></content:encoded></item><item><title><![CDATA[How to Leverage the CPU’s Micro-Op Cache for Faster Loops]]></title><description><![CDATA[Measuring, analyzing, and optimizing loops using Linux perf, Top-Down Microarchitectural Analysis, and the CPU&#8217;s micro-op cache]]></description><link>https://blog.codingconfessions.com/p/how-to-leverage-the-cpus-micro-op</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/how-to-leverage-the-cpus-micro-op</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Fri, 15 Aug 2025 05:37:44 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/171026971/ae2b0064d356aecff075b296b8fcbac4.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Performance engineering can be deeply mysterious. Sometimes adding a line of code can make your program execute 2&#215; faster. These behaviors are impossible to explain unless you understand the processor microarchitecture and compiler optimization tricks.</p><p>In this video, I show how adding a single line of code to a slow-running program makes it run 2&#215; faster. You&#8217;ll see how this one change helped the compiler arrange instructions in memory so the CPU could fetch them from its <em>micro-op cache</em> instead of decoding them every time, a huge win for hot loops.</p><p>On Intel processors, this micro-op cache is known as the <strong>Decoded Stream Buffer (DSB)</strong>. 
It&#8217;s designed specifically to accelerate hot paths in your code by caching pre-decoded instructions, so the CPU can skip the expensive fetch/decode stages entirely. Understanding when and how the DSB kicks in is key to unlocking this kind of speedup.</p><p>If you&#8217;re curious about controlling the hardware and squeezing out every last ounce of performance, you should watch the video.</p><p>Along the way, we&#8217;ll cover:</p><ul><li><p>Measuring performance with <strong>Linux perf</strong></p></li><li><p>Using <strong>Top-Down Microarchitectural Analysis (TMA)</strong> to pinpoint hardware bottlenecks</p></li><li><p>Understanding what the DSB is and when it&#8217;s used</p></li><li><p>Forcing the compiler to take advantage of it with <strong>code alignment</strong> and <strong>profile-guided optimization</strong></p></li></ul><p>The result is a 2&#215; faster loop and a set of techniques that you can use for debugging and optimizing your own loops.</p><div><hr></div><h2><strong>What&#8217;s Next</strong></h2><p>In this video, I showed how one condition affects whether the processor can use the DSB, and fixing it cut the bottleneck roughly in half. But if you run a top-down analysis again, you&#8217;ll still see some DSB stalls. That&#8217;s because there are other conditions that also influence DSB usage. In the next video, I&#8217;ll dive into one of those remaining conditions and show how to eliminate more of the bottleneck. 
In the meantime, why don&#8217;t you experiment and see if you can identify and fix it yourself?</p>
   ]]></content:encoded></item><item><title><![CDATA[Big O vs Hardware: Better Complexity ≠ Better Performance]]></title><description><![CDATA[Why Your O(log n) Algorithm Might Lose to O(n)]]></description><link>https://blog.codingconfessions.com/p/big-o-vs-hardware</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/big-o-vs-hardware</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Sun, 03 Aug 2025 18:37:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6Zs4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ed3f569-5b2a-4362-ab67-722e4b26fe8d_1536x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6Zs4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ed3f569-5b2a-4362-ab67-722e4b26fe8d_1536x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6Zs4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ed3f569-5b2a-4362-ab67-722e4b26fe8d_1536x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!6Zs4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ed3f569-5b2a-4362-ab67-722e4b26fe8d_1536x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6Zs4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ed3f569-5b2a-4362-ab67-722e4b26fe8d_1536x1024.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!6Zs4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ed3f569-5b2a-4362-ab67-722e4b26fe8d_1536x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6Zs4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ed3f569-5b2a-4362-ab67-722e4b26fe8d_1536x1024.jpeg" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ed3f569-5b2a-4362-ab67-722e4b26fe8d_1536x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:91083,&quot;alt&quot;:&quot;Cover image: Big O vs Hardware&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/169307156?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ed3f569-5b2a-4362-ab67-722e4b26fe8d_1536x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Cover image: Big O vs Hardware" title="Cover image: Big O vs Hardware" srcset="https://substackcdn.com/image/fetch/$s_!6Zs4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ed3f569-5b2a-4362-ab67-722e4b26fe8d_1536x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!6Zs4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ed3f569-5b2a-4362-ab67-722e4b26fe8d_1536x1024.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!6Zs4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ed3f569-5b2a-4362-ab67-722e4b26fe8d_1536x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6Zs4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ed3f569-5b2a-4362-ab67-722e4b26fe8d_1536x1024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Big O vs Hardware</figcaption></figure></div><p>In algorithm design, we often rely on time complexity to 
compare solutions. It tells us how the work done by an algorithm grows with the size of its input. But real-world performance also depends on how well the code runs on hardware.</p><p>In a previous article, we explored the <a href="https://blog.codingconfessions.com/p/one-law-to-rule-all-code-optimizations">Iron Law of performance</a>, which states that a program&#8217;s performance on the hardware depends on two factors:</p><ul><li><p><strong>Instruction count:</strong> fewer instructions usually mean faster execution.</p></li><li><p><strong>Instructions per cycle (IPC):</strong> the more instructions a CPU can retire per cycle, the better.</p></li></ul><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Performance} \\propto \\frac{IPC}{\\text{Instruction count} \\times {\\text{Clock Cycle Time}}}&quot;,&quot;id&quot;:&quot;SLXDPJTSNQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>By definition, algorithms with better time complexity tend to do less work. But they don&#8217;t always perform better if that work is harder for the CPU. A lower instruction count doesn&#8217;t help if each instruction is expensive or slows down IPC.</p><p>In this article, we&#8217;ll see a concrete example of this tradeoff. We&#8217;ll compare three algorithms for computing the greatest common divisor (GCD), study their time complexities, benchmark their real-world performance, and use the Iron Law to understand what&#8217;s really going on. 
As we will see, a better time complexity is no guarantee of better performance; a hardware-friendly implementation also matters.</p><div><hr></div><h2>Euclid&#8217;s Subtraction-based Algorithm for Computing GCD</h2><p>The first algorithm we will study is Euclid&#8217;s subtraction-based algorithm for computing the GCD of two integers.</p><p>The GCD of two integers a and b is defined as the greatest integer that divides both a and b. For example, the GCD of 12 and 8 is 4. Euclid&#8217;s subtraction-based algorithm for computing this is shown in the following code block.</p><pre><code>long gcd(long a, long b)
{
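    /* Assumes a &gt; 0 and b &gt; 0; otherwise the loop may never terminate. */
    while (a != b) {
        if (a &gt; b) {
            a -= b;   /* subtract the smaller value from the larger one */
        } else {
            b -= a;
        }
    }
    return a;         /* a == b here, and both equal the GCD */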
}</code></pre><p>The algorithm keeps removing the smaller number from the larger one until both of them become equal. Let&#8217;s trace an example with a = 84, b = 18.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Rt9A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde92ecfd-2bf3-4af4-9667-86a15bf1c9a8_734x308.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Rt9A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde92ecfd-2bf3-4af4-9667-86a15bf1c9a8_734x308.png 424w, https://substackcdn.com/image/fetch/$s_!Rt9A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde92ecfd-2bf3-4af4-9667-86a15bf1c9a8_734x308.png 848w, https://substackcdn.com/image/fetch/$s_!Rt9A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde92ecfd-2bf3-4af4-9667-86a15bf1c9a8_734x308.png 1272w, https://substackcdn.com/image/fetch/$s_!Rt9A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde92ecfd-2bf3-4af4-9667-86a15bf1c9a8_734x308.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Rt9A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde92ecfd-2bf3-4af4-9667-86a15bf1c9a8_734x308.png" width="734" height="308" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de92ecfd-2bf3-4af4-9667-86a15bf1c9a8_734x308.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:308,&quot;width&quot;:734,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:31711,&quot;alt&quot;:&quot;A trace of the steps taken by the subtraction-based GCD algorithm for a=84, b=18&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/169307156?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde92ecfd-2bf3-4af4-9667-86a15bf1c9a8_734x308.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A trace of the steps taken by the subtraction-based GCD algorithm for a=84, b=18" title="A trace of the steps taken by the subtraction-based GCD algorithm for a=84, b=18" srcset="https://substackcdn.com/image/fetch/$s_!Rt9A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde92ecfd-2bf3-4af4-9667-86a15bf1c9a8_734x308.png 424w, https://substackcdn.com/image/fetch/$s_!Rt9A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde92ecfd-2bf3-4af4-9667-86a15bf1c9a8_734x308.png 848w, https://substackcdn.com/image/fetch/$s_!Rt9A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde92ecfd-2bf3-4af4-9667-86a15bf1c9a8_734x308.png 1272w, https://substackcdn.com/image/fetch/$s_!Rt9A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde92ecfd-2bf3-4af4-9667-86a15bf1c9a8_734x308.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">A trace of the steps taken by the subtraction-based GCD algorithm for a=84, b=18</figcaption></figure></div><p>As you can see in the above table, the algorithm converges in 7 steps.</p><p>Now, let&#8217;s think about the worst case time complexity of this algorithm. At each step, the algorithm reduces the larger value by the smaller one, so the step size is equal to <code>min(a, b)</code>. The smallest step possible is 1, which happens when either <code>a=1</code> or <code>b=1</code>. In that case, the algorithm will take about as many steps as <code>max(a, b)</code>, giving us a worst case time complexity of <code>O(max(a, b))</code>.</p><p>In simpler terms, the number of steps grows linearly with the magnitude of the inputs in the worst case. Whenever one value becomes much smaller than the other, the algorithm needs a large number of subtractions to close the gap.</p><p>This subtraction-based approach works, but there&#8217;s a version that converges faster using division instead of repeated subtraction. Let&#8217;s take a look at that.</p><h2>The Modulo-based Euclidean Algorithm for GCD</h2><p>A more efficient variation of the Euclidean algorithm replaces repeated subtraction with division, reducing the number of steps required. The following snippet shows the code.</p><pre><code>long gcd(long a, long b)
{
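    /* Invariant: gcd(a, b) == gcd(b, a % b); stop when b reaches 0. */
    while (b != 0) {
        long t = b;
        b = a % b;   /* the remainder replaces a run of repeated subtractions */
        a = t;
    }
    return a;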
}</code></pre><p>We can trace the same input to see how this version behaves.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K6bT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f7cff31-288b-4c9d-bc3a-df723afe1c46_734x212.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K6bT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f7cff31-288b-4c9d-bc3a-df723afe1c46_734x212.png 424w, https://substackcdn.com/image/fetch/$s_!K6bT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f7cff31-288b-4c9d-bc3a-df723afe1c46_734x212.png 848w, https://substackcdn.com/image/fetch/$s_!K6bT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f7cff31-288b-4c9d-bc3a-df723afe1c46_734x212.png 1272w, https://substackcdn.com/image/fetch/$s_!K6bT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f7cff31-288b-4c9d-bc3a-df723afe1c46_734x212.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K6bT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f7cff31-288b-4c9d-bc3a-df723afe1c46_734x212.png" width="734" height="212" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6f7cff31-288b-4c9d-bc3a-df723afe1c46_734x212.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:212,&quot;width&quot;:734,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:20074,&quot;alt&quot;:&quot;A trace of the steps taken by the modulo-based GCD algorithm for a=84, b=18&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/169307156?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f7cff31-288b-4c9d-bc3a-df723afe1c46_734x212.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A trace of the steps taken by the modulo-based GCD algorithm for a=84, b=18" title="A trace of the steps taken by the modulo-based GCD algorithm for a=84, b=18" srcset="https://substackcdn.com/image/fetch/$s_!K6bT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f7cff31-288b-4c9d-bc3a-df723afe1c46_734x212.png 424w, https://substackcdn.com/image/fetch/$s_!K6bT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f7cff31-288b-4c9d-bc3a-df723afe1c46_734x212.png 848w, https://substackcdn.com/image/fetch/$s_!K6bT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f7cff31-288b-4c9d-bc3a-df723afe1c46_734x212.png 1272w, https://substackcdn.com/image/fetch/$s_!K6bT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f7cff31-288b-4c9d-bc3a-df723afe1c46_734x212.png 1456w" sizes="100vw" 
loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">A trace of the steps taken by the modulo-based GCD algorithm for a=84, b=18</figcaption></figure></div><p>In this case, the algorithm converges in just three steps, compared to the seven taken by the subtraction-based algorithm. At each step the algorithm moves towards the solution by dividing one input by the other and keeping the remainder, which leads to a worst-case time complexity of <code>O(log(max(a, b)))</code>.</p><p>These time complexities give us a theoretical bound on how the algorithms scale, but actual performance can be measured only by running them on real hardware. Let&#8217;s run both algorithms on large inputs to see how their time complexity translates to actual performance.</p><h2>Benchmark #1: Huge Inputs</h2><p>As an example, let&#8217;s compare the performance of the two algorithms on the input <code>a=1000000000</code> and <code>b=9223372036854775503</code>. The following figure shows the timing and other high-level performance metrics collected with the <a href="https://perfwiki.github.io/main/">Linux perf</a> tool.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Dnmq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ab87db-5788-48c5-b59d-24e3f0f7a9e6_945x520.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Dnmq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ab87db-5788-48c5-b59d-24e3f0f7a9e6_945x520.png 424w, 
https://substackcdn.com/image/fetch/$s_!Dnmq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ab87db-5788-48c5-b59d-24e3f0f7a9e6_945x520.png 848w, https://substackcdn.com/image/fetch/$s_!Dnmq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ab87db-5788-48c5-b59d-24e3f0f7a9e6_945x520.png 1272w, https://substackcdn.com/image/fetch/$s_!Dnmq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ab87db-5788-48c5-b59d-24e3f0f7a9e6_945x520.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Dnmq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ab87db-5788-48c5-b59d-24e3f0f7a9e6_945x520.png" width="945" height="520" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62ab87db-5788-48c5-b59d-24e3f0f7a9e6_945x520.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:520,&quot;width&quot;:945,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:85908,&quot;alt&quot;:&quot;perf stat output for the subtraction-based GCD algorithm for the input: a=1000000000 and b=9223372036854775503&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/169307156?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ab87db-5788-48c5-b59d-24e3f0f7a9e6_945x520.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="perf stat output for the subtraction-based GCD algorithm for the input: a=1000000000 and b=9223372036854775503" 
title="perf stat output for the subtraction-based GCD algorithm for the input: a=1000000000 and b=9223372036854775503" srcset="https://substackcdn.com/image/fetch/$s_!Dnmq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ab87db-5788-48c5-b59d-24e3f0f7a9e6_945x520.png 424w, https://substackcdn.com/image/fetch/$s_!Dnmq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ab87db-5788-48c5-b59d-24e3f0f7a9e6_945x520.png 848w, https://substackcdn.com/image/fetch/$s_!Dnmq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ab87db-5788-48c5-b59d-24e3f0f7a9e6_945x520.png 1272w, https://substackcdn.com/image/fetch/$s_!Dnmq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ab87db-5788-48c5-b59d-24e3f0f7a9e6_945x520.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">perf stat output for the subtraction-based GCD algorithm for the input: <code>a=1000000000</code> and <code>b=9223372036854775503</code></figcaption></figure></div><p>The subtraction-based algorithm took 633,437,507 steps and ran in 2,230.71 milliseconds, consuming 9.28 billion CPU cycles. It executed 55.35 billion instructions at a rate of 5.96 instructions per cycle.</p><p>Now, let&#8217;s run the modulo-based algorithm and see how that performs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Xjjy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1906da83-3b72-4825-bb5f-8ea0e7cf1023_945x545.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Xjjy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1906da83-3b72-4825-bb5f-8ea0e7cf1023_945x545.png 424w, https://substackcdn.com/image/fetch/$s_!Xjjy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1906da83-3b72-4825-bb5f-8ea0e7cf1023_945x545.png 848w, 
https://substackcdn.com/image/fetch/$s_!Xjjy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1906da83-3b72-4825-bb5f-8ea0e7cf1023_945x545.png 1272w, https://substackcdn.com/image/fetch/$s_!Xjjy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1906da83-3b72-4825-bb5f-8ea0e7cf1023_945x545.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Xjjy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1906da83-3b72-4825-bb5f-8ea0e7cf1023_945x545.png" width="945" height="545" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1906da83-3b72-4825-bb5f-8ea0e7cf1023_945x545.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:545,&quot;width&quot;:945,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:81501,&quot;alt&quot;:&quot;perf stat output for the modulo-based GCD algorithm for the input: a=1000000000 and b=9223372036854775503&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/169307156?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1906da83-3b72-4825-bb5f-8ea0e7cf1023_945x545.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="perf stat output for the modulo-based GCD algorithm for the input: a=1000000000 and b=9223372036854775503" title="perf stat output for the modulo-based GCD algorithm for the input: a=1000000000 and b=9223372036854775503" 
srcset="https://substackcdn.com/image/fetch/$s_!Xjjy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1906da83-3b72-4825-bb5f-8ea0e7cf1023_945x545.png 424w, https://substackcdn.com/image/fetch/$s_!Xjjy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1906da83-3b72-4825-bb5f-8ea0e7cf1023_945x545.png 848w, https://substackcdn.com/image/fetch/$s_!Xjjy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1906da83-3b72-4825-bb5f-8ea0e7cf1023_945x545.png 1272w, https://substackcdn.com/image/fetch/$s_!Xjjy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1906da83-3b72-4825-bb5f-8ea0e7cf1023_945x545.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">perf stat output for the modulo-based GCD algorithm for the input: a=1000000000 and b=9223372036854775503</figcaption></figure></div><p>As expected, the modulo-based algorithm is dramatically faster, converging in just 22 steps. The CPU executes this in just 0.28 milliseconds, roughly 8,000x faster than the subtraction-based algorithm (2,230.71 ms vs 0.28 ms). It executed only about 1 million instructions in roughly as many cycles.</p><p>This is algorithmic efficiency in action. However, it is not the whole story. The performance of these algorithms also depends on the efficiency of the hardware-level operations. Let&#8217;s talk about that for a moment.</p><h2>Cost of Integer Add vs Integer Division in Hardware</h2><p>Time complexity tells us how the CPU workload scales with input size. For these two algorithms, that work consists of subtraction and modulo operations. Inside the CPU, subtraction is performed by the same execution unit that performs addition, while modulo translates into integer division, which is handled by a different execution unit.</p><p>So, the performance of these algorithms also depends on the efficiency of these fundamental operations. When talking about the efficiency of CPU instructions like these, we usually care about two aspects:</p><ul><li><p><strong>Latency</strong>: How many cycles the CPU takes to execute the instruction. The lower the latency, the better.</p></li><li><p><strong>Throughput</strong>: How many of those instructions the CPU can execute each cycle. A processor may be able to execute more than one instruction of a certain type per cycle. Some instructions are pipelined. 
Even if one takes multiple cycles to finish, the CPU can begin executing another of the same type in the meantime. In addition, the processor may have multiple execution ports that can execute that operation in parallel.</p></li></ul><p>On Intel Skylake processors, an add instruction has a latency of 1 cycle and a throughput of 4 operations per cycle because the hardware has four execution ports capable of performing integer addition. Compilers often take advantage of this high throughput by unrolling loops that are dominated by additions.</p><p>On the other hand, integer division is very expensive. On Intel Skylake, it has a latency of 42-95 cycles. Unlike integer addition, there is only one execution port capable of performing integer division. As a result, you can execute only one integer division operation every 24-90 cycles (as per <a href="https://www.agner.org/optimize/">Agner Fog&#8217;s optimization manual</a>).</p><p>This contrast in the performance of these operations brings the Iron Law into the picture. The subtraction-based algorithm will execute a higher number of instructions, but it will also have a better IPC. On the other hand, the modulo-based algorithm will execute fewer instructions, but it will have a very poor IPC. Performance depends on the tradeoff between instruction count and IPC. Let&#8217;s do another benchmark to see this play out.</p><h2>Benchmark #2: Small Inputs</h2><p>In this next benchmark, we use much smaller inputs to observe how IPC can dominate when instruction counts are similar. 
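</p><p>Before looking at the numbers, it may help to make the Iron Law arithmetic concrete. The following is a minimal sketch, not code from the benchmarks: the function names <code>cycles_from_ipc</code> and <code>seconds_from_cycles</code> are made up for illustration, and any clock frequency passed in is an assumption rather than a measured value (Benchmark #1 implies roughly 4.16 GHz, since 9.28 billion cycles took about 2.23 seconds).</p><pre><code>#include &lt;assert.h&gt;

// Iron Law, rearranged: cycles = instructions / IPC.
double cycles_from_ipc(double instructions, double ipc)
{
    assert(ipc &gt; 0.0);
    return instructions / ipc;
}

// Wall-clock estimate: seconds = cycles / clock frequency in Hz.
// The frequency is an input the caller has to supply (hypothetical here).
double seconds_from_cycles(double cycles, double hz)
{
    assert(hz &gt; 0.0);
    return cycles / hz;
}
</code></pre><p>Plugging in the subtraction run from Benchmark #1, <code>cycles_from_ipc(55.35e9, 5.96)</code> gives roughly 9.29 billion cycles, in line with the 9.28 billion that perf reported. A low instruction count only wins if the IPC does not collapse along with it.</p><p>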
The following table summarizes the performance of these two algorithms for the input where <code>a=130000</code>, <code>b=13</code>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6xJz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5265d1d7-7631-4805-8c2a-5a40088bb7ab_852x260.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6xJz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5265d1d7-7631-4805-8c2a-5a40088bb7ab_852x260.png 424w, https://substackcdn.com/image/fetch/$s_!6xJz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5265d1d7-7631-4805-8c2a-5a40088bb7ab_852x260.png 848w, https://substackcdn.com/image/fetch/$s_!6xJz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5265d1d7-7631-4805-8c2a-5a40088bb7ab_852x260.png 1272w, https://substackcdn.com/image/fetch/$s_!6xJz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5265d1d7-7631-4805-8c2a-5a40088bb7ab_852x260.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6xJz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5265d1d7-7631-4805-8c2a-5a40088bb7ab_852x260.png" width="852" height="260" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5265d1d7-7631-4805-8c2a-5a40088bb7ab_852x260.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:260,&quot;width&quot;:852,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:33708,&quot;alt&quot;:&quot;A side-by-side comparison of the subtraction and modulo based GCD algorithms for the input: a=130000, b=13. The numbers were obtained using perf stat&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/169307156?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5265d1d7-7631-4805-8c2a-5a40088bb7ab_852x260.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A side-by-side comparison of the subtraction and modulo based GCD algorithms for the input: a=130000, b=13. The numbers were obtained using perf stat" title="A side-by-side comparison of the subtraction and modulo based GCD algorithms for the input: a=130000, b=13. 
The numbers were obtained using perf stat" srcset="https://substackcdn.com/image/fetch/$s_!6xJz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5265d1d7-7631-4805-8c2a-5a40088bb7ab_852x260.png 424w, https://substackcdn.com/image/fetch/$s_!6xJz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5265d1d7-7631-4805-8c2a-5a40088bb7ab_852x260.png 848w, https://substackcdn.com/image/fetch/$s_!6xJz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5265d1d7-7631-4805-8c2a-5a40088bb7ab_852x260.png 1272w, https://substackcdn.com/image/fetch/$s_!6xJz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5265d1d7-7631-4805-8c2a-5a40088bb7ab_852x260.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">A side-by-side comparison of the subtraction and modulo based GCD algorithms for the input: a=130000, b=13. The numbers were obtained using perf stat</figcaption></figure></div><p>The subtraction-based algorithm executes 9,999 steps, whereas the modulo-based algorithm converges in a single step. Despite taking nearly 10,000 steps, the subtraction-based algorithm finishes 1 millisecond faster.</p><p>If we apply the Iron Law lens, we can see that the subtraction-based algorithm executed a slightly higher number of instructions, but it also had a slightly better IPC, which tipped the performance in its favor.</p><p>This isn&#8217;t to say algorithmic complexity doesn&#8217;t matter; at large scales, it absolutely does. But many workloads never hit those scales, and the implementation needs to take that into account. For example, many sorting routines use quicksort for large arrays, but switch to insertion sort for smaller sizes (e.g. fewer than 5 elements) because it&#8217;s simpler and faster in that regime. A similar strategy can be employed here. For values with a smaller gap between them, the subtraction-based algorithm can be preferred, while for values with a larger difference, the modulo-based algorithm can be used.</p><p>An alternative to switching between algorithms is to use an implementation that&#8217;s inherently hardware-friendly. In 1967, Stein designed such an algorithm. It takes advantage of the binary representation of integers and relies on bit shift operations, which are very fast at the hardware level. 
Let&#8217;s first understand how it works before comparing performance.</p><h2>Stein&#8217;s Binary Algorithm for GCD</h2><p>Stein designed this algorithm based on certain observations about GCD computation. These are as follows:</p><ol><li><p><code>gcd(a, b) = gcd(b, a)</code></p></li><li><p><code>gcd(0, b) = b</code>, and <code>gcd(a, 0) = a</code></p></li><li><p><code>gcd(2a, 2b) = 2 * gcd(a, b)</code>, i.e., if a and b are both even, we can compute the GCD of their halves and then multiply the result by 2.</p></li><li><p><code>gcd(a, 2b) = gcd(a, b)</code> when a is odd, i.e., since 2 does not divide a, it cannot be a common divisor and the factor of 2 can be dropped.</p></li><li><p><code>gcd(a, b) = gcd(a, b - a)</code> when a and b are odd and <code>a &lt; b</code>.</p></li></ol><p>These observations lead to a recursive algorithm that reduces the inputs until <code>b == 0</code>. However, unlike the modulo-based algorithm, this algorithm can be highly optimized for real hardware. For example, most implementations use the following optimization tricks.</p><ul><li><p>Instead of recursion, iteration is preferred.</p></li><li><p>Every step of the loop has to reduce a and b to odd values. This requires repeated division by 2. Division by 2 can be done with a right shift, which is much faster than a division instruction.</p></li><li><p>The algorithm needs to check if the numbers are odd or not. Dividing by 2 and checking the remainder is not efficient because division is slow and introduces branches. A more efficient trick is to always ensure that the least significant bit (LSB) of these numbers is 1. This can be done by counting the number of trailing zero bits in the number, and right shifting by that many bits. 
Most processors have a dedicated instruction to count the number of trailing zero bits, so this is extremely cheap to do.</p></li></ul><p>The following code block shows a C implementation of this algorithm (it uses the GCC builtin <code>__builtin_ctzl</code> to count the number of trailing zero bits). </p><pre><code>long gcd(long a, long b)
{
    // base conditions
    if (a == 0)
        return b;
    if (b == 0)
        return a;

    // gcd(2^i * a, 2^j * b) = 2^k * gcd(a, b)
    // where k = min(i, j)
    int k = __builtin_ctzl(a | b);

    // make a odd
    a &gt;&gt;= __builtin_ctzl(a);
    while (b != 0) {
        // make b odd
        b &gt;&gt;= __builtin_ctzl(b);
        
        // ensure b &gt;= a
        if (a &gt; b) {
            long temp = a;
            a = b;
            b = temp;
        }
        // gcd(a, b) = gcd(a, (b - a) / 2) when a and b are both odd
        // the division by 2 happens in the next iteration
        b = b - a;
    }
    // multiply 2^k back into the final gcd value
    return a &lt;&lt; k;
}
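
// Worked trace (illustrative, not part of the original listing) for gcd(84, 18):
//   k = ctz(84 | 18) = ctz(86) = 1
//   a = 84 &gt;&gt; ctz(84) = 84 &gt;&gt; 2 = 21
//   pass 1: b = 18 &gt;&gt; 1 = 9;  a &gt; b, swap: a = 9, b = 21;  b = 21 - 9 = 12
//   pass 2: b = 12 &gt;&gt; 2 = 3;  a &gt; b, swap: a = 3, b = 9;   b = 9 - 3 = 6
//   pass 3: b = 6 &gt;&gt; 1 = 3;   a == b, no swap;             b = 3 - 3 = 0
//   result: a &lt;&lt; k = 3 &lt;&lt; 1 = 6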
</code></pre><p>At each step of the iteration, the algorithm converges by dividing b by 2, so the worst-case time complexity is <code>O(log_2(max(a, b)))</code>, which is the same as that of the modulo-based algorithm. But this algorithm has the advantage that it uses efficient hardware instructions. This means that the algorithm executes fewer instructions while maintaining a higher IPC, which is a win over the modulo algorithm.</p><p>With the theory and hardware considerations in place, let&#8217;s now compare all three algorithms side by side.</p><h2>Benchmark #3: Comparing All Three Algorithms</h2><p>First, let&#8217;s revisit the large-input benchmark: <code>a=1,000,000,000</code> and <code>b=9,223,372,036,854,775,503</code>. The following table summarizes the key findings. </p><p>We see that the binary algorithm performs just as well as the modulo-based algorithm even though it takes a few extra steps to converge.</p><p>We can also analyse this using the Iron Law.</p><ul><li><p>The subtraction algorithm executes 92 billion instructions, which is vastly higher than the ~1 million executed by the other two.</p></li><li><p>The IPC of the subtraction algorithm is very high, but the high instruction count dominates, so it still takes a significant amount of time to finish.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7w09!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb41dee7-6bbd-4143-90cd-fbf906dbccf9_1238x260.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7w09!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb41dee7-6bbd-4143-90cd-fbf906dbccf9_1238x260.png 424w, 
https://substackcdn.com/image/fetch/$s_!7w09!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb41dee7-6bbd-4143-90cd-fbf906dbccf9_1238x260.png 848w, https://substackcdn.com/image/fetch/$s_!7w09!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb41dee7-6bbd-4143-90cd-fbf906dbccf9_1238x260.png 1272w, https://substackcdn.com/image/fetch/$s_!7w09!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb41dee7-6bbd-4143-90cd-fbf906dbccf9_1238x260.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7w09!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb41dee7-6bbd-4143-90cd-fbf906dbccf9_1238x260.png" width="1200" height="252.01938610662359" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fb41dee7-6bbd-4143-90cd-fbf906dbccf9_1238x260.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:260,&quot;width&quot;:1238,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:42348,&quot;alt&quot;:&quot;A side-by-side comparison of the performance of the three GCD algorithms for the input: a=1,000,000,000 and b=9,223,372,036,854,775,503&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/169307156?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb41dee7-6bbd-4143-90cd-fbf906dbccf9_1238x260.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="A side-by-side comparison of the 
performance of the three GCD algorithms for the input: a=1,000,000,000 and b=9,223,372,036,854,775,503" title="A side-by-side comparison of the performance of the three GCD algorithms for the input: a=1,000,000,000 and b=9,223,372,036,854,775,503" srcset="https://substackcdn.com/image/fetch/$s_!7w09!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb41dee7-6bbd-4143-90cd-fbf906dbccf9_1238x260.png 424w, https://substackcdn.com/image/fetch/$s_!7w09!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb41dee7-6bbd-4143-90cd-fbf906dbccf9_1238x260.png 848w, https://substackcdn.com/image/fetch/$s_!7w09!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb41dee7-6bbd-4143-90cd-fbf906dbccf9_1238x260.png 1272w, https://substackcdn.com/image/fetch/$s_!7w09!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb41dee7-6bbd-4143-90cd-fbf906dbccf9_1238x260.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">A side-by-side comparison of the performance of the three GCD algorithms for the input: <code>a=1,000,000,000</code> and <code>b=9,223,372,036,854,775,503</code></figcaption></figure></div><p>Next, let&#8217;s compare the performance for the second input, where <code>a=130000</code> and <code>b=13</code>. The following table shows the numbers. In this case, we see that the binary algorithm matches the performance of the subtraction-based algorithm. This happens because it not only executes fewer instructions but also uses efficient instructions, leading to a higher IPC. It strikes the ideal balance from the Iron Law&#8217;s perspective. 
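</p><p>To reproduce step counts like the ones quoted in this article, a small harness along the following lines can help. It is a sketch under assumptions: the names <code>gcd_sub</code>, <code>gcd_mod</code>, <code>gcd_bin</code> and the <code>steps</code> out-parameter are illustrative, and the subtraction and modulo versions are plain reconstructions of Euclid&#8217;s algorithm that may differ in detail from the listings shown earlier. It only counts loop iterations; timing and IPC still have to come from a tool such as <code>perf stat</code>.</p><pre><code>#include &lt;assert.h&gt;

// Subtraction-based Euclid. Requires a &gt; 0 and b &gt; 0,
// otherwise the loop never terminates.
long gcd_sub(long a, long b, long *steps)
{
    assert(a &gt; 0 &amp;&amp; b &gt; 0);
    *steps = 0;
    while (a != b) {
        if (a &gt; b)
            a -= b;
        else
            b -= a;
        ++*steps;
    }
    return a;
}

// Modulo-based Euclid. Requires a &gt;= 0 and b &gt;= 0.
long gcd_mod(long a, long b, long *steps)
{
    assert(a &gt;= 0 &amp;&amp; b &gt;= 0);
    *steps = 0;
    while (b != 0) {
        long r = a % b;
        a = b;
        b = r;
        ++*steps;
    }
    return a;
}

// Binary (Stein) GCD, same structure as the listing above,
// with a counter for loop passes. Requires a &gt;= 0 and b &gt;= 0.
long gcd_bin(long a, long b, long *steps)
{
    assert(a &gt;= 0 &amp;&amp; b &gt;= 0);
    *steps = 0;
    if (a == 0)
        return b;
    if (b == 0)
        return a;
    int k = __builtin_ctzl((unsigned long)(a | b));
    a &gt;&gt;= __builtin_ctzl((unsigned long)a);
    while (b != 0) {
        b &gt;&gt;= __builtin_ctzl((unsigned long)b);
        if (a &gt; b) {
            long t = a;
            a = b;
            b = t;
        }
        b -= a;
        ++*steps;
    }
    return a &lt;&lt; k;
}
</code></pre><p>With this sketch, <code>gcd_sub(130000, 13, &amp;n)</code> takes 9,999 iterations and <code>gcd_mod(130000, 13, &amp;n)</code> takes one, matching the step counts reported for this input in Benchmark #2. 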
</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Qvd4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad6b633-00ef-473a-b64a-6622c8346475_1109x260.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Qvd4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad6b633-00ef-473a-b64a-6622c8346475_1109x260.png 424w, https://substackcdn.com/image/fetch/$s_!Qvd4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad6b633-00ef-473a-b64a-6622c8346475_1109x260.png 848w, https://substackcdn.com/image/fetch/$s_!Qvd4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad6b633-00ef-473a-b64a-6622c8346475_1109x260.png 1272w, https://substackcdn.com/image/fetch/$s_!Qvd4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad6b633-00ef-473a-b64a-6622c8346475_1109x260.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Qvd4!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad6b633-00ef-473a-b64a-6622c8346475_1109x260.png" width="1200" height="281.33453561767357" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ad6b633-00ef-473a-b64a-6622c8346475_1109x260.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:260,&quot;width&quot;:1109,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:36680,&quot;alt&quot;:&quot;A side-by-side 
comparison of the performance of the three GCD algorithms for the input: a=130000 and b=13&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/169307156?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad6b633-00ef-473a-b64a-6622c8346475_1109x260.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="A side-by-side comparison of the performance of the three GCD algorithms for the input: a=130000 and b=13" title="A side-by-side comparison of the performance of the three GCD algorithms for the input: a=130000 and b=13" srcset="https://substackcdn.com/image/fetch/$s_!Qvd4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad6b633-00ef-473a-b64a-6622c8346475_1109x260.png 424w, https://substackcdn.com/image/fetch/$s_!Qvd4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad6b633-00ef-473a-b64a-6622c8346475_1109x260.png 848w, https://substackcdn.com/image/fetch/$s_!Qvd4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad6b633-00ef-473a-b64a-6622c8346475_1109x260.png 1272w, https://substackcdn.com/image/fetch/$s_!Qvd4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad6b633-00ef-473a-b64a-6622c8346475_1109x260.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">A side-by-side comparison of the performance of the three GCD algorithms for the input: a=130000 and b=13</figcaption></figure></div><p>But comparing the performance on just two inputs is not 
enough. The following table shows the performance of these algorithms on a benchmark where they compute the GCD for all unique combinations of values in the range [1, 100000).</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Osol!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc8b34c-74ea-419e-a2aa-5ef042e3ca32_1379x284.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Osol!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc8b34c-74ea-419e-a2aa-5ef042e3ca32_1379x284.png 424w, https://substackcdn.com/image/fetch/$s_!Osol!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc8b34c-74ea-419e-a2aa-5ef042e3ca32_1379x284.png 848w, https://substackcdn.com/image/fetch/$s_!Osol!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc8b34c-74ea-419e-a2aa-5ef042e3ca32_1379x284.png 1272w, https://substackcdn.com/image/fetch/$s_!Osol!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc8b34c-74ea-419e-a2aa-5ef042e3ca32_1379x284.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Osol!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc8b34c-74ea-419e-a2aa-5ef042e3ca32_1379x284.png" width="1200" height="247.13560551124002" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7bc8b34c-74ea-419e-a2aa-5ef042e3ca32_1379x284.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:284,&quot;width&quot;:1379,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:65059,&quot;alt&quot;:&quot;Results of a larger benchmark that ran the three algorithms to compute the GCD of all combination of integers in the range [1, 100000) at increments of 5.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/169307156?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc8b34c-74ea-419e-a2aa-5ef042e3ca32_1379x284.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="Results of a larger benchmark that ran the three algorithms to compute the GCD of all combination of integers in the range [1, 100000) at increments of 5." title="Results of a larger benchmark that ran the three algorithms to compute the GCD of all combination of integers in the range [1, 100000) at increments of 5." 
srcset="https://substackcdn.com/image/fetch/$s_!Osol!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc8b34c-74ea-419e-a2aa-5ef042e3ca32_1379x284.png 424w, https://substackcdn.com/image/fetch/$s_!Osol!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc8b34c-74ea-419e-a2aa-5ef042e3ca32_1379x284.png 848w, https://substackcdn.com/image/fetch/$s_!Osol!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc8b34c-74ea-419e-a2aa-5ef042e3ca32_1379x284.png 1272w, https://substackcdn.com/image/fetch/$s_!Osol!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc8b34c-74ea-419e-a2aa-5ef042e3ca32_1379x284.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Results of a larger benchmark that ran the three algorithms to compute the GCD of all combinations of integers in the range [1, 100000) at increments of 5.</figcaption></figure></div><p>Let&#8217;s analyse this for a moment:</p><ul><li><p>The binary algorithm is the fastest, while the modulo-based algorithm is the slowest despite having the same time complexity as the binary algorithm. This highlights that a superior time complexity is not the only thing that determines performance; how well the implementation maps to the hardware is also crucial.</p></li><li><p>The modulo-based algorithm executed <strong>13 billion instructions</strong> (the lowest) while the subtraction-based algorithm executed <strong>97 billion instructions</strong> (the highest). Yet, the subtraction-based algorithm finished 1.5 seconds earlier. 
This highlights how much cheaper integer addition and subtraction are than division on the processor.</p></li><li><p>The subtraction-based algorithm had an <strong>IPC of 1.16</strong> (the highest), while the modulo-based algorithm had an <strong>IPC of 0.15</strong> (the lowest).</p></li><li><p>The binary algorithm doesn&#8217;t win on instruction count or IPC. But it wins on overall execution time because, from the Iron Law&#8217;s point of view, it strikes the right balance between the two factors. In fact, if you benchmark these algorithms on a wide range of values, the binary algorithm performs consistently, while the performance of the other two algorithms can vary depending on the inputs.</p></li></ul><div><hr></div><h2>Conclusion</h2><p>Algorithmic time complexity is important, but real-world performance also depends on how well an algorithm maps to the underlying hardware. Doing less work doesn&#8217;t help if that work is inefficient or poorly suited to the CPU. The fastest implementations are those that align with the strengths of the hardware, such as low-latency instructions, high IPC, and predictable execution.</p>
<div><hr></div><h2>Code</h2><p>All the code used in the analysis behind this article can be found in the GitHub repo linked below.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://github.com/abhinav-upadhyay/gcd_perf&quot;,&quot;text&quot;:&quot;Check out the code&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://github.com/abhinav-upadhyay/gcd_perf"><span>Check out the code</span></a></p><div><hr></div>]]></content:encoded></item><item><title><![CDATA[x86 Assembly Exercise #1: Toy kill Program (Solution)]]></title><description><![CDATA[A step-by-step walkthrough of the toy kill program using raw Linux syscalls.]]></description><link>https://blog.codingconfessions.com/p/x86-assembly-exercise-1</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/x86-assembly-exercise-1</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Sat, 19 Jul 2025 18:54:05 GMT</pubDate><enclosure 
url="https://api.substack.com/feed/podcast/168730047/c2c04db236265eeab258cebd5c439a25.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>This is a short video as part of our series on x86-64 assembly. If you have not been following the series, you can start with the <a href="https://blog.codingconfessions.com/p/a-programmers-guide-to-x86-64-assembly">series overview</a>.</p><p>In this video post we will be discussing the solution to the homework exercise I gave at the end of the post on <a href="https://blog.codingconfessions.com/p/making-system-calls-in-x86-64-assembly">system calls</a>. The objective of the exercise was as follows:</p><ul><li><p>Write a toy implementation of the kill command with a few simplifications</p><ul><li><p>Hard code the process id to the pid of any running process on your system</p></li><li><p>Hard code the signal number to 9 (for SIGKILL)</p></li><li><p>Exit the program with the return value of the kill system call</p></li></ul></li></ul><p>If you are yet to try out this exercise for yourself, then I highly recommend that you do it on your own and then come back to this video to verify your solution. </p><p>My aim with such exercises is to give you a taste of systems programming along with teaching assembly. With this combined knowledge of assembly and how things work under the hood, you will be well placed to tackle serious projects in higher-level languages such as C, Rust, Go, Java, etc. 
</p><p>As always, feel free to comment here or reach out to me by email if you have any questions.</p>
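<p>For readers who prefer text to video, the three requirements listed above can be sketched roughly as follows. This is a minimal sketch and not necessarily the solution shown in the video. It assumes x86-64 Linux, where <code>kill</code> is system call number 62 and <code>exit</code> is number 60, and the pid <code>12345</code> is a made-up placeholder that you would replace with the pid of a process you are willing to kill.</p><pre><code>.text
.globl _start

_start:
    # kill(pid, sig): syscall number goes in rax, arguments in rdi and rsi
    movq $62, %rax        # 62 = kill on x86-64 Linux
    movq $12345, %rdi     # hypothetical pid, replace with a real one
    movq $9, %rsi         # 9 = SIGKILL
    syscall

    # exit(status): use the return value of kill (left in rax) as the status
    movq %rax, %rdi
    movq $60, %rax        # 60 = exit
    syscall</code></pre><p>The kernel returns 0 in <code>rax</code> on success and a negative error number on failure. Because only the low 8 bits of the exit status are reported to the shell, a failure such as "no such process" (error number 3) shows up as 253 when you run <code>echo $?</code>.</p>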
   ]]></content:encoded></item><item><title><![CDATA[Understanding Registers and Data Movement in x86-64 Assembly]]></title><description><![CDATA[A hands-on guide to general-purpose registers and data movement in x86-64]]></description><link>https://blog.codingconfessions.com/p/x86-registers</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/x86-registers</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Wed, 16 Jul 2025 12:19:08 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!1LJk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37357ac-3781-419c-b151-023edbe3ad1f_1536x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="pullquote"><p><strong>&#8220;In the beginning, there was a word. Then came the doubleword, and finally the quadword.&#8221;</strong></p></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1LJk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37357ac-3781-419c-b151-023edbe3ad1f_1536x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1LJk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37357ac-3781-419c-b151-023edbe3ad1f_1536x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!1LJk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37357ac-3781-419c-b151-023edbe3ad1f_1536x1024.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!1LJk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37357ac-3781-419c-b151-023edbe3ad1f_1536x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!1LJk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37357ac-3781-419c-b151-023edbe3ad1f_1536x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1LJk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37357ac-3781-419c-b151-023edbe3ad1f_1536x1024.jpeg" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e37357ac-3781-419c-b151-023edbe3ad1f_1536x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:309300,&quot;alt&quot;:&quot;Registers in x86-64&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/161886060?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37357ac-3781-419c-b151-023edbe3ad1f_1536x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Registers in x86-64" title="Registers in x86-64" srcset="https://substackcdn.com/image/fetch/$s_!1LJk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37357ac-3781-419c-b151-023edbe3ad1f_1536x1024.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!1LJk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37357ac-3781-419c-b151-023edbe3ad1f_1536x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!1LJk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37357ac-3781-419c-b151-023edbe3ad1f_1536x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!1LJk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37357ac-3781-419c-b151-023edbe3ad1f_1536x1024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Registers in x86-64</figcaption></figure></div><p><em>This article is part of our series on x86-64 assembly. So far we have learned to write simple programs that can move some data around and invoke system calls. For the complete list of articles published so far in this series, check out the <a href="https://blog.codingconfessions.com/p/a-programmers-guide-to-x86-64-assembly">series overview</a>.</em></p><ol><li><p><strong><a href="https://blog.codingconfessions.com/p/seeing-the-matrix">Understanding Computer Organization from First Principles</a></strong><br><em>Bits, memory, and the logic behind modern computing. A gentle dive into the foundations.</em></p></li><li><p><strong><a href="https://blog.codingconfessions.com/p/binary-arithmetic-and-bitwise-operations">Binary Arithmetic and Bitwise Operations for Systems Programming</a></strong><br><em>Signed numbers, two's complement, masking tricks, and bit-level manipulations that matter.</em></p></li><li><p><strong><a href="https://blog.codingconfessions.com/p/the-system-level-foundation-of-assembly">The System-Level Foundation of Assembly</a></strong><br><em>How your code goes from </em><code>main()</code><em> to a running process, and where assembly fits in.</em></p></li><li><p><strong><a href="https://blog.codingconfessions.com/p/building-and-breaking-your-first">Building (and Breaking) Your First X86 Assembly Program</a></strong><br><em>A minimal working program from scratch, with no runtime or C library. 
Learn by breaking it apart.</em></p></li><li><p><strong><a href="https://blog.codingconfessions.com/p/debugging-x86-64-assembly-with-gdb">Debugging X86-64 Assembly with GDB</a></strong><br><em>Hands-on debugging walkthroughs for inspecting registers, memory, and control flow.</em></p></li><li><p><strong><a href="https://blog.codingconfessions.com/p/making-system-calls-in-x86-64-assembly">Making System Calls in x86-64 Assembly</a></strong><br><em>How to interact with the operating system directly using syscalls without a C runtime.</em></p></li></ol><p>This complete series is exclusive to paid subscribers.</p><p>I&#8217;m also publishing this in the form of an ebook (PDF). If you don&#8217;t wish to upgrade to a subscription, you can purchase the PDF using the following link. 
If you are a paid subscriber, you can get it at a discount (monthly subscribers: 20%, annual subscribers: 50%); please email me for the discounted link.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://codingconfessions.gumroad.com/l/ychdk&quot;,&quot;text&quot;:&quot;Purchase Ebook&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://codingconfessions.gumroad.com/l/ychdk"><span>Purchase Ebook</span></a></p><div><hr></div><h2>Introduction</h2><p>Now that we've written and debugged a few x86-64 assembly programs, it's time to take a closer look at one of the most fundamental pieces of the architecture: the general-purpose registers.</p><p>Rather than throwing a table of names and sizes at you, we'll build up a mental model of how these registers evolved, starting from the 8086 and leading up to modern 64-bit hardware. That historical context makes it much easier to understand the naming conventions and relationships, so you're not constantly wondering where things like <code>sil</code> or <code>r8d</code> came from.</p><p>The article also includes hands-on exercises to help you understand how values move between registers of different sizes, and to develop an intuition for how partial registers behave. Along the way, we&#8217;ll also cover some of the edge cases and architectural quirks. These often overwhelm beginners, but I&#8217;ve tried to present them in the right context, so they&#8217;re easier to understand and less likely to trip you up.</p><div><hr></div><h2>Registers in the 16-bit Era</h2><p>The x86 architecture began life with the 8086 processor, which was a 16-bit machine. 
This meant that it had 16-bit wide registers, and its instructions could operate on values up to 16 bits in size.</p><p>The general-purpose registers were named after the first four letters of the alphabet: <code>ax</code>, <code>bx</code>, <code>cx</code>, and <code>dx</code>.</p><h3>8-bit Register Halves</h3><p>While these registers could work with 16-bit values, there was also a need to handle 8-bit data. Using bitwise masks to access just the higher or lower 8 bits would have been cumbersome and inefficient, requiring extra instructions. To solve this, the 8086 architecture introduced alternate names to refer directly to the upper and lower 8-bit halves of the 16-bit registers. </p><p>The naming was logical: replace the "<code>x</code>" in the 16-bit register name with "<code>h</code>" for the high byte or "<code>l</code>" for the low byte. For example, <code>ah</code> refers to the high 8 bits of <code>ax</code>, and <code>al</code> refers to the low 8 bits.</p><p>The following diagram shows the full set of general-purpose registers in the 8086, including how the 8-bit halves map onto the 16-bit registers:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DN7L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e058f0-518e-4f71-9db0-5c5d13a6d935_430x571.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DN7L!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e058f0-518e-4f71-9db0-5c5d13a6d935_430x571.png 424w, 
https://substackcdn.com/image/fetch/$s_!DN7L!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e058f0-518e-4f71-9db0-5c5d13a6d935_430x571.png 848w, https://substackcdn.com/image/fetch/$s_!DN7L!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e058f0-518e-4f71-9db0-5c5d13a6d935_430x571.png 1272w, https://substackcdn.com/image/fetch/$s_!DN7L!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e058f0-518e-4f71-9db0-5c5d13a6d935_430x571.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DN7L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e058f0-518e-4f71-9db0-5c5d13a6d935_430x571.png" width="430" height="571" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/50e058f0-518e-4f71-9db0-5c5d13a6d935_430x571.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:571,&quot;width&quot;:430,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:18788,&quot;alt&quot;:&quot;The breakdown of 16-bit registers and their 8-bit halves in the 8086 processor&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/161886060?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e058f0-518e-4f71-9db0-5c5d13a6d935_430x571.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The breakdown of 16-bit registers and their 8-bit halves in the 8086 processor" title="The breakdown of 16-bit registers and their 8-bit halves 
in the 8086 processor" srcset="https://substackcdn.com/image/fetch/$s_!DN7L!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e058f0-518e-4f71-9db0-5c5d13a6d935_430x571.png 424w, https://substackcdn.com/image/fetch/$s_!DN7L!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e058f0-518e-4f71-9db0-5c5d13a6d935_430x571.png 848w, https://substackcdn.com/image/fetch/$s_!DN7L!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e058f0-518e-4f71-9db0-5c5d13a6d935_430x571.png 1272w, https://substackcdn.com/image/fetch/$s_!DN7L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e058f0-518e-4f71-9db0-5c5d13a6d935_430x571.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The breakdown of 16-bit registers and their 8-bit halves in the 8086 processor</figcaption></figure></div><h3>Word Size and Instruction Suffixes</h3><p>If you remember, when we wrote our first x86-64 assembly program, we wrote the following instruction:</p><pre><code><code>movq $32, %rdi</code></code></pre><p>Here, <code>mov</code> is the instruction, and the <code>q</code> suffix stands for "<em>quadword</em>", which in x86-64 means 64 bits.</p><p>In the AT&amp;T syntax used in this series, instructions carry suffixes that indicate the operand size: 8-bit, 16-bit, 32-bit, or 64-bit. These suffixes evolved along with the architecture, and we'll explore them as we move from 16-bit to 64-bit.</p><p>If you guessed that a word must be 16 bits because a quadword is 64 bits, you're right. The 8086 was a 16-bit processor, and as a result its word size was also 16 bits. In computer architecture, the word size is the number of bits of data that the processor can handle in a single operation. x86 has kept this 16-bit meaning of "word" ever since, which is why 64 bits is called a quadword. So, the assembly instructions for the 8086 used the suffix "<code>w</code>" for 16-bit values.</p><h3>Hands-on Exercise: Working with 16-bit Registers</h3><p>Here&#8217;s an example that writes two 16-bit values into <code>ax</code> and <code>bx</code>, computes their difference, and exits.</p><pre><code><code>.text

.globl _start
_start:
    # write two 16-bit values into ax and bx
    movw $100, %ax
    movw $58, %bx

    # compute the difference: ax = ax - bx (100 - 58 = 42)
    subw %bx, %ax

    # exit with status code: 0
    movq $60, %rax
    # xoring rdi with itself zeroes it
    xorq %rdi, %rdi
    syscall</code></code></pre><p>Try running this inside <code>gdb</code>, and observe the values of the registers <code>ax</code> and <code>bx</code> after each instruction. You can use the following commands to do this:</p><pre><code>p (short) $ax 
p (short) $bx</code></pre><blockquote><p><strong>Note About the </strong><code>xor</code><strong> Instruction</strong>: In the above program, <code>xorq %rdi, %rdi</code> zeroes out the <code>rdi</code> register. This is a common and efficient trick: XOR-ing a register with itself always results in zero.</p></blockquote><h3>Hands-on Exercise: Working with 8-bit Registers</h3><p>Let&#8217;s run a small program that helps you visualize how the <code>ah</code> and <code>al</code> 8-bit halves relate to the full 16-bit <code>ax</code> register.</p><pre><code><code>.text
.globl _start

_start:
    # write a 16-bit value 0x1234 into ax
    movw $0x1234, %ax

    # copy the high 8 bits of ax into bl
    movb %ah, %bl

    # copy the low 8 bits of ax into ch
    movb %al, %ch

    # exit
    movq $60, %rax
    xorq %rdi, %rdi
    syscall
</code></code></pre><p>Try this in <code>gdb</code>, and inspect the values of <code>%ax</code>, <code>%bl</code>, and <code>%ch</code> after each instruction. You should see:</p><ul><li><p><code>%ax</code> contains <code>0x1234</code></p></li><li><p><code>%ah</code> (upper byte of <code>ax</code>) is <code>0x12</code> &#8594; copied to <code>%bl</code></p></li><li><p><code>%al</code> (lower byte of <code>ax</code>) is <code>0x34</code> &#8594; copied to <code>%ch</code></p></li></ul><p>You can use the following commands to inspect the values of these registers (the <code>/x</code> format prints them in hexadecimal, matching the values listed above):</p><pre><code>p/x $ax
p/x $bl
p/x $ch</code></pre><div><hr></div><h2>Evolution to x86-32 Architecture</h2>
   ]]></content:encoded></item><item><title><![CDATA[A Programmer’s Guide to x86-64 Assembly (Series Overview)]]></title><description><![CDATA[Welcome to my ongoing series on x86-64 assembly programming, designed for programmers who want to peel back the abstraction and understand how code really runs at the machine level.]]></description><link>https://blog.codingconfessions.com/p/a-programmers-guide-to-x86-64-assembly</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/a-programmers-guide-to-x86-64-assembly</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Wed, 16 Jul 2025 05:14:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!pFGm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a94b5af-aec2-4b17-b011-5c128c67be8d_1536x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pFGm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a94b5af-aec2-4b17-b011-5c128c67be8d_1536x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pFGm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a94b5af-aec2-4b17-b011-5c128c67be8d_1536x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!pFGm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a94b5af-aec2-4b17-b011-5c128c67be8d_1536x1024.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!pFGm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a94b5af-aec2-4b17-b011-5c128c67be8d_1536x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!pFGm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a94b5af-aec2-4b17-b011-5c128c67be8d_1536x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pFGm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a94b5af-aec2-4b17-b011-5c128c67be8d_1536x1024.jpeg" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a94b5af-aec2-4b17-b011-5c128c67be8d_1536x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:347141,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/168445561?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a94b5af-aec2-4b17-b011-5c128c67be8d_1536x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pFGm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a94b5af-aec2-4b17-b011-5c128c67be8d_1536x1024.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!pFGm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a94b5af-aec2-4b17-b011-5c128c67be8d_1536x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!pFGm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a94b5af-aec2-4b17-b011-5c128c67be8d_1536x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!pFGm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a94b5af-aec2-4b17-b011-5c128c67be8d_1536x1024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Welcome to my ongoing series on <strong>x86-64 assembly programming</strong>, designed for programmers who want to peel back the abstraction and understand how code really runs at the machine level.</p><p>Why should a software engineer care about assembly? Because understanding what's happening at the lowest level helps you write better code at every level. It sharpens your intuition about performance bottlenecks, compiler behavior, memory usage, and even security. Whether you're debugging a weird bug, chasing a perf regression, or just curious how high-level constructs boil down to machine instructions, assembly is the Rosetta Stone.</p><p>We start from first principles, covering bits, memory, and CPU instructions, and gradually build up the skills to read and write real-world assembly programs. Whether you're interested in systems programming, performance tuning, or just curious about what your compiler is really doing under the hood, this series is for you.</p><div><hr></div><h2>Published Posts</h2><ol><li><p><strong><a href="https://blog.codingconfessions.com/p/seeing-the-matrix">Understanding Computer Organization from First Principles</a></strong><br><em>Bits, memory, and the logic behind modern computing. 
A gentle dive into the foundations.</em></p></li><li><p><strong><a href="https://blog.codingconfessions.com/p/binary-arithmetic-and-bitwise-operations">Binary Arithmetic and Bitwise Operations for Systems Programming</a></strong><br><em>Signed numbers, two's complement, masking tricks, and bit-level manipulations that matter.</em></p></li><li><p><strong><a href="https://blog.codingconfessions.com/p/the-system-level-foundation-of-assembly">The System-Level Foundation of Assembly</a></strong><br><em>How your code goes from </em><code>main()</code><em> to a running process, and where assembly fits in.</em></p></li><li><p><strong><a href="https://blog.codingconfessions.com/p/building-and-breaking-your-first">Building (and Breaking) Your First X86 Assembly Program</a></strong><br><em>A minimal working program from scratch, with no runtime or C library. Learn by breaking it apart.</em></p></li><li><p><strong><a href="https://blog.codingconfessions.com/p/debugging-x86-64-assembly-with-gdb">Debugging X86-64 Assembly with GDB</a></strong><br><em>Hands-on debugging walkthroughs for inspecting registers, memory, and control flow.</em></p></li><li><p><strong><a href="https://blog.codingconfessions.com/p/making-system-calls-in-x86-64-assembly">Making System Calls in x86-64 Assembly</a></strong><br><em>How to interact with the operating system directly using syscalls without a C runtime.</em></p></li><li><p><strong><a href="https://blog.codingconfessions.com/p/x86-registers">Understanding Registers and Data Movement in x86-64 Assembly</a></strong></p><p><em>Systematic coverage of the general-purpose registers in the x86-64 architecture and how to move data between them.</em></p></li><li><p><strong><a href="https://blog.codingconfessions.com/p/x86-addressing-modes-part-1-immediate">x86 Addressing Modes, Part 1 &#8212; Immediate and Direct Access</a></strong></p><p><em>Learn about static data allocation and accessing memory using immediate and direct access modes. 
This sets up the foundation for the more advanced addressing modes in the upcoming articles. You will master these two addressing modes through hands-on exercises: for immediate addressing, you write your own implementation of the cat utility in x86 assembly, and for direct memory addressing, you write a benchmarking program.</em></p></li></ol><div><hr></div><h2>Upcoming Topics</h2><p>Here&#8217;s a peek at what&#8217;s planned for future posts (subject to change based on feedback and curiosity):</p><ul><li><p>Registers, stack, and calling conventions</p></li><li><p>Memory addressing and pointer arithmetic</p></li><li><p>Writing loops and conditionals in pure assembly</p></li><li><p>Implementing functions and recursion</p></li><li><p>A deeper dive into Linux syscalls (file I/O, process management, etc.)</p></li><li><p>Mini-project: writing a simple command-line utility</p></li><li><p>Capstone: building a minimal web server in assembly</p></li></ul><div><hr></div><p>You can subscribe to get new posts as they drop. I&#8217;m writing this series with care, making sure each part builds up your intuition as well as your skillset. 
Feel free to share, comment, or ask questions.</p>]]></content:encoded></item><item><title><![CDATA[Why This Python Performance Trick Doesn’t Matter Anymore]]></title><description><![CDATA[A deep dive into Python&#8217;s name resolution, bytecode, and how CPython 3.11 quietly made a popular optimization irrelevant.]]></description><link>https://blog.codingconfessions.com/p/old-python-performance-trick</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/old-python-performance-trick</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Sat, 28 Jun 2025 11:35:12 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5f4257f6-cdc9-41f7-a7ba-f5af35428aef_1536x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The trick to performance optimization is mechanical sympathy: writing code that makes it easier for the hardware to execute it efficiently. 
In the past, CPU microarchitectures evolved so quickly that an optimization might become obsolete in just a few years because the hardware had simply become better at running the same code.</p><p>The same idea applies when writing code in interpreted languages like Python. Sometimes you need to use tricks that help the language&#8217;s virtual machine (VM) run your code faster. But just like hardware improves, the Python VM and compiler also keep evolving. As a result, optimizations that once made a difference may no longer matter.</p><p>One such optimization trick in Python is to create a local alias for a function you&#8217;re calling repeatedly inside a hot loop. Here&#8217;s what that looks like:</p><pre><code><code># Benchmark 1: Calling built-in len directly
def test_builtin_global(lst: list):
    for _ in range(1_000_000):
        len(lst)

# Benchmark 2: Aliasing built-in len to a local variable
def test_builtin_local(lst: list):
    l = len
    for _ in range(1_000_000):
        l(lst)
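
# Added sketch (not part of the original benchmark setup): one possible
# way to time the two functions above, assuming only the standard-library
# timeit module. The helper name best_time is hypothetical, and absolute
# numbers will vary with your machine and CPython version.
import timeit

def best_time(func, arg, repeat: int = 5):
    # Time `repeat` separate calls of func(arg) and keep the fastest one,
    # since the minimum is the run least disturbed by other system activity.
    return min(timeit.repeat(lambda: func(arg), number=1, repeat=repeat))

# Example usage:
#   data = list(range(10))
#   print(best_time(test_builtin_global, data))
#   print(best_time(test_builtin_local, data))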
</code></code></pre><p>This trick works because of how Python resolves variable names. Creating a local alias replaces a global lookup with a local one, which is much faster in CPython. But is it still worth doing?</p><p>I benchmarked this code across recent Python releases, and the results suggest that the answer is: not really. So what changed?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yze-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b012e7-97b3-4ec2-83ef-5d0f036d41d4_1135x322.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yze-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b012e7-97b3-4ec2-83ef-5d0f036d41d4_1135x322.png 424w, https://substackcdn.com/image/fetch/$s_!yze-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b012e7-97b3-4ec2-83ef-5d0f036d41d4_1135x322.png 848w, https://substackcdn.com/image/fetch/$s_!yze-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b012e7-97b3-4ec2-83ef-5d0f036d41d4_1135x322.png 1272w, https://substackcdn.com/image/fetch/$s_!yze-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b012e7-97b3-4ec2-83ef-5d0f036d41d4_1135x322.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yze-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b012e7-97b3-4ec2-83ef-5d0f036d41d4_1135x322.png" width="1135" height="322" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49b012e7-97b3-4ec2-83ef-5d0f036d41d4_1135x322.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:322,&quot;width&quot;:1135,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:52258,&quot;alt&quot;:&quot;Performance of global vs local object access across recent CPython releases&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/166575181?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b012e7-97b3-4ec2-83ef-5d0f036d41d4_1135x322.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Performance of global vs local object access across recent CPython releases" title="Performance of global vs local object access across recent CPython releases" srcset="https://substackcdn.com/image/fetch/$s_!yze-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b012e7-97b3-4ec2-83ef-5d0f036d41d4_1135x322.png 424w, https://substackcdn.com/image/fetch/$s_!yze-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b012e7-97b3-4ec2-83ef-5d0f036d41d4_1135x322.png 848w, https://substackcdn.com/image/fetch/$s_!yze-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b012e7-97b3-4ec2-83ef-5d0f036d41d4_1135x322.png 1272w, https://substackcdn.com/image/fetch/$s_!yze-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49b012e7-97b3-4ec2-83ef-5d0f036d41d4_1135x322.png 1456w" sizes="100vw" 
fetchpriority="high"></picture></div></a><figcaption class="image-caption">Performance of global vs local object access across recent CPython releases</figcaption></figure></div><p>To answer that, we&#8217;ll need to dig into how Python resolves names during execution, and how that behavior has evolved in recent versions. 
In particular, we&#8217;ll explore:</p><ul><li><p>Why this trick worked in earlier versions of Python</p></li><li><p>What changed in recent CPython releases to make it mostly obsolete</p></li><li><p>Whether there are still edge cases where it helps</p></li></ul><div><hr></div><h2>How Python Resolves Local and Global Names</h2><p>To understand why this trick made a difference in performance, we need to look at how the Python interpreter resolves variable names, specifically, how it loads locally vs globally scoped objects.</p><p>Python uses a stack-based virtual machine. This means it evaluates expressions by pushing operands onto a stack and performing operations by popping those operands off. For example, to evaluate <code>a + b</code>, the interpreter pushes <code>a</code> and <code>b</code> onto the stack, pops them off, performs the addition, and then pushes the result back on.</p><p>Function calls work the same way. 
For a call like <code>len(lst)</code>, the interpreter pushes both the function object <code>len</code> and its argument <code>lst</code> onto the stack, then pops and uses them to execute the function.</p><p>But where does the interpreter find and load objects like <code>len</code> or <code>lst</code>?</p><p>The interpreter can look in three different places when resolving names:</p><ul><li><p><strong>Locals</strong>: A table of locally scoped variables, including function arguments. In CPython, this is implemented as an array (shared with the VM stack). The compiler emits the <code>LOAD_FAST</code> instruction with a precomputed index to retrieve values from this table, which makes local lookups very fast.</p></li><li><p><strong>Globals</strong>: A dictionary of global variables, including imported modules and functions. Accessing this requires a hash lookup using the variable&#8217;s name, which is slower than a local array access.</p></li><li><p><strong>Builtins</strong>: Functions like <code>len</code>, <code>min</code>, and <code>max</code>. These live in a separate dictionary, which is checked last, only if the name isn&#8217;t found in globals.</p></li></ul><p>With that understanding of how name resolution works in CPython, let&#8217;s now compare the disassembly of the two versions of our benchmark function.</p><p><em>For more comprehensive coverage of the CPython virtual machine, check out my article on its internals:</em></p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;bddbac26-ccae-4a4a-b672-e0784602edc0&quot;,&quot;caption&quot;:&quot;For every bytecode compiled language, the most interesting part of its implementation is its virtual machine (also referred to as the bytecode interpreter) where the bytecode execution takes place. Because this is such a crucial part of the language machinery, its implementation has to be highly performant. 
Even if you are not a compiler engineer, learning about such internal implementation can give you new performance tricks and insights that you may be able to use in other places of your job. And, if you are a compiler engineer then you should always look around how other languages are implemented to pickup implementation details that you may not be aware of.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The Design &amp; Implementation of the CPython Virtual Machine&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2024-08-31T14:35:14.115Z&quot;,&quot;cover_image&quot;:&quot;https://images.unsplash.com/photo-1504639725590-34d0984388bd?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxOXx8dmlydHVhbCUyMG1hY2hpbmV8ZW58MHx8fHwxNzI1MDI0MzE1fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/cpython-vm-internals&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:143567425,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:45,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code 
Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!lstI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><h2>Dissecting Unoptimized Python Bytecode</h2><p>Let&#8217;s take a look at what&#8217;s actually happening under the hood. We can use Python&#8217;s built-in <code>dis</code> module to view the bytecode generated by our functions. Below is the disassembly of the slower version, the one that calls <code>len</code> directly:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9aqh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7f42b7-8f5e-4762-8681-b7f6045483ac_1234x470.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9aqh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7f42b7-8f5e-4762-8681-b7f6045483ac_1234x470.png 424w, https://substackcdn.com/image/fetch/$s_!9aqh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7f42b7-8f5e-4762-8681-b7f6045483ac_1234x470.png 848w, https://substackcdn.com/image/fetch/$s_!9aqh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7f42b7-8f5e-4762-8681-b7f6045483ac_1234x470.png 1272w, 
https://substackcdn.com/image/fetch/$s_!9aqh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7f42b7-8f5e-4762-8681-b7f6045483ac_1234x470.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9aqh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7f42b7-8f5e-4762-8681-b7f6045483ac_1234x470.png" width="1234" height="470" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4e7f42b7-8f5e-4762-8681-b7f6045483ac_1234x470.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:470,&quot;width&quot;:1234,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The bytecode disassembly for the slow (unoptimized) version&quot;,&quot;title&quot;:&quot;The bytecode disassembly for the slow (unoptimized) version&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The bytecode disassembly for the slow (unoptimized) version" title="The bytecode disassembly for the slow (unoptimized) version" srcset="https://substackcdn.com/image/fetch/$s_!9aqh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7f42b7-8f5e-4762-8681-b7f6045483ac_1234x470.png 424w, https://substackcdn.com/image/fetch/$s_!9aqh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7f42b7-8f5e-4762-8681-b7f6045483ac_1234x470.png 848w, 
https://substackcdn.com/image/fetch/$s_!9aqh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7f42b7-8f5e-4762-8681-b7f6045483ac_1234x470.png 1272w, https://substackcdn.com/image/fetch/$s_!9aqh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7f42b7-8f5e-4762-8681-b7f6045483ac_1234x470.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The bytecode disassembly for the slow (unoptimized) version</figcaption></figure></div><p>Let&#8217;s break down 
what&#8217;s happening in those highlighted instructions:</p><ul><li><p><strong>LOAD_GLOBAL</strong>: This instruction looks up the name <code>len</code> in the global scope and pushes the resulting object onto the stack. In the disassembly, you&#8217;ll see something like <code>LOAD_GLOBAL 3 (NULL + len)</code>. That <code>3</code> is the argument passed to the instruction. Since CPython 3.11, its lowest bit is a flag that tells the interpreter to also push a <code>NULL</code> onto the stack (hence the <code>NULL + len</code>), and the remaining bits (<code>3 &gt;&gt; 1</code>, i.e., <code>1</code>) are an index into the <code>co_names</code> array, which is a tuple of all names used in the function for global or builtin lookups. So, <code>co_names[1]</code> gives <code>'len'</code>. The interpreter retrieves the string <code>'len'</code>, hashes it, and performs a dictionary lookup in <code>globals()</code>, falling back to <code>builtins</code> if needed. This multi-step lookup makes <code>LOAD_GLOBAL</code> more expensive than other name resolution instructions. (We will look at how <code>LOAD_GLOBAL</code> is implemented in CPython right after this.)</p></li><li><p><strong>LOAD_FAST</strong>: After loading the function to be called, the interpreter needs to push all the arguments. In this case, <code>len</code> takes only one argument, the list object, which is pushed using the <code>LOAD_FAST</code> instruction. It loads the <code>lst</code> object using a direct index into the array of local variables, so there&#8217;s no hashing or dictionary lookup involved. It&#8217;s just a simple array access, which makes it very fast.</p></li><li><p><strong>CALL</strong>: Next, the interpreter needs to perform the function call. This is done using the <code>CALL</code> instruction. The number after <code>CALL</code> tells the interpreter how many arguments are being passed. So, <code>CALL 1</code> means one argument is being supplied. To execute the call, the interpreter pops that many arguments from the stack, followed by the function object itself. 
It then calls the function with those arguments and pushes the return value back onto the stack.</p></li></ul><p>One of the costlier steps here is <code>LOAD_GLOBAL</code>, both in terms of what it does and how it&#8217;s implemented. We&#8217;ve already seen that it involves fetching a name from the <code>co_names</code> array, hashing it, and checking up to two dictionaries, first <code>globals()</code> and then the <code>builtins</code> dictionary, before it can push the result onto the stack. All of that makes it noticeably slower than a simple local access.</p><p>To understand just how much work it does behind the scenes, let&#8217;s now take a look at its actual implementation in CPython.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!asx2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e0d804-26f1-4fc7-aa97-006a64abfe83_1258x1140.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!asx2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e0d804-26f1-4fc7-aa97-006a64abfe83_1258x1140.png 424w, https://substackcdn.com/image/fetch/$s_!asx2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e0d804-26f1-4fc7-aa97-006a64abfe83_1258x1140.png 848w, https://substackcdn.com/image/fetch/$s_!asx2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e0d804-26f1-4fc7-aa97-006a64abfe83_1258x1140.png 1272w, 
https://substackcdn.com/image/fetch/$s_!asx2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e0d804-26f1-4fc7-aa97-006a64abfe83_1258x1140.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!asx2!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e0d804-26f1-4fc7-aa97-006a64abfe83_1258x1140.png" width="1200" height="1087.4403815580285" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/65e0d804-26f1-4fc7-aa97-006a64abfe83_1258x1140.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1140,&quot;width&quot;:1258,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The implementation of the LOAD_GLOBAL instruction in CPython from the file generated_cases.c.h&quot;,&quot;title&quot;:&quot;The implementation of the LOAD_GLOBAL instruction in CPython from the file generated_cases.c.h&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="The implementation of the LOAD_GLOBAL instruction in CPython from the file generated_cases.c.h" title="The implementation of the LOAD_GLOBAL instruction in CPython from the file generated_cases.c.h" srcset="https://substackcdn.com/image/fetch/$s_!asx2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e0d804-26f1-4fc7-aa97-006a64abfe83_1258x1140.png 424w, 
https://substackcdn.com/image/fetch/$s_!asx2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e0d804-26f1-4fc7-aa97-006a64abfe83_1258x1140.png 848w, https://substackcdn.com/image/fetch/$s_!asx2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e0d804-26f1-4fc7-aa97-006a64abfe83_1258x1140.png 1272w, https://substackcdn.com/image/fetch/$s_!asx2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e0d804-26f1-4fc7-aa97-006a64abfe83_1258x1140.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The implementation of the LOAD_GLOBAL instruction in CPython from the file generated_cases.c.h</figcaption></figure></div><p>The code is taken from the file <a href="https://github.com/python/cpython/blob/main/Python/generated_cases.c.h">generated_cases.c.h</a>, which contains all the opcode implementations. Let&#8217;s focus on the highlighted parts that I have numbered.</p><ol><li><p>The first highlighted block deals with instruction specialization. As we will see later, the default way of looking up globals is slow because the generic instruction does not know in advance which global symbol we are trying to load or which dictionary holds it. This information is only available to the interpreter at runtime. Instruction specialization caches this dynamic information and creates a specialized instruction, making future executions of the same code faster. We will circle back to this in a later section. Note that this optimization was not present before CPython 3.11.</p></li><li><p>The second highlighted block is where the actual global lookup happens. It&#8217;s broken into two parts, which I&#8217;ve marked with arrows labeled 3 and 4.</p></li><li><p>First, the interpreter needs to figure out which name it&#8217;s supposed to look up. The <code>LOAD_GLOBAL</code> instruction receives an argument (<code>oparg</code>). Its lowest bit is a flag, and the remaining bits (<code>oparg &gt;&gt; 1</code>) are an index into the <code>co_names</code> tuple. This is where all global and builtin names used in the function are stored. The interpreter calls the <code>GETITEM</code> macro to fetch the actual name (a string object) using this index.</p></li><li><p>Once the name is retrieved, the interpreter calls <code>_PyEval_LoadGlobalStackRef</code>. This function looks for the name in the <code>globals</code> dictionary first. 
If it&#8217;s not found there, it falls back to the <code>builtins</code> dictionary.</p></li></ol><p>Let&#8217;s zoom into this part and see the code for doing this globals and builtins lookup. <code>_PyEval_LoadGlobalStackRef</code> simply delegates to a function called <code>_PyDict_LoadGlobalStackRef</code>, defined in <code>dictobject.c</code>, so let&#8217;s directly look at its implementation (shown in the figure below).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!u2if!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79190af0-268b-4461-88cb-9f2202e7a656_1452x748.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!u2if!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79190af0-268b-4461-88cb-9f2202e7a656_1452x748.png 424w, https://substackcdn.com/image/fetch/$s_!u2if!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79190af0-268b-4461-88cb-9f2202e7a656_1452x748.png 848w, https://substackcdn.com/image/fetch/$s_!u2if!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79190af0-268b-4461-88cb-9f2202e7a656_1452x748.png 1272w, https://substackcdn.com/image/fetch/$s_!u2if!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79190af0-268b-4461-88cb-9f2202e7a656_1452x748.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!u2if!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79190af0-268b-4461-88cb-9f2202e7a656_1452x748.png" width="1452" height="748" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/79190af0-268b-4461-88cb-9f2202e7a656_1452x748.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f0cf2036-5cf5-4b61-ab42-7fcbbea47544_1452x748.png&quot;,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:748,&quot;width&quot;:1452,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The function where the actual global lookup is performed&quot;,&quot;title&quot;:&quot;The function where the actual global lookup is performed&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The function where the actual global lookup is performed" title="The function where the actual global lookup is performed" srcset="https://substackcdn.com/image/fetch/$s_!u2if!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79190af0-268b-4461-88cb-9f2202e7a656_1452x748.png 424w, https://substackcdn.com/image/fetch/$s_!u2if!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79190af0-268b-4461-88cb-9f2202e7a656_1452x748.png 848w, https://substackcdn.com/image/fetch/$s_!u2if!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79190af0-268b-4461-88cb-9f2202e7a656_1452x748.png 1272w, 
https://substackcdn.com/image/fetch/$s_!u2if!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79190af0-268b-4461-88cb-9f2202e7a656_1452x748.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The function where the actual global lookup is performed</figcaption></figure></div><p>Here&#8217;s what is happening in this code:</p><ol><li><p>First, the function computes the hash of the name which is being looked up. 
This hash determines the index into the dictionary&#8217;s internal hash table.</p></li><li><p>Next, the function checks the globals dictionary.</p></li><li><p>If the name isn&#8217;t found in <code>globals</code>, the function falls back to checking the <code>builtins</code> dictionary.</p></li></ol><p>From this entire discussion of global lookups in CPython, a few things are worth highlighting:</p><ul><li><p>The lookup requires the hash of the name. This means that when you repeatedly call a global function in a loop, the runtime needs that hash on every iteration. That said, string objects cache their hash after it is first computed, so the overhead isn&#8217;t as bad as it might seem.</p></li><li><p>Builtins are checked last. So even if you&#8217;re calling a builtin function, the runtime first performs a failed lookup in globals and only then checks builtins. In a hot loop, this extra lookup adds up.</p></li></ul><p>Next, we&#8217;ll dissect the disassembly of the code with the optimization in place.</p><div><hr></div><h2>Dissecting Optimized Python Bytecode</h2><p>Let&#8217;s see how using a local alias actually changes the bytecode, and why it makes the optimized version faster. 
The following figure shows the bytecode disassembly for this version:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!R0jY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b27ed92-30ec-44a3-9b5b-55c49d178bf0_1427x570.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!R0jY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b27ed92-30ec-44a3-9b5b-55c49d178bf0_1427x570.png 424w, https://substackcdn.com/image/fetch/$s_!R0jY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b27ed92-30ec-44a3-9b5b-55c49d178bf0_1427x570.png 848w, https://substackcdn.com/image/fetch/$s_!R0jY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b27ed92-30ec-44a3-9b5b-55c49d178bf0_1427x570.png 1272w, https://substackcdn.com/image/fetch/$s_!R0jY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b27ed92-30ec-44a3-9b5b-55c49d178bf0_1427x570.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!R0jY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b27ed92-30ec-44a3-9b5b-55c49d178bf0_1427x570.png" width="1427" height="570" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b27ed92-30ec-44a3-9b5b-55c49d178bf0_1427x570.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:570,&quot;width&quot;:1427,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The bytecode disassembly for the optimized Python code&quot;,&quot;title&quot;:&quot;The bytecode disassembly for the optimized Python code&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The bytecode disassembly for the optimized Python code" title="The bytecode disassembly for the optimized Python code" srcset="https://substackcdn.com/image/fetch/$s_!R0jY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b27ed92-30ec-44a3-9b5b-55c49d178bf0_1427x570.png 424w, https://substackcdn.com/image/fetch/$s_!R0jY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b27ed92-30ec-44a3-9b5b-55c49d178bf0_1427x570.png 848w, https://substackcdn.com/image/fetch/$s_!R0jY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b27ed92-30ec-44a3-9b5b-55c49d178bf0_1427x570.png 1272w, https://substackcdn.com/image/fetch/$s_!R0jY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b27ed92-30ec-44a3-9b5b-55c49d178bf0_1427x570.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The bytecode disassembly for the optimized version where we create a local alias for the len function</figcaption></figure></div><p>Let&#8217;s focus on the highlighted instructions that are responsible for the call to <code>l</code>, which is the alias we created for <code>len</code>. The key difference from the unoptimized version is that this one uses the <code>LOAD_FAST</code> instruction instead of <code>LOAD_GLOBAL</code> to load the function object onto the stack. 
So, let&#8217;s look at how <code>LOAD_FAST</code> is implemented in CPython (shown in the figure below).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4Vyq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7625657c-14f3-4ca5-bc5a-5e10741d1c41_1330x443.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4Vyq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7625657c-14f3-4ca5-bc5a-5e10741d1c41_1330x443.png 424w, https://substackcdn.com/image/fetch/$s_!4Vyq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7625657c-14f3-4ca5-bc5a-5e10741d1c41_1330x443.png 848w, https://substackcdn.com/image/fetch/$s_!4Vyq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7625657c-14f3-4ca5-bc5a-5e10741d1c41_1330x443.png 1272w, https://substackcdn.com/image/fetch/$s_!4Vyq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7625657c-14f3-4ca5-bc5a-5e10741d1c41_1330x443.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4Vyq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7625657c-14f3-4ca5-bc5a-5e10741d1c41_1330x443.png" width="1330" height="443" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7625657c-14f3-4ca5-bc5a-5e10741d1c41_1330x443.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d7104d2f-cb4c-40b7-b46b-852c05275bde_1330x443.png&quot;,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:443,&quot;width&quot;:1330,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The implementation of the LOAD_FAST instruction in CPython&quot;,&quot;title&quot;:&quot;The implementation of the LOAD_FAST instruction in CPython&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The implementation of the LOAD_FAST instruction in CPython" title="The implementation of the LOAD_FAST instruction in CPython" srcset="https://substackcdn.com/image/fetch/$s_!4Vyq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7625657c-14f3-4ca5-bc5a-5e10741d1c41_1330x443.png 424w, https://substackcdn.com/image/fetch/$s_!4Vyq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7625657c-14f3-4ca5-bc5a-5e10741d1c41_1330x443.png 848w, https://substackcdn.com/image/fetch/$s_!4Vyq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7625657c-14f3-4ca5-bc5a-5e10741d1c41_1330x443.png 1272w, https://substackcdn.com/image/fetch/$s_!4Vyq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7625657c-14f3-4ca5-bc5a-5e10741d1c41_1330x443.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The implementation of the LOAD_FAST instruction in CPython</figcaption></figure></div><p>You can see how short and tight this implementation is. It performs a simple lookup in the array of local variables, using the index passed as its argument. Unlike <code>LOAD_GLOBAL</code>, which involves multiple function calls and dictionary lookups, <code>LOAD_FAST</code> performs no lookup calls at all. It&#8217;s just a direct memory access, which makes it extremely fast.</p><p>By now, you should have a clear understanding of why this optimization trick works. 
By creating a local variable for the <code>len</code> builtin, we turned an expensive global lookup into a fast local lookup, which accounts for the performance difference.</p><p>But as we saw in the benchmark results, starting with CPython 3.11, this optimization no longer makes a meaningful difference in performance. So, what changed? Let&#8217;s look at that next.</p><div><hr></div><h2>Inside CPython's Instruction Specialization</h2><p>CPython 3.11 introduced a major optimization called the <a href="https://docs.python.org/3/whatsnew/3.11.html#whatsnew311-pep659">specializing adaptive interpreter</a>. It addresses one of the core performance challenges in dynamically typed languages. In such languages, bytecode instructions are type-agnostic, meaning they don&#8217;t know what types of objects they will operate on. For example, CPython has a generic instruction called <code>BINARY_OP</code>, which is used for all binary operations like <code>+</code>, <code>-</code>, <code>*</code>, and <code>/</code>. It works with all object types, including ints, strings, lists, and so on. Therefore, the interpreter has to first check the object types at runtime and then dispatch to the appropriate function.</p><p>So how does instruction specialization work? Once a bytecode instruction has executed enough times to be considered hot, the interpreter captures some of the runtime information about it, such as the types of the objects, the specific operation being performed, etc. Using that information, it replaces the slow generic instruction with a faster specialized instruction.</p><p>Thereafter, whenever the same instruction executes again, the interpreter runs the specialized version. Inside the specialized instructions, the interpreter always checks that the conditions for specialization still hold. 
If the conditions have changed, e.g., the types are no longer the same, then the interpreter deoptimizes and falls back to the slower instruction.</p><p>The <code>LOAD_GLOBAL</code> instruction is also a generic instruction. In this case, the interpreter has to do a lot of additional work, such as looking up the name of the symbol, computing the hash, and finally performing lookups in the globals and builtins dictionaries. But once the interpreter sees that you&#8217;re accessing a specific builtin, it specializes <code>LOAD_GLOBAL</code> into <code>LOAD_GLOBAL_BUILTIN</code>.</p><p>The <code>LOAD_GLOBAL_BUILTIN</code> instruction is optimized to go to the builtins dictionary directly, i.e., it skips the lookup in the globals dictionary. It also caches the index of the specific builtin we are trying to look up, which avoids the hash computation. The result is that it behaves almost like a <code>LOAD_FAST</code>, performing a fast array lookup instead of a costly dictionary access. The following figure shows its implementation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7UMH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17a2d2c7-b946-427c-af11-b2e71f6cc322_1784x1620.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7UMH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17a2d2c7-b946-427c-af11-b2e71f6cc322_1784x1620.png 424w, https://substackcdn.com/image/fetch/$s_!7UMH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17a2d2c7-b946-427c-af11-b2e71f6cc322_1784x1620.png 848w, 
https://substackcdn.com/image/fetch/$s_!7UMH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17a2d2c7-b946-427c-af11-b2e71f6cc322_1784x1620.png 1272w, https://substackcdn.com/image/fetch/$s_!7UMH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17a2d2c7-b946-427c-af11-b2e71f6cc322_1784x1620.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7UMH!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17a2d2c7-b946-427c-af11-b2e71f6cc322_1784x1620.png" width="1200" height="1089.5604395604396" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/17a2d2c7-b946-427c-af11-b2e71f6cc322_1784x1620.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1322,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The implementation of the LOAD_GLOBAL_BUILTIN instruction in CPython. It is a specialized version of the LOAD_GLOBAL instruction to directly lookup the builtins, skipping the globals check.&quot;,&quot;title&quot;:&quot;The implementation of the LOAD_GLOBAL_BUILTIN instruction in CPython. It is a specialized version of the LOAD_GLOBAL instruction to directly lookup the builtins, skipping the globals check.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="The implementation of the LOAD_GLOBAL_BUILTIN instruction in CPython. It is a specialized version of the LOAD_GLOBAL instruction to directly lookup the builtins, skipping the globals check." 
title="The implementation of the LOAD_GLOBAL_BUILTIN instruction in CPython. It is a specialized version of the LOAD_GLOBAL instruction to directly lookup the builtins, skipping the globals check." srcset="https://substackcdn.com/image/fetch/$s_!7UMH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17a2d2c7-b946-427c-af11-b2e71f6cc322_1784x1620.png 424w, https://substackcdn.com/image/fetch/$s_!7UMH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17a2d2c7-b946-427c-af11-b2e71f6cc322_1784x1620.png 848w, https://substackcdn.com/image/fetch/$s_!7UMH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17a2d2c7-b946-427c-af11-b2e71f6cc322_1784x1620.png 1272w, https://substackcdn.com/image/fetch/$s_!7UMH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17a2d2c7-b946-427c-af11-b2e71f6cc322_1784x1620.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The implementation of the <code>LOAD_GLOBAL_BUILTIN</code> instruction in CPython.</figcaption></figure></div><p>Let&#8217;s break down the highlighted parts:</p><ol><li><p>First, the instruction checks that the conditions under which <code>LOAD_GLOBAL</code> was specialized still hold. If they no longer hold, it falls back to the generic <code>LOAD_GLOBAL</code> implementation.</p></li><li><p>After that, it reads the cached index value. This index comes from the hash-based lookup it performed the last time it executed the generic <code>LOAD_GLOBAL</code>. In our example, it means that this instruction is specialized for looking up only the <code>len</code> function.</p></li><li><p>Next is the lookup in the builtins dictionary. This requires first getting access to the keys within the dictionary.</p></li><li><p>From the keys, it gets the array of entries in the internal hash table and indexes into it using the cached value. The entry it finds there holds the object we were trying to load.</p></li></ol><p>As you can see, an expensive hash table lookup turned into an array lookup using a known index, which is almost the same amount of work as the <code>LOAD_FAST</code> instruction. This is why, in newer CPython releases, we no longer need to apply this kind of optimization by hand, where we create a local variable for a global function or object. 
It automatically gets optimized.</p><p>But is this optimization of creating a local alias really obsolete? Maybe not. Let me show you another benchmark.</p><div><hr></div><h2>Benchmarking Imported Functions Vs Aliases</h2><p>Let&#8217;s now look at a similar benchmark, this time involving a function from an imported module rather than a builtin. Here&#8217;s what the code looks like:</p><pre><code><code>import timeit
import math

# Benchmark 1: Calling math.sin directly
def benchmark_math_qualified():
    for i in range(1000000):
        math.sin(i)

# Benchmark 2: Aliasing math.sin to a local variable
def benchmark_math_alias():
    mysin = math.sin
    for i in range(1000000):
        mysin(i)
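
# Hypothetical timing helper: not part of the original benchmark listing.
# It sketches, as an assumption, how the functions in this listing could be
# timed with the timeit module imported at the top. Call it only after every
# benchmark function is defined, e.g. time_benchmark(benchmark_math_alias).
def time_benchmark(func, repeat=5):
    # Each round calls func once; keep the fastest round as the estimate,
    # since slower rounds mostly reflect noise from the rest of the system.
    return min(timeit.repeat(func, number=1, repeat=repeat))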



# Benchmark 3: Calling sin imported via `from math import sin`
from math import sin
def benchmark_from_import():
    for i in range(1000000):
        sin(i)</code></code></pre><p>There are three benchmarks:</p><ol><li><p><strong>benchmark_math_qualified</strong>: calls <code>math.sin</code> directly</p></li><li><p><strong>benchmark_math_alias</strong>: creates a local alias <code>mysin</code> for <code>math.sin</code></p></li><li><p><strong>benchmark_from_import</strong>: uses <code>sin</code> imported via <code>from math import sin</code></p></li></ol><p>And the following table shows the results across the recent CPython releases.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-IJ6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12dae4-55f7-43c7-aecb-4fe631fcae09_1529x322.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-IJ6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12dae4-55f7-43c7-aecb-4fe631fcae09_1529x322.png 424w, https://substackcdn.com/image/fetch/$s_!-IJ6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12dae4-55f7-43c7-aecb-4fe631fcae09_1529x322.png 848w, https://substackcdn.com/image/fetch/$s_!-IJ6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12dae4-55f7-43c7-aecb-4fe631fcae09_1529x322.png 1272w, https://substackcdn.com/image/fetch/$s_!-IJ6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12dae4-55f7-43c7-aecb-4fe631fcae09_1529x322.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!-IJ6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12dae4-55f7-43c7-aecb-4fe631fcae09_1529x322.png" width="1456" height="307" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7f12dae4-55f7-43c7-aecb-4fe631fcae09_1529x322.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:307,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The benchmark results for accessing a function with fully qualified name, a locally aliased name and directly importing it from the module.&quot;,&quot;title&quot;:&quot;The benchmark results for accessing a function with fully qualified name, a locally aliased name and directly importing it from the module.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The benchmark results for accessing a function with fully qualified name, a locally aliased name and directly importing it from the module." title="The benchmark results for accessing a function with fully qualified name, a locally aliased name and directly importing it from the module." 
srcset="https://substackcdn.com/image/fetch/$s_!-IJ6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12dae4-55f7-43c7-aecb-4fe631fcae09_1529x322.png 424w, https://substackcdn.com/image/fetch/$s_!-IJ6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12dae4-55f7-43c7-aecb-4fe631fcae09_1529x322.png 848w, https://substackcdn.com/image/fetch/$s_!-IJ6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12dae4-55f7-43c7-aecb-4fe631fcae09_1529x322.png 1272w, https://substackcdn.com/image/fetch/$s_!-IJ6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12dae4-55f7-43c7-aecb-4fe631fcae09_1529x322.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The benchmark results for accessing a function with fully qualified name, a locally aliased name and directly importing it from the module.</figcaption></figure></div><p>In this case, we see that calling <code>math.sin</code> (fully qualified name) is slowest across the releases and creating an alias is fastest. While calling &#8220;<code>math.sin</code>&#8221; directly has gotten faster in recent Python versions, it still lags behind the alternatives in performance.</p><p>The performance gap here comes from how the function object is resolved when using a fully qualified name like <code>math.sin</code>. It turns into a two-level lookup. 
For example, the following figure shows the disassembly for calling <code>math.sin(10)</code>.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XmWQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7dd5bc2-373d-4787-8975-8c0cdd8d9660_1208x252.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XmWQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7dd5bc2-373d-4787-8975-8c0cdd8d9660_1208x252.png 424w, https://substackcdn.com/image/fetch/$s_!XmWQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7dd5bc2-373d-4787-8975-8c0cdd8d9660_1208x252.png 848w, https://substackcdn.com/image/fetch/$s_!XmWQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7dd5bc2-373d-4787-8975-8c0cdd8d9660_1208x252.png 1272w, https://substackcdn.com/image/fetch/$s_!XmWQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7dd5bc2-373d-4787-8975-8c0cdd8d9660_1208x252.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XmWQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7dd5bc2-373d-4787-8975-8c0cdd8d9660_1208x252.png" width="1208" height="252" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a7dd5bc2-373d-4787-8975-8c0cdd8d9660_1208x252.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:252,&quot;width&quot;:1208,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:46738,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/166575181?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7dd5bc2-373d-4787-8975-8c0cdd8d9660_1208x252.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XmWQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7dd5bc2-373d-4787-8975-8c0cdd8d9660_1208x252.png 424w, https://substackcdn.com/image/fetch/$s_!XmWQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7dd5bc2-373d-4787-8975-8c0cdd8d9660_1208x252.png 848w, https://substackcdn.com/image/fetch/$s_!XmWQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7dd5bc2-373d-4787-8975-8c0cdd8d9660_1208x252.png 1272w, https://substackcdn.com/image/fetch/$s_!XmWQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7dd5bc2-373d-4787-8975-8c0cdd8d9660_1208x252.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The bytecode disassembly for math.sin(10)</figcaption></figure></div><p>Notice that now the interpreter has to execute two instructions to load the function object 
on the stack: <code>LOAD_GLOBAL</code> followed by <code>LOAD_ATTR</code>. <code>LOAD_GLOBAL</code> loads the <code>math</code> module object onto the stack from the global scope. Then, <code>LOAD_ATTR</code> looks up the <code>sin</code> function in the <code>math</code> module and pushes the function object onto the stack.</p><p>Naturally, this requires more work, and the work grows with the number of lookup levels. For example, <code>foo.bar.baz()</code> requires three levels of lookups.</p><p>In recent Python releases, the performance of fully qualified invocation has also improved due to instruction specialization. However, the interpreter still has to execute multiple instructions, whereas with a local alias it executes a single <code>LOAD_FAST</code> instruction.</p><p>Whether it&#8217;s worth trading the readability of a fully qualified name, such as <code>math.sin</code>, for a small speedup by aliasing it to <code>mysin</code> depends on your goals. If that part of the code is performance-sensitive and profiling shows this line is a bottleneck, then it&#8217;s worth considering. Otherwise, readability might matter more.</p><div><hr></div><h2>Wrapping Up</h2><p>Aliasing global functions to local variables used to be a meaningful optimization. In earlier versions of Python, global lookups involved more overhead, and avoiding them made a measurable difference. With recent improvements in CPython, especially instruction specialization, that gap has narrowed in many cases.</p><p>Even so, not all lookups are equal. Accessing functions through a module or a deep attribute chain can still carry overhead. Creating a local alias or using <code>from module import name</code> continues to be effective in those situations.</p><p>The larger point is that optimizations don&#8217;t last forever. They depend on the details of the language runtime, which keeps evolving. 
What worked in the past might no longer matter today. If you want performance, it helps to understand how things actually work. That context makes it easier to know which tricks are worth keeping, and which ones you can leave behind in favor of cleaner, simpler code.</p>]]></content:encoded></item><item><title><![CDATA[Making System Calls in x86-64 Assembly]]></title><description><![CDATA[Watch now | Privilege levels, syscall conventions, and how assembly code talks to the Linux kernel]]></description><link>https://blog.codingconfessions.com/p/making-system-calls-in-x86-64-assembly</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/making-system-calls-in-x86-64-assembly</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Mon, 16 Jun 2025 17:44:06 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/166083966/dbfe5ec31f037b047510fa4df0a90f14.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<h2><strong>Introduction</strong></h2><p>In the previous article, we learned to<a href="https://blog.codingconfessions.com/p/debugging-x86-64-assembly-with-gdb"> use gdb</a> and used it to debug our crashing program. Eventually, we discovered that after executing the last instruction, the CPU didn&#8217;t know the program had ended. It continued reading and executing past the end of the .text section in memory, causing a crash. So, we need some way to make our process exit or stop the execution before that happens. 
How can we do that?</p><p>When we ran our program using the shell command <code>./false</code>, it was the shell that invoked the <a href="https://man7.org/linux/man-pages/man2/fork.2.html">fork</a> and <a href="https://man7.org/linux/man-pages/man2/execve.2.html">execve</a> system calls. These created a new process, loaded our program into memory, and scheduled it for execution on the CPU. Similarly, to terminate our program gracefully, we need to invoke another system call that tells the kernel our process is done.</p><p>This system call to exit a process is called &#8220;<code>exit</code>&#8221;. When we write code in high-level languages, the runtime automatically invokes it after the main function returns. However, when writing freestanding assembly, we need to do it ourselves. For that, we need to learn how to call syscalls from assembly.</p><p>In this part, we will:</p><ul><li><p>Understand what system calls are</p></li><li><p>Learn how to invoke them in assembly</p></li><li><p>Fix our crashing program step-by-step</p></li><li><p>Write a second assembly program using getpid</p></li><li><p>Hands-on exercise: a limited version of the kill command </p></li></ul><div><hr></div><blockquote><p><strong>Recap</strong>: <em>If you haven&#8217;t seen the previous articles in the series, here&#8217;s what you have missed:</em></p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;58f82a9d-ca8e-4f2a-b0cf-054daf674c92&quot;,&quot;caption&quot;:&quot;&#8220;Do not try to bend the spoon. That's impossible. Instead, only try to realize the truth... 
there is no spoon.&#8221; &#8212; The Matrix&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Understanding Computer Organization from First Principles&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-04-05T17:54:52.832Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9c6b6f3-e65a-46be-ada9-68a166fbfcf8_1024x1536.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/seeing-the-matrix&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:160249113,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:105,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div 
class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;81e8a5a6-342b-439a-9366-8ebcf8273649&quot;,&quot;caption&quot;:&quot;We wrapped up the X86-64 assembly course last week, and I&#8217;ll be sharing notes from the sessions here as a series of articles. While the live sessions covered much more ground, I think you&#8217;ll find these write-ups valuable in their own right. I&#8217;ll be publishing them gradually over the next few weeks.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The System-Level Foundation of Assembly&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-05-05T08:36:27.146Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/77cbf196-9752-4ec4-a24e-0ad8e8124cb3_627x551.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/the-system-level-foundation-of-assembly&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:162823255,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:24,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code 
Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;bfe1386d-57ea-464f-9467-a2903fba9652&quot;,&quot;caption&quot;:&quot;In our previous article, we explored how computers work from transistors up to program execution. We saw how digital circuits built from logic gates perform calculations using binary data, and how the ALU executes operations on this binary representation.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Binary Arithmetic and Bitwise Operations for Systems Programming&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. 
I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-04-12T05:16:14.645Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/binary-arithmetic-and-bitwise-operations&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:161089202,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:27,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;54c2febb-0eaf-425d-a04a-0ef5ee1beae1&quot;,&quot;caption&quot;:&quot;Introduction&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Building (and Breaking) Your First X86 Assembly Program&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav 
Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-05-16T14:33:57.354Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc44d6555-a755-40ae-b180-dfccbddcaad2_1024x1024.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/building-and-breaking-your-first&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:160056784,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:8,&quot;comment_count&quot;:3,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;faf6c8b7-7250-4fc8-ba12-b932c6c37ff6&quot;,&quot;caption&quot;:&quot;We ended the last article with a minimal x86-64 assembly program that assembled and ran, but then crashed with a segmentation fault. 
Before we move on to fix that properly, this is a good opportunity to step back and understand how to debug such issues.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Debugging X86-64 Assembly with GDB&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-05-26T18:13:24.190Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/714d7235-7026-4e6c-a309-354ebada3991_1024x1024.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/debugging-x86-64-assembly-with-gdb&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:164500692,&quot;type&quot;:&quot;podcast&quot;,&quot;reaction_count&quot;:30,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code 
Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div></blockquote><div><hr></div><p><em><strong>This article is part of a paid subscriber series.</strong><br>If you&#8217;re enjoying the content, please consider upgrading to a paid plan to unlock the rest of this series. Paid subscribers also get discounted access to courses and books, and the rest of the archive. </em></p><p><em>Alternatively, you can purchase an ebook version of this series. (If you&#8217;re already a paid subscriber, email me for a discounted link.)</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://codingconfessions.gumroad.com/l/ychdk&quot;,&quot;text&quot;:&quot;I Want the PDF&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://codingconfessions.gumroad.com/l/ychdk"><span>I Want the PDF</span></a></p><div><hr></div><h2><strong>Understanding System Calls</strong></h2><p>Before we learn how to invoke system calls, let&#8217;s first understand why they exist.</p><p>Modern operating systems serve two roles: they manage the execution of programs on the CPU and provide safe, unified access to hardware resources like files, memory, and networks. But application code cannot directly access these hardware features. Why not?</p><p>There are three main reasons:</p><ul><li><p><strong>Hardware abstraction</strong>: Devices vary widely in design and interface. The OS hides this complexity by exposing a uniform way to access them. Whether you&#8217;re reading from an SSD or a magnetic disk, you use the same system call (read), and the OS handles the details.</p></li><li><p><strong>Portability</strong>: Most modern OSes follow the <a href="https://en.wikipedia.org/wiki/POSIX">POSIX</a> standard, which defines a consistent set of system calls. If your application uses only POSIX-compliant syscalls, it can compile and run on any compliant OS with minimal changes.</p></li><li><p><strong>Security</strong>: If user programs could directly access memory or I/O devices, they could corrupt system state or access other processes&#8217; data. System calls act as a controlled gateway; only kernel code (running at a higher privilege level) is allowed to interact directly with hardware.</p></li></ul><p>This separation of privilege is enforced by the CPU. On x86, the kernel runs in ring 0 (full privilege), while user programs run in ring 3 (restricted mode). All system calls are implemented inside the kernel at ring 0. To invoke them from ring 3, user space programs need a way to trigger a transition into kernel mode using a mechanism provided by the CPU.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Vkx7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf77686-747a-47bd-9608-ebd6b479dfa8_911x641.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Vkx7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf77686-747a-47bd-9608-ebd6b479dfa8_911x641.png 424w, https://substackcdn.com/image/fetch/$s_!Vkx7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf77686-747a-47bd-9608-ebd6b479dfa8_911x641.png 
848w, https://substackcdn.com/image/fetch/$s_!Vkx7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf77686-747a-47bd-9608-ebd6b479dfa8_911x641.png 1272w, https://substackcdn.com/image/fetch/$s_!Vkx7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf77686-747a-47bd-9608-ebd6b479dfa8_911x641.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Vkx7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf77686-747a-47bd-9608-ebd6b479dfa8_911x641.png" width="911" height="641" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9bf77686-747a-47bd-9608-ebd6b479dfa8_911x641.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:641,&quot;width&quot;:911,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The protection rings in x86 architecture. Kernel runs at ring-0 level which is the highest privilege mode, while user space in ring-3 which is the least privilege mode&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The protection rings in x86 architecture. Kernel runs at ring-0 level which is the highest privilege mode, while user space in ring-3 which is the least privilege mode" title="The protection rings in x86 architecture. 
Kernel runs at ring-0 level which is the highest privilege mode, while user space in ring-3 which is the least privilege mode" srcset="https://substackcdn.com/image/fetch/$s_!Vkx7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf77686-747a-47bd-9608-ebd6b479dfa8_911x641.png 424w, https://substackcdn.com/image/fetch/$s_!Vkx7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf77686-747a-47bd-9608-ebd6b479dfa8_911x641.png 848w, https://substackcdn.com/image/fetch/$s_!Vkx7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf77686-747a-47bd-9608-ebd6b479dfa8_911x641.png 1272w, https://substackcdn.com/image/fetch/$s_!Vkx7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf77686-747a-47bd-9608-ebd6b479dfa8_911x641.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The protection rings in the x86 architecture. The kernel runs in ring 0, the most privileged mode, while user space runs in ring 3, the least privileged mode.</figcaption></figure></div><h2><strong>Invoking System Calls on x86-64</strong></h2>
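<p>As a rough preview of the idea, here is a minimal sketch in C rather than assembly. It assumes Linux with glibc, whose generic <code>syscall()</code> wrapper places the system call number and arguments where the kernel expects them and then executes the CPU instruction that switches into kernel mode. The helper name <code>write_via_syscall</code> is hypothetical and exists only for this illustration.</p><pre><code>#define _GNU_SOURCE       /* expose the syscall() declaration in glibc */
#include &lt;string.h&gt;       /* strlen */
#include &lt;sys/syscall.h&gt;  /* SYS_write */
#include &lt;unistd.h&gt;       /* syscall(), STDOUT_FILENO */

/*
 * Hypothetical helper (an assumption for illustration): asks the kernel to
 * write a NUL-terminated string to the file descriptor fd. Returns the
 * number of bytes written, or -1 with errno set on failure.
 */
long write_via_syscall(int fd, const char *msg) {
    /* SYS_write is the system call number; the remaining arguments are
     * passed through to the kernel unchanged. */
    return syscall(SYS_write, fd, msg, strlen(msg));
}
</code></pre><p>Calling <code>write_via_syscall(STDOUT_FILENO, "hello\n")</code> from ring 3 hands control to the kernel, which performs the write at ring 0 and returns the byte count to the caller. Passing an invalid descriptor such as -1 makes the kernel reject the request, and the wrapper returns -1.</p>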
      <p>
          <a href="https://blog.codingconfessions.com/p/making-system-calls-in-x86-64-assembly">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[One Law to Rule Them All: The Iron Law of Software Performance]]></title><description><![CDATA[A systems-level reasoning model for understanding why optimizations succeed or fail.]]></description><link>https://blog.codingconfessions.com/p/one-law-to-rule-all-code-optimizations</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/one-law-to-rule-all-code-optimizations</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Sun, 08 Jun 2025 17:27:05 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!40OK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc20805-ddd6-496c-ae7c-9d1cf9d28318_1536x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="pullquote"><p>&#8220;One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them.&#8221;</p><p>&#8212; <em>J.R.R. Tolkien</em></p></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!40OK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc20805-ddd6-496c-ae7c-9d1cf9d28318_1536x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!40OK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc20805-ddd6-496c-ae7c-9d1cf9d28318_1536x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!40OK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc20805-ddd6-496c-ae7c-9d1cf9d28318_1536x1024.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!40OK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc20805-ddd6-496c-ae7c-9d1cf9d28318_1536x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!40OK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc20805-ddd6-496c-ae7c-9d1cf9d28318_1536x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!40OK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc20805-ddd6-496c-ae7c-9d1cf9d28318_1536x1024.jpeg" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3fc20805-ddd6-496c-ae7c-9d1cf9d28318_1536x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:148320,&quot;alt&quot;:&quot;One law to rule them all: the iron law of performance&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/164921281?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc20805-ddd6-496c-ae7c-9d1cf9d28318_1536x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="One law to rule them all: the iron law of performance" title="One law to rule them all: the iron law of performance" srcset="https://substackcdn.com/image/fetch/$s_!40OK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc20805-ddd6-496c-ae7c-9d1cf9d28318_1536x1024.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!40OK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc20805-ddd6-496c-ae7c-9d1cf9d28318_1536x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!40OK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc20805-ddd6-496c-ae7c-9d1cf9d28318_1536x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!40OK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fc20805-ddd6-496c-ae7c-9d1cf9d28318_1536x1024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">One law to rule them all: the iron law of performance</figcaption></figure></div><p>Software optimizations are messy and often unpredictable. Whether you see a win is not guaranteed, and the reasons are usually unclear. Is there a way to reason about this?</p><p>Maybe there is. In this article, I show you <strong>one law</strong> that explains all low-level code optimizations: when they work, and when they don&#8217;t. It&#8217;s based on the <em><a href="https://en.wikipedia.org/wiki/Iron_law_of_processor_performance">Iron Law of Performance</a></em>, a model widely known in the hardware world but relatively obscure in software circles.</p><p>What we&#8217;ll see is that almost every low-level optimization, whether it's loop unrolling, SIMD vectorization, or branch elimination, ultimately affects just <strong>three metrics</strong>: the number of instructions executed, the number of cycles needed to execute them, and the duration of a single cycle. The Iron Law ties them together and gives us a <strong>unified reasoning model</strong> for software performance.</p><blockquote><p><em>Of course, not all software optimizations fit into this model. Things like algorithmic improvements, contention removal, or language-level tuning (like garbage collection) lie outside its scope. 
I&#8217;m not claiming the Iron Law explains those.</em></p></blockquote><p><strong>What&#8217;s inside:</strong></p><ul><li><p><em>The Iron Law of Performance for software</em></p></li><li><p><em>Loop unrolling: reducing dynamic instructions</em></p></li><li><p><em>Function inlining: boosting IPC through linearization</em></p></li><li><p><em>SIMD vectorization: trading instruction count for complexity</em></p></li><li><p><em>Branch prediction: reducing pipeline flushes</em></p></li><li><p><em>Cache misses: backend stalls and instruction throughput</em></p></li><li><p><em>A reasoning framework to guide optimization decisions</em></p></li></ul><div><hr></div><h2><a href="https://coderabbit.link/abhinav">Cut Code Review Time &amp; Bugs in Half (Sponsored)</a></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://coderabbit.link/abhinav" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DpxV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png 424w, https://substackcdn.com/image/fetch/$s_!DpxV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png 848w, https://substackcdn.com/image/fetch/$s_!DpxV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png 1272w, https://substackcdn.com/image/fetch/$s_!DpxV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!DpxV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://coderabbit.link/abhinav&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!DpxV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png 424w, https://substackcdn.com/image/fetch/$s_!DpxV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png 848w, https://substackcdn.com/image/fetch/$s_!DpxV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png 1272w, https://substackcdn.com/image/fetch/$s_!DpxV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ddb0d8c-1669-4bda-86b9-02ee3810915f_1600x800.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Get Started with Code Rabbit Today to Simplify your Code Reviews</figcaption></figure></div><p>Code reviews are critical but time-consuming. CodeRabbit acts as your AI co-pilot, providing instant Code review comments and potential impacts of every pull request.</p><p>Beyond just flagging issues, CodeRabbit provides one-click fix suggestions and lets you define custom code quality rules using AST Grep patterns, catching subtle issues that traditional static analysis tools might miss.</p><p>CodeRabbit has so far reviewed more than 10 million PRs, installed on 1 million repositories, and used by 70 thousand Open-source projects. 
CodeRabbit is free for all open-source repos.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://coderabbit.link/abhinav&quot;,&quot;text&quot;:&quot;Get Started Today&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://coderabbit.link/abhinav"><span>Get Started Today</span></a></p><div><hr></div><h5>Background Read:</h5><h5>This article assumes some knowledge of CPU microarchitecture, and optimization techniques such as branch elimination, loop unrolling etc. If you are unfamiliar with these, I recommend the following two articles:</h5><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;cad94cee-2810-4d2c-ba33-bb636d52ebd0&quot;,&quot;caption&quot;:&quot;Even the most elegant algorithms can run painfully slow when they fight against your computer's underlying hardware. The difference between mediocre and exceptional performance often comes down to whether your code works with, or against the CPU's architecture.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Hardware-Aware Coding: CPU Architecture Concepts Every Developer Should Know&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. 
I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-03-21T11:11:05.104Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e1f511d0-2519-4282-bdfd-21af1c5b744d_1472x832.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/hardware-aware-coding&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:158157210,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:115,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;d2c22a1d-aab8-49e8-a86c-ccee9c9bc67d&quot;,&quot;caption&quot;:&quot;Simultaneous multithreading (SMT) is a feature that lets a processor handle instructions from two different threads at the same time. But have you ever wondered how this actually works? 
How does the processor keep track of two threads and manage its resources between them?&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Two Threads, One Core: How Simultaneous Multithreading Works Under the Hood&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2024-07-24T10:28:38.815Z&quot;,&quot;cover_image&quot;:&quot;https://images.unsplash.com/photo-1465447142348-e9952c393450?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw1fHxmb3JrZWQlMjByb2FkfGVufDB8fHx8MTcyMTgxMDM4N3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/simultaneous-multithreading&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:146234191,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:52,&quot;comment_count&quot;:3,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code 
Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><h2>The Iron Law of Performance (Hardware)</h2><p>First, let&#8217;s start by understanding the Iron Law for hardware performance. It is a simple equation that models the performance of the hardware in the context of executing a program. Performance depends on three factors:</p><ol><li><p>Number of instructions executed (also known as the dynamic instruction count)</p></li><li><p>Average number of cycles needed to execute those instructions (cycles per instruction or CPI)</p></li><li><p>Time taken to execute a single CPU cycle (clock cycle time)</p></li></ol><p>The following equation defines the law:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n\\text{Performance} = \\frac{1}{\\text{Instruction Count} \\times \\text{CPI} \\times \\text{Clock Cycle Time}}\n\n&quot;,&quot;id&quot;:&quot;VDTAZJBBBV&quot;}" data-component-name="LatexBlockToDOM"></div><p>CPU architects use this to analyse how an architectural change impacts the performance of the processor. For example, should they increase the depth of the instruction pipeline?</p><p>In a pipelined processor, an instruction moves from one pipeline stage to another in one cycle. Naturally, the cycle time is dependent on the slowest pipeline stage. When you increase the pipeline depth, you break down some of the stages into more granular parts, thus reducing the work done in each stage and in turn reducing the cycle time. It means that now the processor can execute more cycles per second.</p><p>However, increasing pipeline depth also raises the penalty of cache and branch misses. 
For example, accessing main memory still takes about 100&#8239;ns, which translates to 100 cycles at 1&#8239;GHz but doubles to about 200 cycles at 2&#8239;GHz when cycle time is halved. Likewise, deepening the pipeline from 15 to 20 stages also increases the branch misprediction penalty from ~15 to ~20 cycles.</p><p>These increased latencies and penalties make the average CPI go up as well. So, whether the pipeline depth should be increased and by how much depends on the overall tradeoff. The iron law gives a very simple framework to make these decisions.</p><p>When you do low-level software performance optimizations, similar tradeoffs apply. Every optimization affects the program instruction count, cycles per instruction, and sometimes even the CPU clock frequency. So, it makes sense to apply the same model to analyse and reason about software-level optimizations as well. Let&#8217;s try to do that in the next few sections.</p><div><hr></div><h2>The Iron Law of Performance for Software</h2><p>In the context of software, we can slightly tweak the law to the following form:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Performance} \\propto \\frac{IPC}{\\text{Instruction count} \\times {\\text{Clock Cycle Time}}}&quot;,&quot;id&quot;:&quot;SRGAQYWIFQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, we&#8217;ve replaced <em>CPI</em> (cycles per instruction) with <em>IPC</em> (instructions per cycle). Although they&#8217;re mathematical inverses, IPC is more intuitive for software engineers: for example, modern x86 processors can retire up to 4 instructions per cycle, so IPC gives a clearer sense of how close we are to the peak throughput.</p><p>We&#8217;ve also relaxed the equality to a proportionality. 
When optimizing software, we&#8217;re not looking for exact numbers; rather, we&#8217;re reasoning about trade-offs.</p><p>So, what role do these three terms play in software performance?</p><ul><li><p><strong>Increasing IPC</strong> means the CPU can retire more instructions per cycle, reducing total execution time.</p></li><li><p><strong>Lowering the dynamic instruction count</strong> means fewer instructions need to be executed overall. In general, this means the CPU needs to do less work and performance should go up.</p></li><li><p><strong>Lowering the clock frequency</strong>, as sometimes happens with power-hungry instructions (e.g., AVX-512), increases cycle time and harms performance.</p></li></ul><p>We&#8217;ll now apply this model to analyze several well-known optimizations: loop unrolling, function inlining, SIMD vectorization, branch elimination, and cache optimizations, and see how each one shifts the Iron Law variables.</p><blockquote><p><strong>Note</strong>: These factors are not independent of each other. For example, when you reduce the dynamic instruction count of your program, you need to be careful about instruction selection. </p><p>As an example, integer additions have a very low latency, whereas integer divisions are expensive. So, if you reduce the instruction count in your program by replacing a large number of integer additions with a small number of integer divisions, your performance may not improve, or it could even degrade. It depends on the tradeoff between the decrease in instruction count and the drop in IPC; whichever factor wins dictates the performance.</p></blockquote><div><hr></div><h2>Loop Unrolling</h2><p>Loop unrolling is a classic optimization. Instead of executing one step of the loop body per iteration, you rewrite the loop to execute multiple steps per iteration. Consider the following loop that computes the sum of an integer array.</p><pre><code>int sum = 0;
for (int i = 0; i &lt; n; i++) {
    sum += arr[i];
}</code></pre><p>If we unroll this loop four times, it will look like the following:</p><pre><code>int sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
int i = 0;

// Process 4 elements at a time
for (; i + 3 &lt; n; i += 4) {
    sum0 += arr[i];
    sum1 += arr[i + 1];
    sum2 += arr[i + 2];
    sum3 += arr[i + 3];
}

// Combine the partial sums, then handle the remaining (n % 4) elements
int sum = sum0 + sum1 + sum2 + sum3;
for (; i &lt; n; i++) {
    sum += arr[i];
}</code></pre><p>Now, let&#8217;s reason about how such an optimization can improve the performance of a program and what tradeoffs to consider, i.e., in what situations it may not deliver performance improvements.</p><blockquote><p><em><strong>Note: Usually, you don't have to unroll a loop yourself.</strong> The compiler does it when it expects better performance. But sometimes it may not, because with its limited knowledge of the code it cannot guarantee that the transformation preserves program correctness. So it is useful to be aware of it.</em></p></blockquote><h3>Impact on Instruction Count</h3><p>For large <code>n</code>, unrolling reduces the dynamic instruction count. In the example shown above, each iteration of the loop performs three operations: a comparison for the loop condition, an increment of the loop counter, and an update of the sum. Counting each as one instruction, the normal loop executes about <code>3n</code> instructions.</p><p>A four-times unrolled loop executes one loop comparison, one loop index increment, and four additions per iteration. That is six instructions per iteration, or <code>6n/4 = 1.5n</code> instructions for an array of size <code>n</code>.</p><p>In Iron Law terms, we&#8217;ve driven down the <em>Instruction Count</em> by nearly 50 % (for large <code>n</code>), all else equal. This sounds like an obvious performance win, but we also need to look at how this impacts the IPC.</p><h3>Impact on IPC</h3><p>Recall from our Iron Law that, for a fixed clock cycle time, <em>Performance &#8733; IPC / Instruction Count</em>. We&#8217;ve already reduced instruction count, so if we can raise IPC (or at least not lower it), net performance improves. Let&#8217;s see how unrolling affects IPC.</p><h4>Increased Instruction-Level Parallelism</h4><p>The main advantage of loop unrolling is the potential increase in the instruction throughput of the program. 
The processor is capable of executing multiple instructions per cycle; e.g., many modern Intel processors can execute 4 or more instructions every cycle.</p><p>However, to achieve that kind of throughput, the processor needs to have enough independent instructions to execute, which is often hard to come by. Usually, instructions have dependencies between them, i.e., the result produced by one instruction is consumed by the next. Such instructions cannot be executed in parallel.</p><p>For example, consider the assembly (generated by GCC with the <code>-O1</code> flag) for the body of the normal for loop shown previously (without loop unrolling):</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hrqU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dcac4b6-602a-42fc-8d72-6aa0526bee41_847x170.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hrqU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dcac4b6-602a-42fc-8d72-6aa0526bee41_847x170.png 424w, https://substackcdn.com/image/fetch/$s_!hrqU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dcac4b6-602a-42fc-8d72-6aa0526bee41_847x170.png 848w, https://substackcdn.com/image/fetch/$s_!hrqU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dcac4b6-602a-42fc-8d72-6aa0526bee41_847x170.png 1272w, https://substackcdn.com/image/fetch/$s_!hrqU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dcac4b6-602a-42fc-8d72-6aa0526bee41_847x170.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!hrqU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dcac4b6-602a-42fc-8d72-6aa0526bee41_847x170.png" width="847" height="170" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3dcac4b6-602a-42fc-8d72-6aa0526bee41_847x170.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:170,&quot;width&quot;:847,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:25042,&quot;alt&quot;:&quot;Assembly for the body of the normal loop (without unrolling)&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/164921281?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dcac4b6-602a-42fc-8d72-6aa0526bee41_847x170.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Assembly for the body of the normal loop (without unrolling)" title="Assembly for the body of the normal loop (without unrolling)" srcset="https://substackcdn.com/image/fetch/$s_!hrqU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dcac4b6-602a-42fc-8d72-6aa0526bee41_847x170.png 424w, https://substackcdn.com/image/fetch/$s_!hrqU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dcac4b6-602a-42fc-8d72-6aa0526bee41_847x170.png 848w, https://substackcdn.com/image/fetch/$s_!hrqU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dcac4b6-602a-42fc-8d72-6aa0526bee41_847x170.png 1272w, 
https://substackcdn.com/image/fetch/$s_!hrqU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dcac4b6-602a-42fc-8d72-6aa0526bee41_847x170.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Assembly for the body of the normal loop (without unrolling)</figcaption></figure></div><p>Let me explain what is going on:</p><ul><li><p>The register <code>edx</code> holds the current value of the sum and at every iteration the value of <code>arr[i]</code> gets added to it.</p></li></ul><ul><li><p>The register <code>rax</code> holds the address of the current array element, <code>arr[i]</code>. At the beginning of the loop iteration, the value of <code>arr[i]</code> gets added to the sum value in <code>edx</code>. Then, in the second instruction, <code>rax</code> gets incremented by 4, which means <code>rax</code> now contains the address of the next array element, <code>arr[i + 1]</code>.</p></li></ul><ul><li><p>Finally, the last two instructions check if we have reached the end of the array, and if not, jump back to the beginning of the loop.</p></li></ul><p>So, for a large array, the CPU is going to be executing these four instructions for a while. Can it execute some of these in parallel so that the loop finishes faster? Not quite. The instructions have dependencies between them that stop the CPU from doing that. </p><p>Notice that the first <code>addl</code> instruction that updates the sum in <code>edx</code> depends on the previous iteration's <code>edx</code> and <code>rax</code> values, so the CPU can't issue the next iteration's <code>addl</code> until the previous iteration's <code>addl</code> and <code>addq</code> instructions are finished. In other words, every sum update waits on the previous one, so the CPU has very little independent work to execute in parallel.</p><p>Loop unrolling fixes this problem. 
The following assembly code is the loop body for the unrolled code shown previously.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RVmP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14394ab3-0121-4cc1-8ddf-dab987ba31d1_753x245.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RVmP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14394ab3-0121-4cc1-8ddf-dab987ba31d1_753x245.png 424w, https://substackcdn.com/image/fetch/$s_!RVmP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14394ab3-0121-4cc1-8ddf-dab987ba31d1_753x245.png 848w, https://substackcdn.com/image/fetch/$s_!RVmP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14394ab3-0121-4cc1-8ddf-dab987ba31d1_753x245.png 1272w, https://substackcdn.com/image/fetch/$s_!RVmP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14394ab3-0121-4cc1-8ddf-dab987ba31d1_753x245.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RVmP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14394ab3-0121-4cc1-8ddf-dab987ba31d1_753x245.png" width="753" height="245" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/14394ab3-0121-4cc1-8ddf-dab987ba31d1_753x245.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:245,&quot;width&quot;:753,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43969,&quot;alt&quot;:&quot;Assembly instructions for the four-times unrolled loop body&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/164921281?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14394ab3-0121-4cc1-8ddf-dab987ba31d1_753x245.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Assembly instructions for the four-times unrolled loop body" title="Assembly instructions for the four-times unrolled loop body" srcset="https://substackcdn.com/image/fetch/$s_!RVmP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14394ab3-0121-4cc1-8ddf-dab987ba31d1_753x245.png 424w, https://substackcdn.com/image/fetch/$s_!RVmP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14394ab3-0121-4cc1-8ddf-dab987ba31d1_753x245.png 848w, https://substackcdn.com/image/fetch/$s_!RVmP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14394ab3-0121-4cc1-8ddf-dab987ba31d1_753x245.png 1272w, https://substackcdn.com/image/fetch/$s_!RVmP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14394ab3-0121-4cc1-8ddf-dab987ba31d1_753x245.png 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"></div></div></a><figcaption class="image-caption">Assembly instructions for the four-times unrolled loop body</figcaption></figure></div><p>In the unrolled loop, the compiler assigns each partial sum to its own register (<code>edx</code>, <code>edi</code>, <code>esi</code>, <code>ecx</code>), so each <code>addl</code> instruction uses a different register and memory address. 
This makes these instructions independent, so the CPU can issue and execute them in parallel, reducing the number of cycles needed to finish the loop and improving the IPC.</p><p>In Iron Law terms, we&#8217;ve increased <em>IPC</em> from roughly 1.0 (due to dependencies) to perhaps ~3.0 or higher, depending on execution port availability. Combined with a 50 % drop in instruction count, that yields a significant net gain.</p><blockquote><p><em><strong>Note</strong>: How many of these add instructions actually execute in parallel depends on how many functional units the CPU has for integer addition. So how much unrolling to do also depends on the kinds of instructions being executed.</em></p></blockquote><p>However, it isn&#8217;t all rosy and shiny. Loop unrolling can also hamper the IPC in other ways. Let&#8217;s see how.</p><h4><strong>Register Spills Due to Increased Loop Body Size</strong></h4><p>Unrolling the loop creates many local variables and increases the demand for registers. When a loop is unrolled too many times, or when a large, complicated loop is unrolled, the result can be register spills: the compiler runs out of registers and has to keep some variables on the stack.</p><p>When register spills happen, the instructions that read data from the stack instead of registers take longer to finish. While a value can be read from a register within a single cycle, reading from the stack can take 3-4 cycles (assuming an L1 cache hit).</p><p>So, operations that could be done in a single cycle will now take several cycles due to memory access. In such situations, if the CPU doesn&#8217;t have other instructions to execute, it will sit idle and waste resources. This increases the average cycles per instruction and lowers the IPC.</p><p>In Iron Law terms, register spills reduce <em>IPC</em>, which can partially or completely negate the instruction count reduction. 
You must weigh these together.</p><blockquote><p><strong>Note: </strong><em>A register spill does not necessarily lower the IPC, because sometimes the compiler can schedule the instructions better to keep the CPU busy while other instructions are stalled on memory access. But it is a tradeoff that you risk introducing when being too aggressive with this optimization.</em></p></blockquote><h4>Instruction Cache Pressure Due to Increased Loop Body Size</h4><p>Another potential impact of unrolling the loop is the increased code footprint, which can put pressure on the instruction cache. These days instruction caches are large enough that unrolling a loop will not result in cache misses for the loop itself, but in larger programs where there is existing pressure on the instruction cache, this may cause eviction of other instructions.</p><p>For example, most x86 cores have a 32-64 KB L1 I-cache. For a loop body consisting of four instructions of 4 bytes each (16 bytes), unrolling it 4 times grows the code to roughly 64 bytes, which is negligible.</p><p>So, in general it is not a huge concern for high-end CPUs. 
But we still need to be aware of the tradeoff from Iron Law's perspective because increased instruction cache misses lower how fast the CPU frontend can send instructions to the backend for execution, thus lowering the IPC.</p><h3>Loop Unrolling from the Lens of the Iron Law</h3><p>Below is a summary of trade-offs when viewing loop unrolling through the Iron Law.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_6zb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d19f2-3925-4bc6-bffc-c23395259bf0_1203x236.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_6zb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d19f2-3925-4bc6-bffc-c23395259bf0_1203x236.png 424w, https://substackcdn.com/image/fetch/$s_!_6zb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d19f2-3925-4bc6-bffc-c23395259bf0_1203x236.png 848w, https://substackcdn.com/image/fetch/$s_!_6zb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d19f2-3925-4bc6-bffc-c23395259bf0_1203x236.png 1272w, https://substackcdn.com/image/fetch/$s_!_6zb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d19f2-3925-4bc6-bffc-c23395259bf0_1203x236.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_6zb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d19f2-3925-4bc6-bffc-c23395259bf0_1203x236.png" width="1203" height="236" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/029d19f2-3925-4bc6-bffc-c23395259bf0_1203x236.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:236,&quot;width&quot;:1203,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Summary of loop unrolling and its tradeoffs from Iron Law&#8217;s lens&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Summary of loop unrolling and its tradeoffs from Iron Law&#8217;s lens" title="Summary of loop unrolling and its tradeoffs from Iron Law&#8217;s lens" srcset="https://substackcdn.com/image/fetch/$s_!_6zb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d19f2-3925-4bc6-bffc-c23395259bf0_1203x236.png 424w, https://substackcdn.com/image/fetch/$s_!_6zb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d19f2-3925-4bc6-bffc-c23395259bf0_1203x236.png 848w, https://substackcdn.com/image/fetch/$s_!_6zb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d19f2-3925-4bc6-bffc-c23395259bf0_1203x236.png 1272w, https://substackcdn.com/image/fetch/$s_!_6zb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d19f2-3925-4bc6-bffc-c23395259bf0_1203x236.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Summary of loop unrolling and its tradeoffs from Iron Law&#8217;s lens</figcaption></figure></div><div><hr></div><div 
class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.codingconfessions.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"></div><h2>Function Inlining</h2><p>Next, let&#8217;s see how function inlining shifts the balance between instruction count and IPC. It is a simple optimization that compilers routinely perform: the compiler replaces a function call with the body of the function being called, avoiding the overhead of the call. 
Let&#8217;s understand with an example.</p><p>Consider the following C function and its assembly (generated by GCC with -O1)</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EgrZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef28856-ae77-4685-8a8a-8cf8cff37be4_1234x397.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EgrZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef28856-ae77-4685-8a8a-8cf8cff37be4_1234x397.png 424w, https://substackcdn.com/image/fetch/$s_!EgrZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef28856-ae77-4685-8a8a-8cf8cff37be4_1234x397.png 848w, https://substackcdn.com/image/fetch/$s_!EgrZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef28856-ae77-4685-8a8a-8cf8cff37be4_1234x397.png 1272w, https://substackcdn.com/image/fetch/$s_!EgrZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef28856-ae77-4685-8a8a-8cf8cff37be4_1234x397.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EgrZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef28856-ae77-4685-8a8a-8cf8cff37be4_1234x397.png" width="1234" height="397" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eef28856-ae77-4685-8a8a-8cf8cff37be4_1234x397.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:397,&quot;width&quot;:1234,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:79054,&quot;alt&quot;:&quot;A simple C function and its compiler generated assembly &quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/164921281?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef28856-ae77-4685-8a8a-8cf8cff37be4_1234x397.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A simple C function and its compiler generated assembly " title="A simple C function and its compiler generated assembly " srcset="https://substackcdn.com/image/fetch/$s_!EgrZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef28856-ae77-4685-8a8a-8cf8cff37be4_1234x397.png 424w, https://substackcdn.com/image/fetch/$s_!EgrZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef28856-ae77-4685-8a8a-8cf8cff37be4_1234x397.png 848w, https://substackcdn.com/image/fetch/$s_!EgrZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef28856-ae77-4685-8a8a-8cf8cff37be4_1234x397.png 1272w, https://substackcdn.com/image/fetch/$s_!EgrZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef28856-ae77-4685-8a8a-8cf8cff37be4_1234x397.png 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"></div></div></a><figcaption class="image-caption">A simple C function and its compiler-generated assembly</figcaption></figure></div><p>Function calls incur hidden costs beyond the core logic. When this function executes, several additional steps occur:</p><ul><li><p><strong>Stack Frame Setup</strong>: A stack frame needs to be set up for the function to manage its local data on the stack. At a minimum, this requires saving the <code>rbp</code> register on the stack and then copying the current value of the <code>rsp</code> register (the stack pointer) into <code>rbp</code>. So, at least two instructions. 
These days compilers optimize this away if they notice that the function doesn&#8217;t use the stack, but it is not guaranteed.</p></li></ul><ul><li><p><strong>Saving and Restoring Callee-Saved Registers</strong>: As per the <a href="https://wiki.osdev.org/System_V_ABI">System V AMD64 calling convention</a>, certain registers must be saved by the callee function if it needs to use them. This is needed because those registers might be in use by the caller, and if the callee doesn&#8217;t preserve the previous values, the caller&#8217;s state will be corrupted. So, sometimes you will notice code to save and restore these registers as well. In the case of the code shown above, the function is simple enough that it doesn&#8217;t need to do this. But it is also a potential cost of calling functions.</p></li></ul><ul><li><p><strong>Destroying Stack Frame</strong>: As the function returns, it needs to destroy the stack frame and restore the stack to the state it was in before the call. This again incurs an extra set of instructions, as you can see in the assembly.</p></li></ul><ul><li><p><strong>Function Return</strong>: Finally, the <code>ret</code> instruction is required to return control from the function back to the caller.</p></li></ul><p>Apart from the extra work inside the called function, calling the function also requires extra instructions. 
The following figure shows a <code>main</code> function calling <code>compute</code>, and on the right hand side you can see the assembly code.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LfC3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc04115b-ce63-4287-b34b-64fdd73bbe3a_1354x702.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LfC3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc04115b-ce63-4287-b34b-64fdd73bbe3a_1354x702.png 424w, https://substackcdn.com/image/fetch/$s_!LfC3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc04115b-ce63-4287-b34b-64fdd73bbe3a_1354x702.png 848w, https://substackcdn.com/image/fetch/$s_!LfC3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc04115b-ce63-4287-b34b-64fdd73bbe3a_1354x702.png 1272w, https://substackcdn.com/image/fetch/$s_!LfC3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc04115b-ce63-4287-b34b-64fdd73bbe3a_1354x702.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LfC3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc04115b-ce63-4287-b34b-64fdd73bbe3a_1354x702.png" width="1354" height="702" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cc04115b-ce63-4287-b34b-64fdd73bbe3a_1354x702.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:702,&quot;width&quot;:1354,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:159401,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/164921281?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc04115b-ce63-4287-b34b-64fdd73bbe3a_1354x702.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LfC3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc04115b-ce63-4287-b34b-64fdd73bbe3a_1354x702.png 424w, https://substackcdn.com/image/fetch/$s_!LfC3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc04115b-ce63-4287-b34b-64fdd73bbe3a_1354x702.png 848w, https://substackcdn.com/image/fetch/$s_!LfC3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc04115b-ce63-4287-b34b-64fdd73bbe3a_1354x702.png 1272w, https://substackcdn.com/image/fetch/$s_!LfC3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc04115b-ce63-4287-b34b-64fdd73bbe3a_1354x702.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The extra instructions required to call a function</figcaption></figure></div><p>So, what are the overheads while making the function call?</p><ul><li><p><strong>Saving/Restoring Caller Saved Registers</strong>: The Sys V AMD64 ABI defines certain registers as caller saved registers, meaning that the caller needs to save these registers on the stack before making the function call so that the callee function can use these registers freely. In the above code for <code>main</code>, you can see it saving the values of <code>rax</code> and <code>rdx</code> on the stack using the <code>push</code> instruction. In this case, the compiler does not care about restoring them back, but usually you also need to restore them back after the function call returns by a corresponding <code>pop</code> instruction. 
So, you have two extra instructions per saved register.</p></li></ul><ul><li><p><strong>Setting up Function Arguments</strong>: Before invoking the function call the caller needs to set up the function call arguments. The Sys V AMD64 ABI designates certain registers in which these arguments can be passed. In the assembly code, I&#8217;ve highlighted the instructions which set up the registers.</p></li></ul><ul><li><p><strong>Calling the Function</strong>: Calling the function requires an extra <code>call</code> instruction. Compared to everything else this looks like a small cost, but nevertheless, when executing a function very frequently, it adds up.</p></li></ul><p>So, when you inline a function, you <strong>save all this extra work</strong> that the CPU needs to do each time the function is called. The following figure shows a version of the program where the <code>compute</code> function has been inlined in <code>main</code>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!go4W!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe59c12-c263-4d52-86a5-cd2677ed4fc9_856x620.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!go4W!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe59c12-c263-4d52-86a5-cd2677ed4fc9_856x620.png 424w, https://substackcdn.com/image/fetch/$s_!go4W!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe59c12-c263-4d52-86a5-cd2677ed4fc9_856x620.png 848w, 
https://substackcdn.com/image/fetch/$s_!go4W!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe59c12-c263-4d52-86a5-cd2677ed4fc9_856x620.png 1272w, https://substackcdn.com/image/fetch/$s_!go4W!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe59c12-c263-4d52-86a5-cd2677ed4fc9_856x620.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!go4W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe59c12-c263-4d52-86a5-cd2677ed4fc9_856x620.png" width="856" height="620" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2fe59c12-c263-4d52-86a5-cd2677ed4fc9_856x620.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:620,&quot;width&quot;:856,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:66518,&quot;alt&quot;:&quot;The assembly of the main function after inlining the compute function&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/164921281?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe59c12-c263-4d52-86a5-cd2677ed4fc9_856x620.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The assembly of the main function after inlining the compute function" title="The assembly of the main function after inlining the compute function" srcset="https://substackcdn.com/image/fetch/$s_!go4W!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe59c12-c263-4d52-86a5-cd2677ed4fc9_856x620.png 
424w, https://substackcdn.com/image/fetch/$s_!go4W!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe59c12-c263-4d52-86a5-cd2677ed4fc9_856x620.png 848w, https://substackcdn.com/image/fetch/$s_!go4W!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe59c12-c263-4d52-86a5-cd2677ed4fc9_856x620.png 1272w, https://substackcdn.com/image/fetch/$s_!go4W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe59c12-c263-4d52-86a5-cd2677ed4fc9_856x620.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The assembly of the main function after inlining the compute function</figcaption></figure></div><p>As you can see, inlining has streamlined the entire flow. The extra instructions are gone and only the core logic of the inlined function remains.</p><p>Now, it may look like an obvious performance win, but let&#8217;s analyze it from the perspective of the Iron Law.</p><h3>Impact on Program Instruction Count</h3><p>Function inlining reduces the dynamic instruction count of the program (number of instructions executed) in two ways. The first is direct, by avoiding the function call overhead, and the second is through compiler optimizations that get unlocked after inlining.</p><h4>Direct Reduction in Instruction Count</h4><ul><li><p>call/ret instruction elimination</p></li></ul><ul><li><p>stack frame setup/teardown elimination</p></li></ul><ul><li><p>function argument handling elimination</p></li></ul><ul><li><p>register save/restore elimination</p></li></ul><p>Even if we conservatively assume a saving of 5 instructions for inlining a function that is called one million times, we save the CPU from executing 5 million extra instructions. This also frees up CPU cycles for executing other, useful instructions.</p><h4>Context-Sensitive Optimizations</h4><p>Apart from eliminating function call overhead, inlining gives the compiler the opportunity to further optimize the inlined code because it has more information about the context in which the function is called.</p><p>For example, consider the following code (<a href="https://sbaziotis.com/compilers/common-misconceptions-about-compilers.html#inlining-is-useful-primarily-because-it-eliminates-a-call-instruction">source</a>).</p><pre><code>#include &lt;limits.h&gt;  // for INT_MAX and INT_MIN

int sat_div(int num, int denom) {
  if (denom == 0) {
    return (num &gt; 0) ? INT_MAX : INT_MIN;
  }
  return num / denom;
}

int foo(int a) {
  int b = sat_div(a, 3);
  return b;
}</code></pre><p>After inlining the <code>sat_div</code> function into <code>foo</code>, the compiler may simplify it to the following. It can do that because it knows that the second parameter is always 3 when called from <code>foo</code>, so the <code>denom == 0</code> branch can never be taken.</p><pre><code>int foo(int a) {
  // The generated code for this looks confusing because
  // the compiler has turned a division into a multiplication.
  return a / 3;
}</code></pre><p>These kinds of optimizations may not always be possible, but they are a potential positive outcome that can further reduce the number of instructions that the CPU needs to execute.</p><h3>Impact on IPC</h3><p>Again, the impact of function inlining on the program&#8217;s IPC is not direct; it comes through several indirect factors. Let&#8217;s see how.</p><h4><strong>Increased Register Pressure</strong></h4><p>Inlining large functions with many variables can increase the register pressure and cause a potential spill onto the stack. The compiler needs to make a complex decision considering things like register pressure, function size and other factors. So, it may choose not to inline a function if doing so is going to cause a spill. But, if you are manually inlining a function, or forcing the compiler to inline, then it is a potential factor that may impact the IPC of your program.</p><h4>Increased Code Size</h4><p>Inlining functions results in a larger code footprint because you are making copies of the function everywhere it is called. This increases instruction cache pressure and can cause more cache misses. Frequent instruction cache misses can starve the CPU backend of new instructions to execute, causing the IPC to drop.</p><h4>Instruction Level Parallelism</h4><p>Inlining functions can increase instruction level parallelism because it eliminates the branch introduced in the code due to the function call. It gives the CPU a larger window of instructions to analyze, which may enable it to find more work to do in parallel, thus improving the IPC.</p><h3>Function Inlining from the Lens of Iron Law</h3><p>Again, we can see that whether or not function inlining improves performance depends on several factors. But eventually, all of these factors feed into two metrics: the dynamic instruction count and the overall IPC of the program. 
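</p><p>To check which way these factors net out for a particular function, one option is to build an inlined and a non-inlined variant of the same code and compare their instruction counts and IPC. The sketch below does this for the <code>sat_div</code> example using the <code>always_inline</code> and <code>noinline</code> function attributes supported by GCC and Clang. The names <code>sat_div_inlined</code>, <code>sat_div_called</code>, <code>foo_inlined</code> and <code>foo_called</code> are made up for this illustration.</p><pre><code>#include &lt;limits.h&gt;  // for INT_MAX and INT_MIN

// Variant 1: ask the compiler to inline this function at every call site.
static inline __attribute__((always_inline))
int sat_div_inlined(int num, int denom) {
  if (denom == 0) {
    return (num &gt; 0) ? INT_MAX : INT_MIN;
  }
  return num / denom;
}

// Variant 2: same logic, but the compiler is told never to inline it,
// so every use pays the full call overhead.
__attribute__((noinline))
int sat_div_called(int num, int denom) {
  if (denom == 0) {
    return (num &gt; 0) ? INT_MAX : INT_MIN;
  }
  return num / denom;
}

// Hypothetical callers to benchmark against each other.
int foo_inlined(int a) { return sat_div_inlined(a, 3); }
int foo_called(int a) { return sat_div_called(a, 3); }</code></pre><p>Running each caller in a hot loop under a tool such as <code>perf stat</code>, which reports both instructions and cycles, should give you the two Iron Law terms for each variant.</p><p>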
The following table summarizes these tradeoffs.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rmb8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f764f-6395-4d4a-967d-b45491692181_1273x260.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rmb8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f764f-6395-4d4a-967d-b45491692181_1273x260.png 424w, https://substackcdn.com/image/fetch/$s_!rmb8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f764f-6395-4d4a-967d-b45491692181_1273x260.png 848w, https://substackcdn.com/image/fetch/$s_!rmb8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f764f-6395-4d4a-967d-b45491692181_1273x260.png 1272w, https://substackcdn.com/image/fetch/$s_!rmb8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f764f-6395-4d4a-967d-b45491692181_1273x260.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rmb8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f764f-6395-4d4a-967d-b45491692181_1273x260.png" width="1273" height="260" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/365f764f-6395-4d4a-967d-b45491692181_1273x260.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:260,&quot;width&quot;:1273,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:46805,&quot;alt&quot;:&quot;Analysis of function inlining and its tradeoffs from the lens of the Iron Law&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/164921281?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f764f-6395-4d4a-967d-b45491692181_1273x260.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Analysis of function inlining and its tradeoffs from the lens of the Iron Law" title="Analysis of function inlining and its tradeoffs from the lens of the Iron Law" srcset="https://substackcdn.com/image/fetch/$s_!rmb8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f764f-6395-4d4a-967d-b45491692181_1273x260.png 424w, https://substackcdn.com/image/fetch/$s_!rmb8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f764f-6395-4d4a-967d-b45491692181_1273x260.png 848w, https://substackcdn.com/image/fetch/$s_!rmb8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f764f-6395-4d4a-967d-b45491692181_1273x260.png 1272w, https://substackcdn.com/image/fetch/$s_!rmb8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F365f764f-6395-4d4a-967d-b45491692181_1273x260.png 1456w" 
sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Analysis of function inlining and its tradeoffs from the lens of the Iron Law</figcaption></figure></div><p>With loop unrolling and inlining covered, next we&#8217;ll see how SIMD vectorization affects instruction count, IPC, and even clock frequency.</p><div><hr></div><h2>SIMD Vectorization</h2><p>Single instruction multiple data (SIMD) is a widely used optimization technique when the algorithm performs the same operation on multiple data elements. This is particularly applicable in numeric computing, image processing and similar domains. It improves performance by significantly improving the IPC and lowering the dynamic instruction count, because the CPU is able to do more work in fewer instructions.</p><p>As an example, consider the following function and its assembly:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T4Xe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d3a6ee-58b3-47a9-a17a-17086816035a_1065x395.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T4Xe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d3a6ee-58b3-47a9-a17a-17086816035a_1065x395.png 424w, https://substackcdn.com/image/fetch/$s_!T4Xe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d3a6ee-58b3-47a9-a17a-17086816035a_1065x395.png 848w, https://substackcdn.com/image/fetch/$s_!T4Xe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d3a6ee-58b3-47a9-a17a-17086816035a_1065x395.png 
1272w, https://substackcdn.com/image/fetch/$s_!T4Xe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d3a6ee-58b3-47a9-a17a-17086816035a_1065x395.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T4Xe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d3a6ee-58b3-47a9-a17a-17086816035a_1065x395.png" width="1065" height="395" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e9d3a6ee-58b3-47a9-a17a-17086816035a_1065x395.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:395,&quot;width&quot;:1065,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:48330,&quot;alt&quot;:&quot;Left: C function to perform element wise addition of two float arrays and to store the result in a third array. Right: The assembly for the C function&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/164921281?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d3a6ee-58b3-47a9-a17a-17086816035a_1065x395.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Left: C function to perform element wise addition of two float arrays and to store the result in a third array. Right: The assembly for the C function" title="Left: C function to perform element wise addition of two float arrays and to store the result in a third array. 
Right: The assembly for the C function" srcset="https://substackcdn.com/image/fetch/$s_!T4Xe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d3a6ee-58b3-47a9-a17a-17086816035a_1065x395.png 424w, https://substackcdn.com/image/fetch/$s_!T4Xe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d3a6ee-58b3-47a9-a17a-17086816035a_1065x395.png 848w, https://substackcdn.com/image/fetch/$s_!T4Xe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d3a6ee-58b3-47a9-a17a-17086816035a_1065x395.png 1272w, https://substackcdn.com/image/fetch/$s_!T4Xe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d3a6ee-58b3-47a9-a17a-17086816035a_1065x395.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Left: C function to perform element wise addition of two float arrays and to store the result in a third array. Right: The assembly for the C function</figcaption></figure></div><p>The function is performing vector addition. I&#8217;ve used the <code>-O1</code> flag which prevents the compiler from vectorizing it. The label <code>.L3</code> contains the loop body. It executes 4 instructions per element of the vectors. If we manually vectorize the code, or use the <code>-O3</code> optimization flag, we get the following assembly output which uses SIMD instructions.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!npf7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f0f488-2421-4165-97cc-b88294564c8c_983x2195.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!npf7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f0f488-2421-4165-97cc-b88294564c8c_983x2195.png 424w, https://substackcdn.com/image/fetch/$s_!npf7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f0f488-2421-4165-97cc-b88294564c8c_983x2195.png 848w, 
https://substackcdn.com/image/fetch/$s_!npf7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f0f488-2421-4165-97cc-b88294564c8c_983x2195.png 1272w, https://substackcdn.com/image/fetch/$s_!npf7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f0f488-2421-4165-97cc-b88294564c8c_983x2195.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!npf7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f0f488-2421-4165-97cc-b88294564c8c_983x2195.png" width="983" height="2195" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/13f0f488-2421-4165-97cc-b88294564c8c_983x2195.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2195,&quot;width&quot;:983,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:286870,&quot;alt&quot;:&quot;The SIMD optimized version of the same vector addition function&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/164921281?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f0f488-2421-4165-97cc-b88294564c8c_983x2195.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The SIMD optimized version of the same vector addition function" title="The SIMD optimized version of the same vector addition function" srcset="https://substackcdn.com/image/fetch/$s_!npf7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f0f488-2421-4165-97cc-b88294564c8c_983x2195.png 424w, 
https://substackcdn.com/image/fetch/$s_!npf7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f0f488-2421-4165-97cc-b88294564c8c_983x2195.png 848w, https://substackcdn.com/image/fetch/$s_!npf7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f0f488-2421-4165-97cc-b88294564c8c_983x2195.png 1272w, https://substackcdn.com/image/fetch/$s_!npf7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f0f488-2421-4165-97cc-b88294564c8c_983x2195.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The SIMD optimized version of the same vector addition function</figcaption></figure></div><p>Here, the compiler has generated SIMD instructions like <code>vmovups</code> and <code>vaddps</code> which can process 8 float values (256 bits) in a single instruction. It means that the SIMD version of the loop can process 8 elements in just 3 instructions. Compare it to the non-vectorized version shown previously that was executing 4 instructions per array element.</p><p>Now, let&#8217;s analyze how SIMD vectorization impacts program performance from the lens of the Iron Law.</p><h3>Impact on Instruction Count</h3><p>If you look at the vectorized assembly output above, it may appear as if the number of instructions has significantly increased. However, what matters is how many instructions the CPU eventually executes.</p><p>Typically, for large datasets, the number of instructions executed by the CPU drops significantly when SIMD is used, because each instruction operates on multiple data points simultaneously.</p><p>In the example above, the scalar version of the loop executed roughly four instructions per element: two loads, one addition, and one store. In contrast, the vectorized version performed the same work using just three SIMD instructions, each operating on eight elements at once. This reduces the instruction count per element from 4 to 3/8 = 0.375, yielding a theoretical reduction of about 10.7&#215; in instruction count within the vectorized loop.</p><p>So, in general, vectorization results in a massive decrease in the number of instructions executed by the CPU.</p><h3>Impact on IPC</h3><p>Let&#8217;s analyze how SIMD instructions impact the IPC of programs.</p><h4>Increased Code Size and its Effects on Instruction Cache</h4><p>Vectorization increases the overall code footprint. 
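</p><p>To make the source of this growth concrete, here is a minimal sketch of a hand-vectorized vector addition using AVX intrinsics. It is not the code the compiler generated in the figure above. The function name <code>vec_add_avx</code> is hypothetical, the <code>target("avx")</code> attribute is a GCC/Clang feature used here so the file builds without extra flags, and the code assumes a CPU with AVX support. Notice that it needs two loops: a main loop that handles 8 floats per iteration, and a scalar loop for whatever is left over.</p><pre><code>#include &lt;assert.h&gt;
#include &lt;immintrin.h&gt;  // AVX intrinsics
#include &lt;stddef.h&gt;

// Element-wise out[i] = a[i] + b[i] for n floats.
__attribute__((target("avx")))
void vec_add_avx(const float *a, const float *b, float *out, size_t n) {
  assert(n == 0 || (a != NULL &amp;&amp; b != NULL &amp;&amp; out != NULL));

  size_t i = 0;

  // Main loop: two unaligned loads, one add and one unaligned store
  // per 8 elements (256 bits).
  for (; i + 8 &lt;= n; i += 8) {
    __m256 va = _mm256_loadu_ps(a + i);
    __m256 vb = _mm256_loadu_ps(b + i);
    _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
  }

  // Tail loop: the remaining n % 8 elements are processed one at a time.
  for (; i &lt; n; i++) {
    out[i] = a[i] + b[i];
  }
}</code></pre><p>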
One reason for the larger footprint is the complexity of the vectorized code itself, including the extra instructions for processing the leftover elements.</p><p>Apart from that, on x86, SIMD instructions tend to be longer than scalar ones (due to decades of backward-compatible extensions). When code is heavily vectorized, this increased instruction size leads to a larger static code footprint.</p><p>Also, some SIMD instructions have higher latency and can take a few cycles to produce their results. This means that dependent instructions have to wait longer and the overall instruction throughput is lower. The trick to overcome this is to manually unroll the loop so that several independent operations are in flight at once. Again, this means even larger code size.</p><p>In summary, vectorization inflates the static code footprint. This expanded footprint can evict hot code from the L1 instruction cache, causing front-end stalls and lowering IPC.</p><blockquote><p><em><strong>Note</strong>: While most tight vector loops will easily fit within a 32&#8211;64&#8239;KB I-cache, it is a tradeoff worth being aware of.</em></p></blockquote><h4>Impact on Instruction Fetch and Decode</h4><p>On x86, many SIMD instructions, especially AVX and AVX-512, are longer and more complex to decode than typical scalar instructions. This reduces how many instructions the CPU frontend can fetch and decode per cycle, lowering IPC.</p><p>Many modern x86 processors can fetch 16 bytes and decode up to 4 instructions per cycle (the exact figures vary by microarchitecture). With 3-byte scalar instructions, that&#8217;s enough bandwidth for peak throughput. But with 7-byte AVX-512 instructions, only 2 can be fetched per cycle, and decode throughput can drop to 1&#8211;2 instructions per cycle.</p><p>To mitigate this, CPUs use a &#956;op cache to store decoded instructions. During loop execution, this often hides the decode bottleneck. 
But for small inputs, where the loop runs only briefly, this overhead can dominate and the drop in IPC can negate any performance gains from vectorization.</p><h4>Register Pressure</h4><p>Vectorized code usually consists of a much larger set of instructions. As a result, the vectorized loop uses more general-purpose registers, which increases register pressure.</p><p>Apart from that, sometimes the loop itself has to be unrolled to account for the higher latency of some SIMD instructions. This requires more SIMD registers, of which there is only a limited number.</p><p>If vectorization results in register spills, the frequent memory accesses can decrease the overall IPC because of stalls while waiting for memory loads to return. If this becomes a dominant factor, it may eat up the gains provided by SIMD instructions.</p><blockquote><p><em><strong>Note</strong>: With AVX-512 having 32 vector registers, the chances of a spill are slim, but it is worth mentioning this factor for full coverage of all the tradeoffs.</em></p></blockquote><h3>Impact on Clock Cycle Time</h3><p>Remember that the Iron Law has three terms: dynamic instruction count, IPC and the CPU clock cycle time. So far, all the optimizations we discussed affected only the first two factors, but SIMD instructions impact the clock frequency as well.</p><p>On some x86 processors, certain SIMD instructions (AVX2 and AVX-512) cause the CPU to dynamically scale down its clock frequency to keep power consumption and temperature under control. In other words, they increase the clock cycle time (negatively impacting performance).</p><p>From the Iron Law's perspective, vectorization lowers the dynamic instruction count, but also increases the clock cycle time. This results in an interesting tradeoff where the wins are not obvious.</p><p>SIMD wins are clearest when most work is vectorized, but in mixed workloads the reduced clock speed can hurt performance. 
For example, <a href="https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/">Cloudflare saw a 10% drop in web-server throughput</a> (requests served per second) when using vectorized hashing algorithms. They found that the CPU spent only 2.5% of the time in vectorized code, and the remaining time in scalar execution. In this case, the increased cycle times outweighed the SIMD gains&#8212;Iron Law in action.</p><h3>Analyzing SIMD from the Lens of Iron Law</h3><p>In summary, SIMD vectorization can slash instruction count and boost IPC, but it also increases cache footprint, register pressure, and may slow the clock. Here&#8217;s how these factors map to the Iron Law metrics:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!A4Sa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6d2fa7e-d56e-452b-a3e2-2c1445949240_1145x260.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!A4Sa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6d2fa7e-d56e-452b-a3e2-2c1445949240_1145x260.png 424w, https://substackcdn.com/image/fetch/$s_!A4Sa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6d2fa7e-d56e-452b-a3e2-2c1445949240_1145x260.png 848w, https://substackcdn.com/image/fetch/$s_!A4Sa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6d2fa7e-d56e-452b-a3e2-2c1445949240_1145x260.png 1272w, 
https://substackcdn.com/image/fetch/$s_!A4Sa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6d2fa7e-d56e-452b-a3e2-2c1445949240_1145x260.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!A4Sa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6d2fa7e-d56e-452b-a3e2-2c1445949240_1145x260.png" width="1145" height="260" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d6d2fa7e-d56e-452b-a3e2-2c1445949240_1145x260.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:260,&quot;width&quot;:1145,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43754,&quot;alt&quot;:&quot;Tradeoffs of SIMD vectorization from the lens of the Iron Law&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/164921281?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6d2fa7e-d56e-452b-a3e2-2c1445949240_1145x260.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Tradeoffs of SIMD vectorization from the lens of the Iron Law" title="Tradeoffs of SIMD vectorization from the lens of the Iron Law" srcset="https://substackcdn.com/image/fetch/$s_!A4Sa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6d2fa7e-d56e-452b-a3e2-2c1445949240_1145x260.png 424w, https://substackcdn.com/image/fetch/$s_!A4Sa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6d2fa7e-d56e-452b-a3e2-2c1445949240_1145x260.png 848w, 
https://substackcdn.com/image/fetch/$s_!A4Sa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6d2fa7e-d56e-452b-a3e2-2c1445949240_1145x260.png 1272w, https://substackcdn.com/image/fetch/$s_!A4Sa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6d2fa7e-d56e-452b-a3e2-2c1445949240_1145x260.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Tradeoffs of SIMD vectorization from the lens of the Iron Law</figcaption></figure></div><div><hr></div><h2>Optimizing Branch Mispredictions</h2><p>Branch mispredictions are a common performance bottleneck. When the CPU predicts a branch incorrectly, it must discard the speculatively executed instructions and start fetching from the correct address. This flush costs around 15 to 20 cycles, and sometimes more depending on the pipeline depth. These stalls hurt the CPU&#8217;s instruction throughput and directly lower the IPC.</p><p>To understand why prediction is needed in the first place, consider how the CPU executes instructions. The backend, which performs the actual computation, is fast and parallel. But it depends entirely on the frontend to deliver a steady stream of decoded instructions. This pipeline works well when code is linear. But when there is a conditional branch, the next instruction depends on the result of a previous comparison instruction. Waiting for that result introduces bubbles in the pipeline and wastes cycles.</p><p>To avoid this delay, the CPU speculatively picks a direction using its branch predictor and fetches instructions along that path. If the guess is correct, the pipeline stays full and performance remains high. If the guess is wrong, the CPU has to roll back the speculative work, fetch the correct instructions, and refill the pipeline. 
This flushing not only delays execution but also wastes frontend bandwidth.</p><p>Even though modern branch predictors are highly accurate, often over 95 percent, some branches are unpredictable by nature. Others suffer because the predictor's limited buffer space gets overloaded in large, complex programs.</p><p>To see how costly this can be, imagine a loop with one million iterations and a branch that is mispredicted five percent of the time. That is 50,000 mispredictions, and at a penalty of 20 cycles per miss, it adds up to one million wasted cycles. That is enough to wipe out the benefit of most other optimizations.</p><p>Let&#8217;s now analyze how optimizing branches affects performance using the Iron Law framework.</p><h3>Impact on Instruction Count</h3><p>The exact impact on the dynamic instruction count depends on the specific optimization and how it changes the code execution. For example, you can reorder the conditions in a loop so that the most predictable and frequently executed branch comes first, which reduces branch misses. In this case, the overall number of executed instructions may drop.</p><p>Sometimes you may replace a branch with branchless logic using bitwise operations or other techniques. This usually results in more instructions but eliminates the branch and, with it, the possibility of a misprediction. So, you may end up with a higher number of instructions executed but save the misprediction penalties.</p><p>The point is that whether or not you see a significant improvement in performance largely depends on the IPC factor. Let&#8217;s analyze that.</p><h3>Impact on IPC</h3><p>Branch optimizations can impact the IPC in a variety of ways. </p><h4>Reduced Pipeline Stalls and Flushes</h4><p>When done well, branch optimizations reduce branch misses, which means fewer pipeline stalls and flushes, better instruction delivery to the backend, and an improved IPC. 
</p><h4>Impact on ILP</h4><p>Another aspect to consider with branch optimizations is their impact on instruction level parallelism.</p><p>Often, branchless implementations require more instructions than their branching counterparts and introduce serial dependencies. For example, consider the following function for conditionally swapping two values.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nk4m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4ab4b0-15b5-4134-85c2-67814c4a2850_1148x271.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nk4m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4ab4b0-15b5-4134-85c2-67814c4a2850_1148x271.png 424w, https://substackcdn.com/image/fetch/$s_!nk4m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4ab4b0-15b5-4134-85c2-67814c4a2850_1148x271.png 848w, https://substackcdn.com/image/fetch/$s_!nk4m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4ab4b0-15b5-4134-85c2-67814c4a2850_1148x271.png 1272w, https://substackcdn.com/image/fetch/$s_!nk4m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4ab4b0-15b5-4134-85c2-67814c4a2850_1148x271.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nk4m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4ab4b0-15b5-4134-85c2-67814c4a2850_1148x271.png" width="1148" height="271" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8c4ab4b0-15b5-4134-85c2-67814c4a2850_1148x271.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:271,&quot;width&quot;:1148,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43136,&quot;alt&quot;:&quot;C function and its assembly for conditionally swapping two values&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/164921281?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4ab4b0-15b5-4134-85c2-67814c4a2850_1148x271.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="C function and its assembly for conditionally swapping two values" title="C function and its assembly for conditionally swapping two values" srcset="https://substackcdn.com/image/fetch/$s_!nk4m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4ab4b0-15b5-4134-85c2-67814c4a2850_1148x271.png 424w, https://substackcdn.com/image/fetch/$s_!nk4m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4ab4b0-15b5-4134-85c2-67814c4a2850_1148x271.png 848w, https://substackcdn.com/image/fetch/$s_!nk4m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4ab4b0-15b5-4134-85c2-67814c4a2850_1148x271.png 1272w, https://substackcdn.com/image/fetch/$s_!nk4m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4ab4b0-15b5-4134-85c2-67814c4a2850_1148x271.png 1456w" sizes="100vw" 
loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">C function and its assembly for conditionally swapping two values</figcaption></figure></div><p>And the following figure shows a branchless way of implementing the same function along with its assembly output.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hijV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552f080e-cb28-48ba-90ce-3b33337fc939_1380x347.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hijV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552f080e-cb28-48ba-90ce-3b33337fc939_1380x347.png 424w, https://substackcdn.com/image/fetch/$s_!hijV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552f080e-cb28-48ba-90ce-3b33337fc939_1380x347.png 848w, https://substackcdn.com/image/fetch/$s_!hijV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552f080e-cb28-48ba-90ce-3b33337fc939_1380x347.png 1272w, https://substackcdn.com/image/fetch/$s_!hijV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552f080e-cb28-48ba-90ce-3b33337fc939_1380x347.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hijV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552f080e-cb28-48ba-90ce-3b33337fc939_1380x347.png" width="1380" height="347" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/552f080e-cb28-48ba-90ce-3b33337fc939_1380x347.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:347,&quot;width&quot;:1380,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:77996,&quot;alt&quot;:&quot;A branchless way of implementing conditional swap&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/164921281?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552f080e-cb28-48ba-90ce-3b33337fc939_1380x347.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A branchless way of implementing conditional swap" title="A branchless way of implementing conditional swap" srcset="https://substackcdn.com/image/fetch/$s_!hijV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552f080e-cb28-48ba-90ce-3b33337fc939_1380x347.png 424w, https://substackcdn.com/image/fetch/$s_!hijV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552f080e-cb28-48ba-90ce-3b33337fc939_1380x347.png 848w, https://substackcdn.com/image/fetch/$s_!hijV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552f080e-cb28-48ba-90ce-3b33337fc939_1380x347.png 1272w, https://substackcdn.com/image/fetch/$s_!hijV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552f080e-cb28-48ba-90ce-3b33337fc939_1380x347.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft 
pc-display-flex pc-gap-8 pc-reset"></div></div></div></a><figcaption class="image-caption">A branchless way of implementing conditional swap</figcaption></figure></div><p>You can clearly see that the branchless version has more instructions. </p><p>But the problematic part is that the branchless version results in dependent instructions, which lowers the ILP. In the example above, the compiler has used <code>cmov</code> instructions, which conditionally copy a value depending on the result of a condition. These instructions sidestep the branch predictor, but they introduce data dependencies. 
Any subsequent instruction that depends on a result produced by a prior <code>cmov</code> instruction has to wait until the <code>cmov</code> instruction has finished. This lowers the potential ILP.</p><p>The following figure shows another example: a branchless function that computes the maximum of two integers. Again, you can see that most instructions read and write the <code>eax</code> register, creating a dependency chain among them. The CPU will have to execute these sequentially, resulting in significantly lower ILP.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!otWx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4459bdd-cd1a-41e4-9d2d-c3f0ff971ced_768x220.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!otWx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4459bdd-cd1a-41e4-9d2d-c3f0ff971ced_768x220.png 424w, https://substackcdn.com/image/fetch/$s_!otWx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4459bdd-cd1a-41e4-9d2d-c3f0ff971ced_768x220.png 848w, https://substackcdn.com/image/fetch/$s_!otWx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4459bdd-cd1a-41e4-9d2d-c3f0ff971ced_768x220.png 1272w, https://substackcdn.com/image/fetch/$s_!otWx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4459bdd-cd1a-41e4-9d2d-c3f0ff971ced_768x220.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!otWx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4459bdd-cd1a-41e4-9d2d-c3f0ff971ced_768x220.png" width="768" height="220" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c4459bdd-cd1a-41e4-9d2d-c3f0ff971ced_768x220.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:220,&quot;width&quot;:768,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:31449,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/164921281?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4459bdd-cd1a-41e4-9d2d-c3f0ff971ced_768x220.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!otWx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4459bdd-cd1a-41e4-9d2d-c3f0ff971ced_768x220.png 424w, https://substackcdn.com/image/fetch/$s_!otWx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4459bdd-cd1a-41e4-9d2d-c3f0ff971ced_768x220.png 848w, https://substackcdn.com/image/fetch/$s_!otWx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4459bdd-cd1a-41e4-9d2d-c3f0ff971ced_768x220.png 1272w, https://substackcdn.com/image/fetch/$s_!otWx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4459bdd-cd1a-41e4-9d2d-c3f0ff971ced_768x220.png 1456w" sizes="100vw" 
loading="lazy"></picture><div></div></div></a></figure></div><p>So, the moral of the story is that <strong>branchless code comes at the cost of reduced ILP (and thus lower IPC).</strong> It is only worth doing when the branch predictor is exhibiting a high number of misses for that code. Otherwise, removing a branch that is highly predictable can backfire because the drop in IPC will dominate everything else. </p><h3>Analyzing Branch Optimization from the Lens of Iron Law</h3><p>Branch optimization is tricky to get right, but its performance gains are relatively easy to reason about from the lens of the Iron Law. Here&#8217;s a concise overview:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NOC1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F174f5c42-c7ae-4832-8b00-f82a8a232fc9_1098x212.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NOC1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F174f5c42-c7ae-4832-8b00-f82a8a232fc9_1098x212.png 424w, https://substackcdn.com/image/fetch/$s_!NOC1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F174f5c42-c7ae-4832-8b00-f82a8a232fc9_1098x212.png 848w, https://substackcdn.com/image/fetch/$s_!NOC1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F174f5c42-c7ae-4832-8b00-f82a8a232fc9_1098x212.png 1272w, 
https://substackcdn.com/image/fetch/$s_!NOC1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F174f5c42-c7ae-4832-8b00-f82a8a232fc9_1098x212.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NOC1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F174f5c42-c7ae-4832-8b00-f82a8a232fc9_1098x212.png" width="1098" height="212" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/174f5c42-c7ae-4832-8b00-f82a8a232fc9_1098x212.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:212,&quot;width&quot;:1098,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:25569,&quot;alt&quot;:&quot;Analysing branch optimizations from the lens of the Iron Law&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/164921281?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F174f5c42-c7ae-4832-8b00-f82a8a232fc9_1098x212.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Analysing branch optimizations from the lens of the Iron Law" title="Analysing branch optimizations from the lens of the Iron Law" srcset="https://substackcdn.com/image/fetch/$s_!NOC1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F174f5c42-c7ae-4832-8b00-f82a8a232fc9_1098x212.png 424w, https://substackcdn.com/image/fetch/$s_!NOC1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F174f5c42-c7ae-4832-8b00-f82a8a232fc9_1098x212.png 848w, 
https://substackcdn.com/image/fetch/$s_!NOC1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F174f5c42-c7ae-4832-8b00-f82a8a232fc9_1098x212.png 1272w, https://substackcdn.com/image/fetch/$s_!NOC1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F174f5c42-c7ae-4832-8b00-f82a8a232fc9_1098x212.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Analysing branch optimizations from the lens of the Iron Law</figcaption></figure></div><div><hr></div><h2>Analyzing Cache Misses from the Lens of Iron Law</h2><p>The final optimization we&#8217;ll discuss is minimizing cache misses. I&#8217;ll keep this section brief. The goal is to complete the picture and demonstrate that the Iron Law applies to almost every low-level optimization you might attempt.</p><p>Data cache misses can significantly reduce IPC. When an instruction misses the cache, it may have to wait hundreds of cycles for the data to arrive from main memory. During this time, it occupies CPU resources such as a reservation station slot, a reorder buffer slot, and a physical register. Any instructions that depend on the result of this instruction cannot proceed either. They remain in the reorder buffer until the data becomes available.</p><blockquote><p><em>If terms like reorder buffer and reservation station are new to you, I suggest reading my article on the <a href="https://blog.codingconfessions.com/p/simultaneous-multithreading">microarchitecture of SMT processors</a>, which touches on these details.</em></p></blockquote><p>When enough of these long-latency instructions accumulate, they create backpressure in the backend. Reservation stations fill up, the reorder buffer approaches capacity, and eventually the frontend is forced to stop issuing new instructions. 
This limits parallelism and reduces the number of instructions executed per cycle.</p><p>Optimizations like structure padding, blocking, data layout transformations, and software prefetching aim to reduce miss rates and improve IPC. However, these often increase instruction count (both static and dynamic) because of added pointer arithmetic, bounds checks, or loop restructuring. The tradeoff is clear: does the reduction in stalls outweigh the overhead of extra instructions?</p><p>We will not go into further detail here, but the pattern should feel familiar by now.</p><div><hr></div><h2>Conclusion</h2><p>We began the article with a bold claim: that one law can explain all low-level code optimizations. After walking through multiple examples, that claim feels much less dramatic.</p><p>Most low-level optimizations shift one or more of the Iron Law variables: instruction count, IPC, or clock-cycle time. We often focus on isolated effects, like reducing cache misses or branches, and get confused when performance doesn&#8217;t improve. But that confusion disappears once we look at the bigger picture.</p><p>The Iron Law gives us a way to step back and see the full picture. It helps us reason about trade-offs clearly, without relying on guesswork. When combined with tools like <code>perf</code>, which show instruction counts, IPC, and backend stalls directly, it becomes easier to understand not just whether something changed, but whether it changed in the right direction.</p><p>This model doesn&#8217;t apply to every kind of optimization. It won&#8217;t help with algorithmic improvements or garbage collection tuning. But for the kind of low-level performance work that many developers struggle to reason about, it gives a clear lens.</p><p>So next time you&#8217;re tuning code, ask yourself which of the three metrics you&#8217;re moving. 
You might find that one law really does explain more than you expected.</p>]]></content:encoded></item><item><title><![CDATA[Debugging X86-64 Assembly with GDB]]></title><description><![CDATA[Watch now (20 mins) | Learn how to inspect registers, step through instructions, and investigate crashes using GDB.]]></description><link>https://blog.codingconfessions.com/p/debugging-x86-64-assembly-with-gdb</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/debugging-x86-64-assembly-with-gdb</guid><dc:creator><![CDATA[Abhinav 
Upadhyay]]></dc:creator><pubDate>Mon, 26 May 2025 18:13:24 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/164500692/60f99365639d093d80c55eaedd27f503.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>We ended the last article with a minimal x86-64 assembly program that assembled and ran, but then crashed with a segmentation fault. Before we move on to fix that properly, this is a good opportunity to step back and understand how to debug such issues.</p><p>In this part, we'll use <code>gdb</code> to investigate what exactly went wrong. You'll learn how to step through your program instruction by instruction, inspect memory and register values, and get a better sense of how the CPU executes your code. No new assembly instructions yet, just the tools to understand what you're building.</p><blockquote><p>If you haven&#8217;t seen the previous articles in the series, here&#8217;s what you have missed:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;d7582c4b-e475-4057-a4bd-04e281c76388&quot;,&quot;caption&quot;:&quot;&#8220;Do not try to bend the spoon. That's impossible. Instead, only try to realize the truth... there is no spoon.&#8221; &#8212; The Matrix&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Understanding Computer Organization from First Principles&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. 
I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-04-05T17:54:52.832Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9c6b6f3-e65a-46be-ada9-68a166fbfcf8_1024x1536.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/seeing-the-matrix&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:160249113,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:105,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;6722fdc9-a75e-4481-bd38-0b51b631be3b&quot;,&quot;caption&quot;:&quot;We wrapped up the X86-64 assembly course last week, and I&#8217;ll be sharing notes from the sessions here as a series of articles. While the live sessions covered much more ground, I think you&#8217;ll find these write-ups valuable in their own right. 
I&#8217;ll be publishing them gradually over the next few weeks.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The System-Level Foundation of Assembly&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-05-05T08:36:27.146Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/77cbf196-9752-4ec4-a24e-0ad8e8124cb3_627x551.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/the-system-level-foundation-of-assembly&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:162823255,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:24,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" 
data-attrs="{&quot;nodeId&quot;:&quot;d49b743f-992c-4973-a36c-d2e1521e0a54&quot;,&quot;caption&quot;:&quot;In our previous article, we explored how computers work from transistors up to program execution. We saw how digital circuits built from logic gates perform calculations using binary data, and how the ALU executes operations on this binary representation.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Binary Arithmetic and Bitwise Operations for Systems Programming&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-04-12T05:16:14.645Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/binary-arithmetic-and-bitwise-operations&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:161089202,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:27,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code 
Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;2df7daf2-d043-4abe-89db-30a6b70f83ef&quot;,&quot;caption&quot;:&quot;Introduction&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Building (and Breaking) Your First X86 Assembly Program&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the 
surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-05-16T14:33:57.354Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc44d6555-a755-40ae-b180-dfccbddcaad2_1024x1024.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/building-and-breaking-your-first&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:160056784,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:8,&quot;comment_count&quot;:3,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div></blockquote><div><hr></div><p><em><strong>This article is part of a paid subscriber series.</strong><br>If you&#8217;re enjoying the content, please consider upgrading to a paid plan to unlock the rest of this series. Paid subscribers also get access to recordings of past live sessions, early and discounted access to courses and books, and more.</em></p><p><em>Alternatively, you can purchase an ebook version of this series. 
(If you're already a paid subscriber, email me for a discounted link.)</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://codingconfessions.gumroad.com/l/ychdk&quot;,&quot;text&quot;:&quot;I Want the PDF&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://codingconfessions.gumroad.com/l/ychdk"><span>I Want the PDF</span></a></p><div><hr></div><h2>Quick Recap</h2><p>Let&#8217;s start with a short recap of where we left off in the previous article.</p><p>We learned enough assembly to write the following program:</p><pre><code><code># create the text section
.section .text

# mark the _start label as globally visible for linking
.globl _start

_start:
  movq $32, %rdi    # rdi = 32
  movq $10, %rsi    # rsi = 10
  addq %rsi, %rdi   # rdi = rdi + rsi = 42</code></code></pre><p>Then we assembled and ran it as follows:</p><pre><code><code>as -o false.o false.s
ld -o false false.o
./false
Segmentation fault (core dumped)</code></code></pre><p>The program crashes with a segmentation fault, and our goal is to find out why. We will use the GNU debugger (<code>gdb</code>) to do this.</p><h2>Assembling with Debug Symbols</h2><p>To debug the program using gdb, the program binary needs to include the debug symbol table so that the debugger is able to show us the source code as we step through it. By default, the assembler and compiler leave debug symbols out of the binary because they increase its size.</p><p>So first, we need to reassemble our program with debug symbols. To do that with the GNU assembler, we need to use the <code>-gstabs</code> option, as shown below.</p><pre><code><code>as -o false.o -gstabs false.s</code></code></pre><p>The linking step remains the same. If the object file produced by the assembler contains the debug symbol tables, the linker will copy them into the final executable without any extra flags.</p><pre><code><code>ld -o false false.o</code></code></pre><h2>Debugging with GDB</h2><p>The simplest way to debug a program with gdb is to start the program under gdb, as shown below.</p><pre><code><code>gdb ./false</code></code></pre><p>Once you do that, you will see the gdb prompt, as in the screenshot below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NndE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F066afb67-7424-4fc5-9b77-c683d4d16f6c_916x599.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!NndE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F066afb67-7424-4fc5-9b77-c683d4d16f6c_916x599.png 424w, https://substackcdn.com/image/fetch/$s_!NndE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F066afb67-7424-4fc5-9b77-c683d4d16f6c_916x599.png 848w, https://substackcdn.com/image/fetch/$s_!NndE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F066afb67-7424-4fc5-9b77-c683d4d16f6c_916x599.png 1272w, https://substackcdn.com/image/fetch/$s_!NndE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F066afb67-7424-4fc5-9b77-c683d4d16f6c_916x599.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NndE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F066afb67-7424-4fc5-9b77-c683d4d16f6c_916x599.png" width="916" height="599" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/066afb67-7424-4fc5-9b77-c683d4d16f6c_916x599.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:599,&quot;width&quot;:916,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:89999,&quot;alt&quot;:&quot;The GDB prompt&quot;,&quot;title&quot;:&quot;The GDB 
prompt&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/161279838?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F066afb67-7424-4fc5-9b77-c683d4d16f6c_916x599.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The GDB prompt" title="The GDB prompt" srcset="https://substackcdn.com/image/fetch/$s_!NndE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F066afb67-7424-4fc5-9b77-c683d4d16f6c_916x599.png 424w, https://substackcdn.com/image/fetch/$s_!NndE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F066afb67-7424-4fc5-9b77-c683d4d16f6c_916x599.png 848w, https://substackcdn.com/image/fetch/$s_!NndE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F066afb67-7424-4fc5-9b77-c683d4d16f6c_916x599.png 1272w, https://substackcdn.com/image/fetch/$s_!NndE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F066afb67-7424-4fc5-9b77-c683d4d16f6c_916x599.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The GDB prompt</figcaption></figure></div><p>The <code>(gdb)</code> text is the GDB prompt, where we enter debugging commands. The first command we will learn sets a breakpoint.</p><h3>Setting and Hitting Breakpoints</h3>
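<p>As a rough sketch of what a first breakpoint session on our <code>false</code> binary might look like, the commands below can be typed at the <code>(gdb)</code> prompt. They are standard GDB commands, but the exact sequence is an illustration rather than a transcript of a real run, and what GDB prints in response will vary with its version and your system.</p><pre><code><code># ask gdb to pause execution when it reaches the _start label
break _start
# start the program; it should stop once the breakpoint is hit
run
# execute exactly one machine instruction, then pause again
stepi
# print the current values of the two registers our program uses
info registers rdi rsi
# resume execution and let the program run on to the crash
continue</code></code></pre><p>Using <code>stepi</code> rather than <code>step</code> keeps us at the granularity of single machine instructions, which is what we want when the source itself is assembly.</p>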
      <p>
          <a href="https://blog.codingconfessions.com/p/debugging-x86-64-assembly-with-gdb">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Building (and Breaking) Your First X86 Assembly Program]]></title><description><![CDATA[We build a minimal X86 assembly program, run it&#8230; and hit a crash. But that crash is exactly what makes this program worth writing.]]></description><link>https://blog.codingconfessions.com/p/building-and-breaking-your-first</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/building-and-breaking-your-first</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Fri, 16 May 2025 14:33:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xvcB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc44d6555-a755-40ae-b180-dfccbddcaad2_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Introduction</h2><blockquote><p><strong>Recap from the previous article:</strong> In the last couple of articles, we built a simple but complete mental model of how a basic computer executes instructions. We explored how an ALU performs arithmetic operations, how registers serve as fast-access storage, and how the control unit fetches and executes instructions stored in memory. We also introduced the structure of assembly programs, the use of labels, and how data and instructions are laid out in different sections like <code>.text</code> and <code>.data</code>. That article provided the conceptual foundation we need to now dive into real X86-64 assembly code.</p></blockquote><p>This article is part of my series on the basics of X86-64 assembly programming. Until now, we have been working mostly with ideas. We talked about what it means for a computer to execute a program, how computation is carried out by hardware, and how memory is laid out to store data and instructions. We have seen snippets of assembly here and there, but we haven&#8217;t written a full program yet. 
That changes now.</p><p>In this article, we will write our first complete (well, almost) assembly program. It won&#8217;t do anything exciting, but that&#8217;s the point. Like &#8220;Hello, world&#8221; in high-level languages, this program is just a vehicle to help us understand the mechanics of how an assembly program is written, assembled, linked, and executed. Along the way, we&#8217;ll revisit some of the concepts we&#8217;ve discussed before and see how they manifest in actual code.</p><blockquote><p>If you haven&#8217;t read the previous articles in this series, here&#8217;s what you have missed:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;6fbd0d96-76da-4809-9486-049d65f12cca&quot;,&quot;caption&quot;:&quot;&#8220;Do not try to bend the spoon. That's impossible. Instead, only try to realize the truth... there is no spoon.&#8221; &#8212; The Matrix&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Seeing the Matrix: A First-Principles Approach to Computer Architecture&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. 
I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-04-05T17:54:52.832Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9c6b6f3-e65a-46be-ada9-68a166fbfcf8_1024x1536.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/seeing-the-matrix&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:160249113,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:39,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;6bab7f60-e99d-4fc0-a2f0-1fbab0d71b88&quot;,&quot;caption&quot;:&quot;In this article, we&#8217;ll trace how the structure of assembly programs is shaped by the expectations of the hardware&#8217;s execution model. 
We&#8217;ll follow this causal path through the operating system and the ELF file format, and explain why assembly programs are written in terms of sections and labels, like the one shown in the figure below (don&#8217;t worry if it looks alien, I promise that by the end of the article you will understand what each line of this code is doing).&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The System-Level Foundation of Assembly&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-05-05T08:36:27.146Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/77cbf196-9752-4ec4-a24e-0ad8e8124cb3_627x551.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/the-system-level-foundation-of-assembly&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:162823255,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:23,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code 
Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;0018b8cc-e03b-4a49-b316-93ad6a63d6de&quot;,&quot;caption&quot;:&quot;In our previous article, we explored how computers work from transistors up to program execution. We saw how digital circuits built from logic gates perform calculations using binary data, and how the ALU executes operations on this binary representation.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Binary Arithmetic and Bitwise Operations for Systems Programming&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I&#8217;m a systems programmer with a focus on performance engineering, compilers, and OS internals. 
I break down CPUs, interpreters, and runtime mechanics &#8212; with deep dives that go beyond the surface-level.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-04-12T05:16:14.645Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/binary-arithmetic-and-bitwise-operations&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:161089202,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:26,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div></blockquote><div><hr></div><p><em><strong>This article is part of a paid subscriber series.</strong><br>If you&#8217;re enjoying the content, please consider upgrading to a paid plan to unlock the rest of this series. Paid subscribers also get access to recordings of past live sessions, early and discounted access to courses and books, and more.</em></p><p><em>Alternatively, you can purchase an ebook version of this series. 
(If you're already a paid subscriber, email me for a discounted link.)</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://codingconfessions.gumroad.com/l/ychdk&quot;,&quot;text&quot;:&quot;I Want the PDF&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://codingconfessions.gumroad.com/l/ychdk"><span>I Want the PDF</span></a></p>
      <p>
          <a href="https://blog.codingconfessions.com/p/building-and-breaking-your-first">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[The System-Level Foundation of Assembly]]></title><description><![CDATA[Tracing how the CPU, OS, and ELF format shape the structure of your assembly code]]></description><link>https://blog.codingconfessions.com/p/the-system-level-foundation-of-assembly</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/the-system-level-foundation-of-assembly</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Mon, 05 May 2025 08:36:27 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/77cbf196-9752-4ec4-a24e-0ad8e8124cb3_627x551.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div><hr></div><p><em>We wrapped up the <a href="https://blog.codingconfessions.com/p/course-launch-hands-on-introduction">X86-64 assembly course</a> last week, and I&#8217;ll be sharing notes from the sessions here as a series of articles. While the live sessions covered much more ground, I think you&#8217;ll find these write-ups valuable in their own right. I&#8217;ll be publishing them gradually over the next few weeks. </em></p><p><em>I&#8217;ll announce the next run of the course soon. Paid subscribers get early access and a discount, so upgrade now if you&#8217;d like to reserve your spot.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.codingconfessions.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.codingconfessions.com/subscribe?"><span>Subscribe now</span></a></p><p>I&#8217;m also compiling this article series into a cleanly typeset PDF edition, available for purchase below. 
If you're a paid subscriber, email me for a discount code.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://codingconfessions.gumroad.com/l/ychdk&quot;,&quot;text&quot;:&quot;Get the Ebook&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://codingconfessions.gumroad.com/l/ychdk"><span>Get the Ebook</span></a></p><div><hr></div><p>In the <a href="https://blog.codingconfessions.com/p/seeing-the-matrix">previous article</a>, we looked at how a CPU executes instructions by fetching them from memory, decoding them, executing them, and repeating the cycle. This gave us a foundational understanding of how the hardware runs a program.</p><p>Now we can start writing some assembly programs, but jumping in immediately would mean skipping a few important abstraction layers, and that might leave some gaps in our understanding.</p><p>Specifically, an assembly program is translated into an executable binary by the assembler and the linker. That binary is then loaded into memory by the operating system so that the CPU can begin executing it. So there&#8217;s quite a bit that happens between writing assembly and running it on real hardware.</p><p>The operating system must load the executable into memory in a layout that follows the hardware&#8217;s execution model: instructions must be contiguous in memory, and instructions and data must be kept separate. To enable this layout, the executable binary must reflect this structure. Since the binary is generated from the assembly source files, the assembly program must also follow a structure that matches. All of this is linked.</p><p>In this article, we&#8217;ll trace how the structure of assembly programs is shaped by the expectations of the hardware&#8217;s execution model. 
We&#8217;ll follow this causal path through the operating system and the <a href="https://en.wikipedia.org/wiki/Executable_and_Linkable_Format">ELF file format</a>, and explain why assembly programs are written in terms of sections and labels, like the one shown in the figure below (don&#8217;t worry if it looks alien, I promise that by the end of the article you will understand what each line of this code is doing).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!W6Xb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff633e54b-5d0a-4c1f-9517-8ac593517d8d_1017x316.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!W6Xb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff633e54b-5d0a-4c1f-9517-8ac593517d8d_1017x316.png 424w, https://substackcdn.com/image/fetch/$s_!W6Xb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff633e54b-5d0a-4c1f-9517-8ac593517d8d_1017x316.png 848w, https://substackcdn.com/image/fetch/$s_!W6Xb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff633e54b-5d0a-4c1f-9517-8ac593517d8d_1017x316.png 1272w, https://substackcdn.com/image/fetch/$s_!W6Xb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff633e54b-5d0a-4c1f-9517-8ac593517d8d_1017x316.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!W6Xb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff633e54b-5d0a-4c1f-9517-8ac593517d8d_1017x316.png" width="1017" height="316" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f633e54b-5d0a-4c1f-9517-8ac593517d8d_1017x316.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:316,&quot;width&quot;:1017,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:74076,&quot;alt&quot;:&quot;The skeletal structure of an X86 assembly program&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/162823255?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff633e54b-5d0a-4c1f-9517-8ac593517d8d_1017x316.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The skeletal structure of an X86 assembly program" title="The skeletal structure of an X86 assembly program" srcset="https://substackcdn.com/image/fetch/$s_!W6Xb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff633e54b-5d0a-4c1f-9517-8ac593517d8d_1017x316.png 424w, https://substackcdn.com/image/fetch/$s_!W6Xb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff633e54b-5d0a-4c1f-9517-8ac593517d8d_1017x316.png 848w, https://substackcdn.com/image/fetch/$s_!W6Xb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff633e54b-5d0a-4c1f-9517-8ac593517d8d_1017x316.png 1272w, 
https://substackcdn.com/image/fetch/$s_!W6Xb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff633e54b-5d0a-4c1f-9517-8ac593517d8d_1017x316.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure-1: The skeletal structure of an X86 assembly program</figcaption></figure></div><p><strong>By the end of the article, we will learn the following:</strong></p><ul><li><p>How the operating system structures a process's memory layout and why that structure is required</p></li><li><p>How ELF executables are structured and why they follow 
that layout</p></li><li><p>What sections and labels are in assembly programming and why you need to use them</p></li></ul><div><hr></div><h2>Recap: Hardware's Instruction Execution Model</h2><p>Let's start with a quick recap of the hardware's instruction execution cycle, which we covered in detail in the previous article. This background is the foundation for everything we will learn in this article.</p><ul><li><p>The processor has a special register called the instruction pointer register that contains the address of the instruction that the processor has to execute next.</p></li><li><p>The processor also has a component called the control unit that is responsible for orchestrating the execution of program instructions on the hardware. The control unit begins by fetching the instruction located at the memory address held in the instruction pointer register.</p></li><li><p>The control unit then decodes the instruction to identify the opcode and the operands. For example, the opcode may be <code>add</code> and the operands could be the registers <code>R8</code> and <code>R9</code>.</p></li><li><p>Next, the control unit sends control signals to the register file and execution units to execute the instruction.</p></li><li><p>After this, the instruction pointer register gets incremented by the size of the current instruction so that it now has the address of the next instruction.</p></li><li><p>And this cycle repeats.</p></li></ul><p>This is the hardware execution model. To enable the hardware to execute your code, you must load your program&#8217;s instructions and data into memory and then update the instruction pointer with the address of the first instruction of your program.</p><p>Fortunately, we don&#8217;t need to do this ourselves; the operating system does it for us. How the operating system does this is directly shaped by the hardware&#8217;s expectations. 
Let&#8217;s see how the operating system sets up a new process for execution.</p><h2>Process Setup and Memory Layout</h2><p>The hardware instruction execution model puts certain requirements on the OS when creating a new process:</p><ul><li><p>The instruction pointer must contain the address of the first instruction of the program in memory.</p></li><li><p>The subsequent instructions must be stored contiguously in memory, because after the first instruction the hardware simply increments the instruction pointer to find the address of the next one. This model requires all instructions to be placed next to each other in memory. </p><ul><li><p>Figure-2 shows this visually: by simply incrementing the address in the instruction pointer, the CPU can advance through your program and execute it.</p></li></ul></li><li><p>A third, often unstated, requirement comes from security concerns: code and data must be kept separate in memory. The hardware itself doesn&#8217;t distinguish between them, since both are just bytes in memory, so an attacker could feed a program input that plants harmful instructions in its data. If the CPU then executes that data as code, it could lead to serious exploits. 
To prevent this, it's essential that:</p><ul><li><p>Instructions are stored in a memory region with read and execute<strong> </strong>permissions, but not write</p></li><li><p>Data is stored in a separate region with read and write permissions, but not execute &#8212; except in special cases like JIT compilers</p></li></ul></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gblH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6ed540-4984-4bd3-9a38-c30f64a8b9b6_1126x442.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gblH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6ed540-4984-4bd3-9a38-c30f64a8b9b6_1126x442.png 424w, https://substackcdn.com/image/fetch/$s_!gblH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6ed540-4984-4bd3-9a38-c30f64a8b9b6_1126x442.png 848w, https://substackcdn.com/image/fetch/$s_!gblH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6ed540-4984-4bd3-9a38-c30f64a8b9b6_1126x442.png 1272w, https://substackcdn.com/image/fetch/$s_!gblH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6ed540-4984-4bd3-9a38-c30f64a8b9b6_1126x442.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gblH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6ed540-4984-4bd3-9a38-c30f64a8b9b6_1126x442.png" width="1126" height="442" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/df6ed540-4984-4bd3-9a38-c30f64a8b9b6_1126x442.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:442,&quot;width&quot;:1126,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:46471,&quot;alt&quot;:&quot;Figure: Instructions are stored contiguously in memory. The instruction pointer register advances through these instructions by incrementing its value after each instruction is executed.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/162823255?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6ed540-4984-4bd3-9a38-c30f64a8b9b6_1126x442.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure: Instructions are stored contiguously in memory. The instruction pointer register advances through these instructions by incrementing its value after each instruction is executed." title="Figure: Instructions are stored contiguously in memory. The instruction pointer register advances through these instructions by incrementing its value after each instruction is executed." 
srcset="https://substackcdn.com/image/fetch/$s_!gblH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6ed540-4984-4bd3-9a38-c30f64a8b9b6_1126x442.png 424w, https://substackcdn.com/image/fetch/$s_!gblH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6ed540-4984-4bd3-9a38-c30f64a8b9b6_1126x442.png 848w, https://substackcdn.com/image/fetch/$s_!gblH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6ed540-4984-4bd3-9a38-c30f64a8b9b6_1126x442.png 1272w, https://substackcdn.com/image/fetch/$s_!gblH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6ed540-4984-4bd3-9a38-c30f64a8b9b6_1126x442.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><strong>Figure-2</strong>: Instructions are stored contiguously in memory. The instruction pointer register advances through these instructions by incrementing its value after each instruction is executed.</figcaption></figure></div><p>So, when the OS creates a new process, it organizes the address space layout of the process to satisfy these requirements. The following diagram shows this layout: the address space is split into distinct memory segments, such as <code>.text</code> and <code>.data</code>, which hold the program&#8217;s instructions and static data respectively.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fTcN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94eecbd2-b91d-4c5c-bbab-41ce1ffb865a_580x966.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fTcN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94eecbd2-b91d-4c5c-bbab-41ce1ffb865a_580x966.png 424w, https://substackcdn.com/image/fetch/$s_!fTcN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94eecbd2-b91d-4c5c-bbab-41ce1ffb865a_580x966.png 848w, 
https://substackcdn.com/image/fetch/$s_!fTcN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94eecbd2-b91d-4c5c-bbab-41ce1ffb865a_580x966.png 1272w, https://substackcdn.com/image/fetch/$s_!fTcN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94eecbd2-b91d-4c5c-bbab-41ce1ffb865a_580x966.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fTcN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94eecbd2-b91d-4c5c-bbab-41ce1ffb865a_580x966.png" width="580" height="966" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/94eecbd2-b91d-4c5c-bbab-41ce1ffb865a_580x966.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:966,&quot;width&quot;:580,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure-3: The address space layout of a process. Each box represents a distinct region of memory called a segment. Common segments are .text, .data, .bss, stack, heap. &quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure-3: The address space layout of a process. Each box represents a distinct region of memory called a segment. Common segments are .text, .data, .bss, stack, heap. " title="Figure-3: The address space layout of a process. Each box represents a distinct region of memory called a segment. Common segments are .text, .data, .bss, stack, heap. "
srcset="https://substackcdn.com/image/fetch/$s_!fTcN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94eecbd2-b91d-4c5c-bbab-41ce1ffb865a_580x966.png 424w, https://substackcdn.com/image/fetch/$s_!fTcN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94eecbd2-b91d-4c5c-bbab-41ce1ffb865a_580x966.png 848w, https://substackcdn.com/image/fetch/$s_!fTcN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94eecbd2-b91d-4c5c-bbab-41ce1ffb865a_580x966.png 1272w, https://substackcdn.com/image/fetch/$s_!fTcN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94eecbd2-b91d-4c5c-bbab-41ce1ffb865a_580x966.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure-3: The address space layout of a process. Each box represents a distinct region of memory called a segment. Common segments are .text, .data, .bss, stack, heap. </figcaption></figure></div><blockquote><p><em>You may have seen this memory layout of a process in many articles and books, but few explain the reason behind it. Now you know!</em></p></blockquote><p>Each of these segments appears as part of a single virtual address space, but at the physical level they may be mapped to different regions of physical memory. Each segment also has the appropriate protection bits set, so that the hardware doesn't end up fetching and executing instructions from a segment that holds data.</p><p>But creating these segments is not enough; they also need to be populated. To do this, the operating system reads the program's executable binary and fills the <code>.text</code> segment with the code and the <code>.data</code> segment with the statically initialized data. </p><p>However, to do this efficiently, the assembler and linker must generate the executable in a format that supports fast loading. On Linux and most other Unix-like systems, this format is the Executable and Linkable Format (ELF), and it is designed to support fast loading of program data into the process memory. Let&#8217;s see what this format looks like from the inside.</p><h2>Understanding the ELF Executable Format</h2>
      <p>
          <a href="https://blog.codingconfessions.com/p/the-system-level-foundation-of-assembly">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Binary Arithmetic and Bitwise Operations for Systems Programming]]></title><description><![CDATA[Understand how computers represent numbers and perform operations at the bit level before diving into assembly]]></description><link>https://blog.codingconfessions.com/p/binary-arithmetic-and-bitwise-operations</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/binary-arithmetic-and-bitwise-operations</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Sat, 12 Apr 2025 05:16:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!EY6-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EY6-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EY6-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png 424w, https://substackcdn.com/image/fetch/$s_!EY6-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png 848w, https://substackcdn.com/image/fetch/$s_!EY6-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png 1272w, 
https://substackcdn.com/image/fetch/$s_!EY6-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EY6-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png" width="714" height="483" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:483,&quot;width&quot;:714,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:479944,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/161089202?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EY6-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png 424w, https://substackcdn.com/image/fetch/$s_!EY6-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png 848w, https://substackcdn.com/image/fetch/$s_!EY6-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png 
1272w, https://substackcdn.com/image/fetch/$s_!EY6-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2107a8d5-63b6-411e-ae05-f9d7498e52cb_714x483.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>In our <a href="https://blog.codingconfessions.com/p/seeing-the-matrix">previous article</a>, we explored how computers work from transistors up to program execution. 
We saw how digital circuits built from logic gates perform calculations using binary data, and how the ALU executes operations on this binary representation.</p><p>Now we&#8217;ll dive deeper into the binary number system. When writing assembly code, you&#8217;ll directly manipulate bits in registers and perform calculations at the bit level. For that, you need to understand exactly how the processor interprets patterns of 1s and 0s.</p><p>This article covers four key areas:</p><ol><li><p><strong>Number systems</strong>: How binary and hexadecimal work, and why we use them in low-level programming</p></li><li><p><strong>Binary arithmetic</strong>: How computers add and subtract, and detect conditions like overflow</p></li><li><p><strong>Two&#8217;s complement</strong>: How the hardware represents negative numbers</p></li><li><p><strong>Bitwise operations</strong>: Bit manipulation techniques used throughout systems programming</p></li></ol><p>These concepts appear repeatedly in assembly programming, from register manipulation to optimized algorithms. By developing an intuition for binary operations, you&#8217;ll gain deeper insight into how processors work and how to write efficient assembly code.</p><div><hr></div><h4><em>Read it in PDF Form</em></h4><p><em>This article is part of the material I&#8217;m developing for my <a href="https://blog.codingconfessions.com/p/course-launch-hands-on-introduction">X86 assembly course</a>, which I am also putting together in PDF form. 
If you are a paid subscriber you can claim it for free using the discount code in the email header (or ask me for the code).</em> </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://codingconfessions.gumroad.com/l/ychdk&quot;,&quot;text&quot;:&quot;Get PDF&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://codingconfessions.gumroad.com/l/ychdk"><span>Get PDF</span></a></p><div><hr></div><h2>Number Systems: Decimal, Binary, and Hexadecimal</h2><h3>The Decimal System: Our Familiar Base-10</h3><p>The decimal system is so natural to us that we rarely think about why we use it. It has 10 distinct digits (0-9), and the position of each digit represents a power of 10:</p><p>For example, the following figure shows the representation of the decimal number 4,729</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rx5H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d6e229-cd95-4497-9fe7-41f897b6714f_856x65.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rx5H!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d6e229-cd95-4497-9fe7-41f897b6714f_856x65.png 424w, https://substackcdn.com/image/fetch/$s_!rx5H!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d6e229-cd95-4497-9fe7-41f897b6714f_856x65.png 848w, https://substackcdn.com/image/fetch/$s_!rx5H!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d6e229-cd95-4497-9fe7-41f897b6714f_856x65.png 1272w, 
https://substackcdn.com/image/fetch/$s_!rx5H!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d6e229-cd95-4497-9fe7-41f897b6714f_856x65.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rx5H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d6e229-cd95-4497-9fe7-41f897b6714f_856x65.png" width="856" height="65" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/87d6e229-cd95-4497-9fe7-41f897b6714f_856x65.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:65,&quot;width&quot;:856,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7787,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/161089202?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d6e229-cd95-4497-9fe7-41f897b6714f_856x65.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rx5H!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d6e229-cd95-4497-9fe7-41f897b6714f_856x65.png 424w, https://substackcdn.com/image/fetch/$s_!rx5H!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d6e229-cd95-4497-9fe7-41f897b6714f_856x65.png 848w, https://substackcdn.com/image/fetch/$s_!rx5H!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d6e229-cd95-4497-9fe7-41f897b6714f_856x65.png 1272w, 
https://substackcdn.com/image/fetch/$s_!rx5H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d6e229-cd95-4497-9fe7-41f897b6714f_856x65.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Decimal representation of the value 4,729</figcaption></figure></div><p>This system is called &#8220;base-10&#8221; or &#8220;radix-10&#8221; because it uses 10 as its base value for positional notation.</p><h3>The Binary System: Base-2</h3><p>Computers use the binary system (base-2) because digital circuits have two stable states. As we saw in our previous article, transistors function as switches that are either on or off, which naturally maps to 1 and 0. The logic gates built from these transistors process binary data, making binary the native language of all digital hardware.</p><p>In binary, each position represents a power of 2 rather than a power of 10:</p><p>For example, the following figure shows the broken down representation of the binary number <code>1011</code></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rzMv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e134bfb-1afb-4d4b-bc69-b754ef80bc68_884x63.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rzMv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e134bfb-1afb-4d4b-bc69-b754ef80bc68_884x63.png 424w, https://substackcdn.com/image/fetch/$s_!rzMv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e134bfb-1afb-4d4b-bc69-b754ef80bc68_884x63.png 848w, 
https://substackcdn.com/image/fetch/$s_!rzMv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e134bfb-1afb-4d4b-bc69-b754ef80bc68_884x63.png 1272w, https://substackcdn.com/image/fetch/$s_!rzMv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e134bfb-1afb-4d4b-bc69-b754ef80bc68_884x63.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rzMv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e134bfb-1afb-4d4b-bc69-b754ef80bc68_884x63.png" width="884" height="63" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2e134bfb-1afb-4d4b-bc69-b754ef80bc68_884x63.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:63,&quot;width&quot;:884,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Binary representation of the value 1011&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Binary representation of the value 1011" title="Binary representation of the value 1011" srcset="https://substackcdn.com/image/fetch/$s_!rzMv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e134bfb-1afb-4d4b-bc69-b754ef80bc68_884x63.png 424w, https://substackcdn.com/image/fetch/$s_!rzMv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e134bfb-1afb-4d4b-bc69-b754ef80bc68_884x63.png 848w, 
https://substackcdn.com/image/fetch/$s_!rzMv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e134bfb-1afb-4d4b-bc69-b754ef80bc68_884x63.png 1272w, https://substackcdn.com/image/fetch/$s_!rzMv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e134bfb-1afb-4d4b-bc69-b754ef80bc68_884x63.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Binary representation of the value 1011</figcaption></figure></div><p>Which equals 8 + 0 + 2 + 1 = 11 in decimal.</p><p>Converting from decimal to binary involves repeatedly dividing by 2 and tracking the remainders. Let&#8217;s convert 27 to binary:</p><pre><code><code>
27 &#247; 2 = 13 remainder 1 (least significant bit)

13 &#247; 2 = 6 remainder 1

6 &#247; 2 = 3 remainder 0

3 &#247; 2 = 1 remainder 1

1 &#247; 2 = 0 remainder 1 (most significant bit)
</code></code></pre><p>Reading the remainders from bottom to top: 11011, which is indeed 27 in binary.</p><h3>Bit and Byte Terminology</h3><p>When working with binary data, we need precise terminology to refer to specific portions of the data. The following terms are essential in assembly programming:</p><h4>Most Significant Bit (MSB) and Least Significant Bit (LSB)</h4><p>Binary numbers have two "ends" that are particularly important:</p><p><strong>Least Significant Bit (LSB)</strong>: This is the rightmost bit in a binary number. It represents the 2^0 (1) position and contributes the smallest value to the total. The LSB tells us whether the number is odd or even (1 = odd, 0 = even).</p><p><strong>Most Significant Bit (MSB)</strong>: This is the leftmost bit in a binary number. It represents the highest power of 2 in the value and contributes the largest amount to the total.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qhMJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd802eb71-eede-453a-9029-a7bb8f0c1d27_611x201.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qhMJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd802eb71-eede-453a-9029-a7bb8f0c1d27_611x201.png 424w, https://substackcdn.com/image/fetch/$s_!qhMJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd802eb71-eede-453a-9029-a7bb8f0c1d27_611x201.png 848w, https://substackcdn.com/image/fetch/$s_!qhMJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd802eb71-eede-453a-9029-a7bb8f0c1d27_611x201.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qhMJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd802eb71-eede-453a-9029-a7bb8f0c1d27_611x201.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qhMJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd802eb71-eede-453a-9029-a7bb8f0c1d27_611x201.png" width="611" height="201" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d802eb71-eede-453a-9029-a7bb8f0c1d27_611x201.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:201,&quot;width&quot;:611,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:15175,&quot;alt&quot;:&quot;The MSB is the bit at the highest bit position in the value, while the LSB is the bit at the 0th position. The figure highlights them in an 8-bit value.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/161089202?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd802eb71-eede-453a-9029-a7bb8f0c1d27_611x201.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The MSB is the bit at the highest bit position in the value, while the LSB is the bit at the 0th position. The figure highlights them in an 8-bit value." title="The MSB is the bit at the highest bit position in the value, while the LSB is the bit at the 0th position. The figure highlights them in an 8-bit value." 
srcset="https://substackcdn.com/image/fetch/$s_!qhMJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd802eb71-eede-453a-9029-a7bb8f0c1d27_611x201.png 424w, https://substackcdn.com/image/fetch/$s_!qhMJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd802eb71-eede-453a-9029-a7bb8f0c1d27_611x201.png 848w, https://substackcdn.com/image/fetch/$s_!qhMJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd802eb71-eede-453a-9029-a7bb8f0c1d27_611x201.png 1272w, https://substackcdn.com/image/fetch/$s_!qhMJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd802eb71-eede-453a-9029-a7bb8f0c1d27_611x201.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The MSB is the bit at the highest bit position in the value, while the LSB is the bit at the 0th position. 
The figure highlights them in an 8-bit value.</figcaption></figure></div><h4>Most Significant Byte (MSB) and Least Significant Byte (LSB)</h4><p>When working with multi-byte values (like 16-bit, 32-bit, or 64-bit numbers), we also need terminology for the bytes themselves:</p><p><strong>Least Significant Byte (LSB)</strong>: This is the byte containing the least significant bits of a multi-byte value.</p><p><strong>Most Significant Byte (MSB)</strong>: This is the byte containing the most significant bits.</p><p>For example, the 16-bit hexadecimal value 0x4A3F consists of two bytes:</p><p>- The MSB is 0x4A</p><p>- The LSB is 0x3F</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aWYU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5986835-6bd3-40f6-9497-fa2ae79720e1_465x140.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aWYU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5986835-6bd3-40f6-9497-fa2ae79720e1_465x140.png 424w, https://substackcdn.com/image/fetch/$s_!aWYU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5986835-6bd3-40f6-9497-fa2ae79720e1_465x140.png 848w, https://substackcdn.com/image/fetch/$s_!aWYU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5986835-6bd3-40f6-9497-fa2ae79720e1_465x140.png 1272w, https://substackcdn.com/image/fetch/$s_!aWYU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5986835-6bd3-40f6-9497-fa2ae79720e1_465x140.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!aWYU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5986835-6bd3-40f6-9497-fa2ae79720e1_465x140.png" width="465" height="140" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a5986835-6bd3-40f6-9497-fa2ae79720e1_465x140.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:140,&quot;width&quot;:465,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:8440,&quot;alt&quot;:&quot;In multi-byte numbers the MSB is the topmost byte, while the LSB is the bottommost byte. The figure highlights MSB and LSB in a two-byte value&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/161089202?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5986835-6bd3-40f6-9497-fa2ae79720e1_465x140.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="In multi-byte numbers the MSB is the topmost byte, while the LSB is the bottommost byte. The figure highlights MSB and LSB in a two-byte value" title="In multi-byte numbers the MSB is the topmost byte, while the LSB is the bottommost byte. 
The figure highlights MSB and LSB in a two-byte value" srcset="https://substackcdn.com/image/fetch/$s_!aWYU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5986835-6bd3-40f6-9497-fa2ae79720e1_465x140.png 424w, https://substackcdn.com/image/fetch/$s_!aWYU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5986835-6bd3-40f6-9497-fa2ae79720e1_465x140.png 848w, https://substackcdn.com/image/fetch/$s_!aWYU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5986835-6bd3-40f6-9497-fa2ae79720e1_465x140.png 1272w, https://substackcdn.com/image/fetch/$s_!aWYU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5986835-6bd3-40f6-9497-fa2ae79720e1_465x140.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">In multi-byte numbers the MSB is the topmost byte, while the LSB is the bottommost byte. The figure highlights MSB and LSB in a two-byte value</figcaption></figure></div><blockquote><p><strong>Note about byte order</strong>: When multi-byte values are stored in memory, the order of the bytes becomes important. Different computer architectures may store the bytes in different orders (most significant byte first or least significant byte first). This concept, called "endianness," will become relevant when we discuss memory operations in later parts.</p></blockquote><h3>The Hexadecimal System: Base-16</h3><p>Binary representation quickly becomes unwieldy when dealing with larger numbers. A 32-bit value would require 32 binary digits, which is difficult to read and prone to error. This is where the hexadecimal system (base-16) becomes valuable.</p><p>Hexadecimal uses 16 symbols: 0-9 and A-F (where A=10, B=11, &#8230;, F=15). 
Each hexadecimal digit represents exactly 4 binary digits (a &#8220;nibble&#8221;), making conversion between binary and hexadecimal straightforward.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7oar!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de07b41-34a6-474f-908d-73fa625ae409_383x524.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7oar!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de07b41-34a6-474f-908d-73fa625ae409_383x524.png 424w, https://substackcdn.com/image/fetch/$s_!7oar!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de07b41-34a6-474f-908d-73fa625ae409_383x524.png 848w, https://substackcdn.com/image/fetch/$s_!7oar!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de07b41-34a6-474f-908d-73fa625ae409_383x524.png 1272w, https://substackcdn.com/image/fetch/$s_!7oar!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de07b41-34a6-474f-908d-73fa625ae409_383x524.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7oar!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de07b41-34a6-474f-908d-73fa625ae409_383x524.png" width="383" height="524" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1de07b41-34a6-474f-908d-73fa625ae409_383x524.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:524,&quot;width&quot;:383,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:34042,&quot;alt&quot;:&quot;Table showing the binary and hexadecimal representation for numbers from 0 to 15&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/161089202?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de07b41-34a6-474f-908d-73fa625ae409_383x524.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Table showing the binary and hexadecimal representation for numbers from 0 to 15" title="Table showing the binary and hexadecimal representation for numbers from 0 to 15" srcset="https://substackcdn.com/image/fetch/$s_!7oar!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de07b41-34a6-474f-908d-73fa625ae409_383x524.png 424w, https://substackcdn.com/image/fetch/$s_!7oar!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de07b41-34a6-474f-908d-73fa625ae409_383x524.png 848w, https://substackcdn.com/image/fetch/$s_!7oar!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de07b41-34a6-474f-908d-73fa625ae409_383x524.png 1272w, https://substackcdn.com/image/fetch/$s_!7oar!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1de07b41-34a6-474f-908d-73fa625ae409_383x524.png 1456w" 
sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Table showing the binary and hexadecimal representation for numbers from 0 to 15</figcaption></figure></div><p>This compact representation makes hexadecimal particularly useful for expressing binary values. For example, the binary number <code>1011010010011110</code> can be more compactly written as <code>0xB49E</code> in hexadecimal. 
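</p><p>As a rough illustration of the nibble-by-nibble conversion described above, here is a minimal Python sketch that turns a string of binary digits into hexadecimal. The function name <code>binary_to_hex</code> is hypothetical and chosen only for this example; in everyday code, Python&#8217;s built-in <code>hex(int(bits, 2))</code> gives an equivalent result in lowercase.</p><pre><code>def binary_to_hex(bits):
    """Convert a string of binary digits to hexadecimal, one nibble at a time."""
    # Pad on the left so the length is a multiple of 4 (one nibble per hex digit).
    padded = bits.zfill((len(bits) + 3) // 4 * 4)
    digits = "0123456789ABCDEF"
    hex_digits = []
    for i in range(0, len(padded), 4):
        nibble = padded[i:i + 4]
        # Each 4-bit group has a value from 0 to 15, which indexes one hex symbol.
        hex_digits.append(digits[int(nibble, 2)])
    return "0x" + "".join(hex_digits)


print(binary_to_hex("1011010010011110"))  # prints 0xB49E
</code></pre><p>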
The &#8220;<code>0x</code>&#8221; prefix is a common notation indicating a hexadecimal number.</p><p>To convert this hexadecimal value to decimal, multiply each digit by the power of 16 for its position:</p><ul><li><p>B: 11 &#215; 16&#179; = 11 &#215; 4096 = 45056</p></li><li><p>4: 4 &#215; 16&#178; = 4 &#215; 256 = 1024</p></li><li><p>9: 9 &#215; 16&#185; = 9 &#215; 16 = 144</p></li><li><p>E: 14 &#215; 16&#8304; = 14 &#215; 1 = 14</p></li></ul><p>Adding these values: 45056 + 1024 + 144 + 14 = 46238.</p><h3>Why These Number Systems Matter in Assembly</h3><p>When working with assembly language, you&#8217;ll constantly use all three number systems:</p><ol><li><p><strong>Binary</strong> is the processor&#8217;s native language. All data in memory and registers is stored in binary, and understanding this representation makes it easier to manipulate that data.</p></li><li><p><strong>Decimal</strong> is useful for human-friendly values and calculations.</p></li><li><p><strong>Hexadecimal</strong> serves as the standard representation for memory addresses because it is more compact and easier to read than binary.</p></li></ol><p>In assembly, you&#8217;ll typically express values in decimal or hexadecimal (the exact literal and comment syntax depends on the assembler):</p><pre><code><code>
10 # Decimal 10

0Ah # Hexadecimal A (decimal 10)

0x0A # Alternative hexadecimal notation
</code></code></pre><div><hr></div><h2>Binary Arithmetic</h2><p>Having explored how numbers are represented in binary, let&#8217;s now look at how computers perform calculations on these binary values. </p><h3>Binary Addition</h3><p>Binary addition follows similar rules to decimal addition, but with only two digits:</p><pre><code>0 + 0 = 0
0 + 1 = 1
1 + 0 = 1
1 + 1 = 0 with a carry of 1</code></pre><p>Let&#8217;s add the binary numbers <code>1011</code> (11 in decimal) and <code>101</code> (5 in decimal):<code> </code></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kRnZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc28f16-e2e0-4d48-b449-7ae8294c696e_289x164.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kRnZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc28f16-e2e0-4d48-b449-7ae8294c696e_289x164.png 424w, https://substackcdn.com/image/fetch/$s_!kRnZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc28f16-e2e0-4d48-b449-7ae8294c696e_289x164.png 848w, https://substackcdn.com/image/fetch/$s_!kRnZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc28f16-e2e0-4d48-b449-7ae8294c696e_289x164.png 1272w, https://substackcdn.com/image/fetch/$s_!kRnZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc28f16-e2e0-4d48-b449-7ae8294c696e_289x164.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kRnZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc28f16-e2e0-4d48-b449-7ae8294c696e_289x164.png" width="289" height="164" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3dc28f16-e2e0-4d48-b449-7ae8294c696e_289x164.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:164,&quot;width&quot;:289,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6001,&quot;alt&quot;:&quot;Binary addition of 1011 and 101. The result is 10000&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/161089202?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc28f16-e2e0-4d48-b449-7ae8294c696e_289x164.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Binary addition of 1011 and 101. The result is 10000" title="Binary addition of 1011 and 101. The result is 10000" srcset="https://substackcdn.com/image/fetch/$s_!kRnZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc28f16-e2e0-4d48-b449-7ae8294c696e_289x164.png 424w, https://substackcdn.com/image/fetch/$s_!kRnZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc28f16-e2e0-4d48-b449-7ae8294c696e_289x164.png 848w, https://substackcdn.com/image/fetch/$s_!kRnZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc28f16-e2e0-4d48-b449-7ae8294c696e_289x164.png 1272w, https://substackcdn.com/image/fetch/$s_!kRnZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc28f16-e2e0-4d48-b449-7ae8294c696e_289x164.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption 
class="image-caption">Binary addition of 1011 and 101. The result is 10000</figcaption></figure></div><p>The result, <code>10000</code>, is 16 in decimal, which is 11 + 5.</p><p>This calculation mirrors exactly what happens in the ALU&#8217;s adder circuit we examined previously. The carry bits generated during this process are physically propagated through the full adder circuits chained together to handle multi-bit addition.</p><h3>The Processor&#8217;s Status Flags</h3><p>To manage the results of operations, processors maintain a set of status flags that indicate various conditions. These flags are stored in a special register called the status register or flags register.</p><p>Four particularly important flags are:</p><ol><li><p><strong>Carry Flag (CF)</strong>: Set when an unsigned arithmetic operation produces a carry out of, or a borrow into, the most significant bit</p></li><li><p><strong>Zero Flag (ZF)</strong>: Set when an operation produces a result of zero</p></li><li><p><strong>Sign Flag (SF)</strong>: Set when the most significant bit of the result is 1, which indicates a negative value when the result is interpreted as a signed number</p></li><li><p><strong>Overflow Flag (OF)</strong>: Set when a signed arithmetic operation produces a result outside the representable range</p></li></ol><p>These flags are automatically updated after most arithmetic and logical operations. They&#8217;re crucial for implementing control flow in assembly code, as they allow the program to make decisions based on the results of calculations.</p><h3>Understanding the Carry Flag</h3><p>Because registers have a fixed width, an arithmetic operation can produce a value that is too big to fit in the destination register; that is, the operation generates a carry out of the most significant bit.</p><p>To track these carries, the processor sets the carry bit in the flags register (in x86-64, the register name is <code>rflags</code>). The carry flag comes in handy in several situations. 
Let&#8217;s discuss these.</p><h4>Detecting Unsigned Overflow</h4><p>The most straightforward use of the carry flag is to detect when an arithmetic result is too large to fit in the available bits - a condition called overflow.</p><p>For example, imagine adding two 8-bit unsigned numbers 242 and 18.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eq1l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d3f834-68a6-4b0c-a47f-0b8282c9eb01_712x403.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eq1l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d3f834-68a6-4b0c-a47f-0b8282c9eb01_712x403.png 424w, https://substackcdn.com/image/fetch/$s_!eq1l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d3f834-68a6-4b0c-a47f-0b8282c9eb01_712x403.png 848w, https://substackcdn.com/image/fetch/$s_!eq1l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d3f834-68a6-4b0c-a47f-0b8282c9eb01_712x403.png 1272w, https://substackcdn.com/image/fetch/$s_!eq1l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d3f834-68a6-4b0c-a47f-0b8282c9eb01_712x403.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eq1l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d3f834-68a6-4b0c-a47f-0b8282c9eb01_712x403.png" width="712" height="403" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d4d3f834-68a6-4b0c-a47f-0b8282c9eb01_712x403.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:403,&quot;width&quot;:712,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:40357,&quot;alt&quot;:&quot;An example showing addition of two 8-bit values with a carry bit. The carry bit will result in the hardware setting the carry flag in the flags register to indicate an overflow&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/161089202?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d3f834-68a6-4b0c-a47f-0b8282c9eb01_712x403.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="An example showing addition of two 8-bit values with a carry bit. The carry bit will result in the hardware setting the carry flag in the flags register to indicate an overflow" title="An example showing addition of two 8-bit values with a carry bit. 
The carry bit will result in the hardware setting the carry flag in the flags register to indicate an overflow" srcset="https://substackcdn.com/image/fetch/$s_!eq1l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d3f834-68a6-4b0c-a47f-0b8282c9eb01_712x403.png 424w, https://substackcdn.com/image/fetch/$s_!eq1l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d3f834-68a6-4b0c-a47f-0b8282c9eb01_712x403.png 848w, https://substackcdn.com/image/fetch/$s_!eq1l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d3f834-68a6-4b0c-a47f-0b8282c9eb01_712x403.png 1272w, https://substackcdn.com/image/fetch/$s_!eq1l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4d3f834-68a6-4b0c-a47f-0b8282c9eb01_712x403.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg 
xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">An example showing addition of two 8-bit values with a carry bit. The carry bit will result in the hardware setting the carry flag in the flags register to indicate an overflow</figcaption></figure></div><p>The result, 260, doesn&#8217;t fit in 8 bits (which can only represent values from 0 to 255). The leading &#8220;1&#8221; is a ninth bit, which falls outside our 8-bit range. The processor sets the carry flag to indicate this overflow condition.</p><p>Why is this important? In real programs, if you don&#8217;t detect overflow, your calculations will silently produce incorrect results:</p><ul><li><p>The actual stored result would be just <code>00000100</code> (4 in decimal)</p></li><li><p>Your program would continue using this wrong value (4) instead of the correct result (260)</p></li></ul><p>Consider an accounting program that adds large financial values: an undetected overflow could cause funds to &#8220;disappear&#8221;!</p><h4>Multi-Precision Arithmetic</h4><p>&#8220;Multi-precision arithmetic&#8221; simply means working with numbers that are larger than what fits in a single register.</p><p>For example, let&#8217;s say we&#8217;re using an 8-bit processor but need to add two 16-bit numbers. 
We&#8217;d need to:</p><ol><li><p>Add the lower 8 bits of both numbers</p></li><li><p>Add the upper 8 bits of both numbers</p></li><li><p>Account for any carry from the first addition</p></li></ol><p>Here&#8217;s how it works, adding <code>1000</code> (<code>0x03E8</code>) <code>+</code> <code>2000</code> (<code>0x07D0</code>):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8R20!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1555b2c8-bdd5-4545-ba6b-e3dd3d259bf4_558x620.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8R20!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1555b2c8-bdd5-4545-ba6b-e3dd3d259bf4_558x620.png 424w, https://substackcdn.com/image/fetch/$s_!8R20!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1555b2c8-bdd5-4545-ba6b-e3dd3d259bf4_558x620.png 848w, https://substackcdn.com/image/fetch/$s_!8R20!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1555b2c8-bdd5-4545-ba6b-e3dd3d259bf4_558x620.png 1272w, https://substackcdn.com/image/fetch/$s_!8R20!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1555b2c8-bdd5-4545-ba6b-e3dd3d259bf4_558x620.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8R20!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1555b2c8-bdd5-4545-ba6b-e3dd3d259bf4_558x620.png" width="558" height="620" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1555b2c8-bdd5-4545-ba6b-e3dd3d259bf4_558x620.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:620,&quot;width&quot;:558,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:60767,&quot;alt&quot;:&quot;An example of addition of multi-precision arithmetic using the carry flag. The addition of two 16-bit values is broken down into two parts. First the lower 8 bits are added, then the upper 8-bits are added along with the carry from the previous step.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/161089202?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1555b2c8-bdd5-4545-ba6b-e3dd3d259bf4_558x620.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="An example of addition of multi-precision arithmetic using the carry flag. The addition of two 16-bit values is broken down into two parts. First the lower 8 bits are added, then the upper 8-bits are added along with the carry from the previous step." title="An example of addition of multi-precision arithmetic using the carry flag. The addition of two 16-bit values is broken down into two parts. First the lower 8 bits are added, then the upper 8-bits are added along with the carry from the previous step." 
srcset="https://substackcdn.com/image/fetch/$s_!8R20!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1555b2c8-bdd5-4545-ba6b-e3dd3d259bf4_558x620.png 424w, https://substackcdn.com/image/fetch/$s_!8R20!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1555b2c8-bdd5-4545-ba6b-e3dd3d259bf4_558x620.png 848w, https://substackcdn.com/image/fetch/$s_!8R20!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1555b2c8-bdd5-4545-ba6b-e3dd3d259bf4_558x620.png 1272w, https://substackcdn.com/image/fetch/$s_!8R20!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1555b2c8-bdd5-4545-ba6b-e3dd3d259bf4_558x620.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">An example of addition of multi-precision arithmetic using the carry flag. The addition of two 16-bit values is broken down into two parts. First the lower 8 bits are added, then the upper 8-bits are added along with the carry from the previous step.</figcaption></figure></div><p>The result is <code>0x0BB8</code>, which is 3000 in decimal. Without tracking the carry from the first addition, we would get <code>0x0AB8</code> (2744), which is wrong.</p><p>Assembly languages provide special instructions for these operations. For example, in x86, the <code>adc</code> (add with carry) instruction adds two values plus the carry flag, which makes multi-precision arithmetic straightforward.</p><h4>The Foundation of Comparison Operations</h4><p>The carry flag is also used for comparing unsigned values. When the processor compares two values, it actually subtracts them and sets flags based on the result, without storing the subtraction result.</p><p>For example, when comparing unsigned values A and B:</p><ul><li><p>If A &lt; B, the subtraction A - B requires a borrow, setting the carry flag</p></li><li><p>If A &gt; B, no borrow is needed, clearing the carry flag</p></li><li><p>If A == B, the result is zero, so the zero flag is set (and the carry flag is clear), indicating the values are equal</p></li></ul><p>By checking the carry flag, together with the zero flag, we can determine the outcome of the comparison and execute the appropriate code. 
When we learn about implementing conditional control flow (if statements) in assembly, we will see how this comes into play.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://codingconfessions.gumroad.com/l/ychdk&quot;,&quot;text&quot;:&quot;Get PDF&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://codingconfessions.gumroad.com/l/ychdk"><span>Get PDF</span></a></p>
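<p>The 16-bit addition from the multi-precision section above (1000 + 2000 on an 8-bit machine) can be sketched in Python. This is only a rough illustration under stated assumptions, not how any real processor is implemented: the helper names <code>add8</code> and <code>add16_on_8bit</code> are hypothetical, and the carry flag is modeled as a plain integer returned from the function.</p><pre><code><code>def add8(a, b, carry_in=0):
    """Add two 8-bit values plus a carry-in, as an 8-bit adder would.

    Returns (result, carry_out): result keeps only the low 8 bits and
    carry_out plays the role of the carry flag.
    """
    total = a + b + carry_in
    return total % 256, total // 256


def add16_on_8bit(x, y):
    """Add two 16-bit values using only 8-bit additions (hypothetical helper)."""
    x_hi, x_lo = divmod(x, 256)
    y_hi, y_lo = divmod(y, 256)
    lo, carry = add8(x_lo, y_lo)        # step 1: lower bytes, remember the carry
    hi, _ = add8(x_hi, y_hi, carry)     # step 2: upper bytes plus that carry
    return hi * 256 + lo                # a carry out of the upper byte is dropped


print(add8(0xE8, 0xD0))                     # lower bytes of 0x03E8 and 0x07D0
print(hex(add16_on_8bit(0x03E8, 0x07D0)))</code></code></pre><p>The first call prints <code>(184, 1)</code>: the low byte <code>0xB8</code> with the carry set. The second prints <code>0xbb8</code>, which is 3000. Passing the carry into the second <code>add8</code> call mirrors what the x86 <code>adc</code> instruction does; dropping it would give <code>0x0AB8</code> instead.</p>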
      <p>
          <a href="https://blog.codingconfessions.com/p/binary-arithmetic-and-bitwise-operations">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Understanding Computer Organization from First Principles]]></title><description><![CDATA[A ground-up model of how computers execute code, starting from logic gates and ending at the instruction cycle.]]></description><link>https://blog.codingconfessions.com/p/seeing-the-matrix</link><guid isPermaLink="false">https://blog.codingconfessions.com/p/seeing-the-matrix</guid><dc:creator><![CDATA[Abhinav Upadhyay]]></dc:creator><pubDate>Sat, 05 Apr 2025 17:54:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!aKam!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9c6b6f3-e65a-46be-ada9-68a166fbfcf8_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="pullquote"><p>&#8220;Do not try to bend the spoon. That's impossible. Instead, only try to realize the truth... there is no spoon.&#8221; &#8212; The Matrix</p></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aKam!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9c6b6f3-e65a-46be-ada9-68a166fbfcf8_1024x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aKam!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9c6b6f3-e65a-46be-ada9-68a166fbfcf8_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!aKam!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9c6b6f3-e65a-46be-ada9-68a166fbfcf8_1024x1536.png 848w, 
https://substackcdn.com/image/fetch/$s_!aKam!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9c6b6f3-e65a-46be-ada9-68a166fbfcf8_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!aKam!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9c6b6f3-e65a-46be-ada9-68a166fbfcf8_1024x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aKam!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9c6b6f3-e65a-46be-ada9-68a166fbfcf8_1024x1536.png" width="388" height="582" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c9c6b6f3-e65a-46be-ada9-68a166fbfcf8_1024x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:1024,&quot;resizeWidth&quot;:388,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;There is no spoon&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="There is no spoon" title="There is no spoon" srcset="https://substackcdn.com/image/fetch/$s_!aKam!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9c6b6f3-e65a-46be-ada9-68a166fbfcf8_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!aKam!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9c6b6f3-e65a-46be-ada9-68a166fbfcf8_1024x1536.png 848w, 
https://substackcdn.com/image/fetch/$s_!aKam!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9c6b6f3-e65a-46be-ada9-68a166fbfcf8_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!aKam!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9c6b6f3-e65a-46be-ada9-68a166fbfcf8_1024x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">There is no spoon</figcaption></figure></div><p>Most programmers work comfortably inside layers of 
abstraction: writing code, calling APIs, using tools, without needing to know what happens underneath. But systems-level thinking is about lifting the hood. It means understanding how things actually work, from source code all the way down to silicon.</p><p>This article is where that starts. We&#8217;ll build a concrete mental model of how a computer executes instructions, beginning at the hardware level with logic gates and circuits. From there, we&#8217;ll step through how those circuits form an ALU, how data moves through registers, and how a CPU follows instructions.</p><p>But that&#8217;s only part of the story. We&#8217;ll also look at how this hardware model shapes everything above it. How compilers turn code into machine instructions. How executables are structured. How the OS lays out a process in memory. None of these make full sense without the layer below.</p><p>If you want to understand systems, this is the foundation. You don&#8217;t need to memorize how every part works. What matters is building a model that helps you reason through the system when things break or behave in unexpected ways. That&#8217;s what systems-level thinking is about.</p><p>Let&#8217;s get started.</p><div><hr></div><p><em><strong>Quick heads-up before you read</strong></em></p><h5><em>This article was originally paywalled as part of my <a href="https://blog.codingconfessions.com/p/building-and-breaking-your-first">x86 assembly series</a>. 
If you find it useful, consider upgrading to get access to the full series, and discounts on upcoming courses and books.</em></h5><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.codingconfessions.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.codingconfessions.com/subscribe?"><span>Subscribe now</span></a></p><h5><em>There&#8217;s also a PDF ebook version of the series (currently 60 pages across four chapters), available separately. Paid subscribers get 40% off with an annual plan and 20% off with a monthly one; just email me to get your discounted link.</em></h5><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://codingconfessions.gumroad.com/l/ychdk/&quot;,&quot;text&quot;:&quot;Get EBook&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://codingconfessions.gumroad.com/l/ychdk/"><span>Get EBook</span></a></p><div><hr></div><h2>Building a Very Simple Processor</h2><p>We will learn how the hardware executes code through a thought exercise: we will construct a very simple processor based on the computational requirements of software, such as a simple calculator.</p><p>Even a simple calculator can perform a myriad of computations, so to begin with, we will focus on just one: adding two integers. Essentially, we want a simple computer capable of expressing and carrying out the following computation:</p><pre><code><code>int a = 10
int b = 20
int sum = a + b</code></code></pre><p>To be able to do this, what capabilities do we need in the hardware?</p><ul><li><p><strong>Representing information</strong>: How do we represent data such as the integers 10 and 20 here, and also how do we tell the hardware that it has to add the two values?</p></li><li><p><strong>Storing the input and output data</strong>: Where do the values of a and b, and the resulting sum, live?</p></li><li><p><strong>Doing the actual computation</strong>: How do we perform the addition in the hardware?</p></li></ul><p>This line of inquiry leads us to the work done by Claude Shannon, where we will find all our answers.</p><h3>Encoding Information using Binary</h3><p>You might be familiar with Claude Shannon&#8217;s work on information theory, which is the foundation underlying modern communication systems and data compression techniques.</p><p>Even before that, in his 1937 master&#8217;s thesis, he showed that circuits of electrical switches can encode information as binary data and implement Boolean logic. Essentially, if we represent the state of the circuit when it is closed as 1 and when it is open as 0, then we can encode 1 bit of information in that circuit, and by combining multiple such circuits, we can encode more information.</p><p>By encoding information in binary, we can leverage the power of binary arithmetic and Boolean algebra to implement complex arithmetic and logical calculations, paving the way for general-purpose computation and, ultimately, modern computers.</p><h3>Transistors as Digital Switches</h3><p>But modern processors aren&#8217;t built using manually operated electrical switches; they are built using digital switches that can be turned on and off automatically. This is made possible through transistors.</p><p>Transistors are semiconductor devices that act as electronic switches. They conduct current only when the voltage applied to them is above or below a certain threshold, depending on their configuration. 
This ability to switch on or off based on voltage levels makes them ideal for implementing digital circuits. By precisely controlling the flow of current through circuits built from transistors, we get digital switches. These switches form the fundamental building blocks of all modern chips.</p><h3>Transistors to Logic Gates</h3><p>Transistors are the bottommost layer of the computing stack, on which everything else is built. They are combined in specific configurations to build reusable components called logic gates.</p><p>You can think of a gate as a mathematical function (or, if you prefer code, a function in code) that takes one or more inputs and produces an output. Because we are working with digital circuits, all the inputs and outputs here are 1s and 0s.</p><p>For instance, the <code>NOT</code> gate takes one input and produces one output. As its name suggests, it inverts its input. So <code>NOT(1) = 0</code> and <code>NOT(0) = 1</code>.</p><p>Similarly, there is an <code>AND</code> gate that takes two inputs and produces one output (you can also make an <code>AND</code> gate that takes a larger number of inputs). Its output is 1 only when all of its inputs are 1:</p><pre><code><code>AND(0, 0) = 0
AND(0, 1) = 0
AND(1, 0) = 0
AND(1, 1) = 1</code></code></pre><p>We also have an <code>OR</code> gate which works like this:</p><pre><code><code>OR(0, 0) = 0
OR(0, 1) = 1
OR(1, 0) = 1
OR(1, 1) = 1</code></code></pre><p>Finally, there is a very useful gate called <code>XOR</code>:</p><pre><code><code>XOR(0, 0) = 0
XOR(0, 1) = 1
XOR(1, 0) = 1
XOR(1, 1) = 0</code></code></pre><p><a href="https://en.wikipedia.org/wiki/Boolean_algebra">Boolean algebra</a> establishes that by combining these basic operations, it is possible to compute any Boolean function, that is, any mapping from input bits to output bits, and this is how the computational circuits within the processor are designed. Let&#8217;s see how.</p><h3>Building Computational Circuits from Logic Gates</h3><p>We started with the goal of adding two integers in our simple hardware, and now we know that this can be accomplished using logic gates. The logic circuit which implements binary addition is called an adder. But, before looking at the circuit itself, let&#8217;s talk about binary addition.</p><p>Again, we can think of it like a mathematical or programming function. It receives two bits as input and produces two bits as output. One of the output bits represents the sum of the two input bits, and the second output bit represents the overflow or carry of the result.</p><pre><code><code>add(0, 0) = sum: 0, carry: 0
add(0, 1) = sum: 1, carry: 0
add(1, 0) = sum: 1, carry: 0
add(1, 1) = sum: 0, carry: 1
</code></code></pre><p>What you will see is that the mapping of the input bits to the sum value is identical to that of the XOR gate, while the mapping of the carry bit is identical to that of the AND gate. It means that an adder can be implemented by sending the input to an AND gate and a XOR gate, like the following figure:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RGpj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f9eb8f7-f0d0-473e-aa61-bcf61de5d9bf_480x300.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RGpj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f9eb8f7-f0d0-473e-aa61-bcf61de5d9bf_480x300.jpeg 424w, https://substackcdn.com/image/fetch/$s_!RGpj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f9eb8f7-f0d0-473e-aa61-bcf61de5d9bf_480x300.jpeg 848w, https://substackcdn.com/image/fetch/$s_!RGpj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f9eb8f7-f0d0-473e-aa61-bcf61de5d9bf_480x300.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!RGpj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f9eb8f7-f0d0-473e-aa61-bcf61de5d9bf_480x300.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RGpj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f9eb8f7-f0d0-473e-aa61-bcf61de5d9bf_480x300.jpeg" width="480" height="300" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6f9eb8f7-f0d0-473e-aa61-bcf61de5d9bf_480x300.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:300,&quot;width&quot;:480,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The circuit diagram of a half adder. Inputs A and B flowing into a XOR gate and an AND gate. The XOR gate produces the sum of the two bits: S, and the AND gate produces the carry bit: C&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The circuit diagram of a half adder. Inputs A and B flowing into a XOR gate and an AND gate. The XOR gate produces the sum of the two bits: S, and the AND gate produces the carry bit: C" title="The circuit diagram of a half adder. Inputs A and B flowing into a XOR gate and an AND gate. 
The XOR gate produces the sum of the two bits: S, and the AND gate produces the carry bit: C" srcset="https://substackcdn.com/image/fetch/$s_!RGpj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f9eb8f7-f0d0-473e-aa61-bcf61de5d9bf_480x300.jpeg 424w, https://substackcdn.com/image/fetch/$s_!RGpj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f9eb8f7-f0d0-473e-aa61-bcf61de5d9bf_480x300.jpeg 848w, https://substackcdn.com/image/fetch/$s_!RGpj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f9eb8f7-f0d0-473e-aa61-bcf61de5d9bf_480x300.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!RGpj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f9eb8f7-f0d0-473e-aa61-bcf61de5d9bf_480x300.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" 
width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The circuit diagram of a half adder. Inputs A and B flowing into a XOR gate and an AND gate. The XOR gate produces the sum of the two bits: S, and the AND gate produces the carry bit: C</figcaption></figure></div><p>This design is called a half adder because it adds only two input bits and has no input for an incoming carry, so on its own it is not enough for adding multibit numbers. For addition of two multibit numbers, we need three inputs: two input bits from the numbers at a given position, and one carry bit from the addition of bits at the previous position. For this, a slightly modified circuit is used, called the full adder.</p><p>I will not walk through its construction because this article is not about digital design, but the following diagram shows what it looks like. 
By chaining these adders together, we can create circuits capable of adding multibit numbers.<br></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h4jz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e1a7651-155d-4049-ba1a-0cde60933630_979x427.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h4jz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e1a7651-155d-4049-ba1a-0cde60933630_979x427.png 424w, https://substackcdn.com/image/fetch/$s_!h4jz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e1a7651-155d-4049-ba1a-0cde60933630_979x427.png 848w, https://substackcdn.com/image/fetch/$s_!h4jz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e1a7651-155d-4049-ba1a-0cde60933630_979x427.png 1272w, https://substackcdn.com/image/fetch/$s_!h4jz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e1a7651-155d-4049-ba1a-0cde60933630_979x427.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h4jz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e1a7651-155d-4049-ba1a-0cde60933630_979x427.png" width="979" height="427" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0e1a7651-155d-4049-ba1a-0cde60933630_979x427.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:427,&quot;width&quot;:979,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The circuit diagram of a full adder which takes three inputs, the two bits of the numbers being added and a carry bit from the addition of the bits at the previous position.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The circuit diagram of a full adder which takes three inputs, the two bits of the numbers being added and a carry bit from the addition of the bits at the previous position." title="The circuit diagram of a full adder which takes three inputs, the two bits of the numbers being added and a carry bit from the addition of the bits at the previous position." 
srcset="https://substackcdn.com/image/fetch/$s_!h4jz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e1a7651-155d-4049-ba1a-0cde60933630_979x427.png 424w, https://substackcdn.com/image/fetch/$s_!h4jz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e1a7651-155d-4049-ba1a-0cde60933630_979x427.png 848w, https://substackcdn.com/image/fetch/$s_!h4jz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e1a7651-155d-4049-ba1a-0cde60933630_979x427.png 1272w, https://substackcdn.com/image/fetch/$s_!h4jz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e1a7651-155d-4049-ba1a-0cde60933630_979x427.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The circuit diagram of a full adder which takes three inputs, the two bits of the numbers being added and a carry bit from the addition of the bits at the previous position.</figcaption></figure></div><h3>The ALU</h3><p>A real-world processor consists of several kinds of computational circuits for operations such as addition, subtraction, multiplication, division, and also logical operations (<code>AND</code>, <code>OR</code>, <code>NOT</code>). These circuits are combined in the form of an <strong>arithmetic logic unit (ALU)</strong>, and the various computational circuits within it are called <strong>functional units</strong>.</p><p>The inputs flow into the ALU, which activates the right functional unit and produces the output. 
Schematically, for our simple processor, it looks like the following diagram.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!95YZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46102d22-3960-460f-aa98-98ed17d3b1fc_1089x338.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!95YZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46102d22-3960-460f-aa98-98ed17d3b1fc_1089x338.png 424w, https://substackcdn.com/image/fetch/$s_!95YZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46102d22-3960-460f-aa98-98ed17d3b1fc_1089x338.png 848w, https://substackcdn.com/image/fetch/$s_!95YZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46102d22-3960-460f-aa98-98ed17d3b1fc_1089x338.png 1272w, https://substackcdn.com/image/fetch/$s_!95YZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46102d22-3960-460f-aa98-98ed17d3b1fc_1089x338.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!95YZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46102d22-3960-460f-aa98-98ed17d3b1fc_1089x338.png" width="1089" height="338" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/46102d22-3960-460f-aa98-98ed17d3b1fc_1089x338.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:338,&quot;width&quot;:1089,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:39094,&quot;alt&quot;:&quot;The ALU design so far, containing only an adder.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/160249113?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46102d22-3960-460f-aa98-98ed17d3b1fc_1089x338.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The ALU design so far, containing only an adder." title="The ALU design so far, containing only an adder." srcset="https://substackcdn.com/image/fetch/$s_!95YZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46102d22-3960-460f-aa98-98ed17d3b1fc_1089x338.png 424w, https://substackcdn.com/image/fetch/$s_!95YZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46102d22-3960-460f-aa98-98ed17d3b1fc_1089x338.png 848w, https://substackcdn.com/image/fetch/$s_!95YZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46102d22-3960-460f-aa98-98ed17d3b1fc_1089x338.png 1272w, https://substackcdn.com/image/fetch/$s_!95YZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46102d22-3960-460f-aa98-98ed17d3b1fc_1089x338.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft 
pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The ALU design so far, containing only an adder.</figcaption></figure></div><h3>The Need for Storage: Introducing Registers</h3><p>As you can see in the ALU diagram, it receives some input, performs the computation, and produces an output. So, the question arises: <strong>from where do these inputs come and where does the result go afterward? </strong>The answer is registers.</p><p>Apart from building computational circuits, transistors can be used to construct circuits that can hold state as well, i.e. memory. 
Using such circuits, we can construct memory units, and one such unit is the register.</p><p>Registers are fixed-size memory units capable of storing a small number of bits; for example, modern processors have 32-bit or 64-bit wide registers. They temporarily hold data during computation. For instance, when performing an add operation, the operands are first stored in registers, and then their values are fed into the ALU to perform the computation.</p><p>Typically, during program execution, data is moved into the registers from main memory (the RAM), and after the computation is done, the result is written back to main memory. This frees up the registers for other computations.</p><p>In our example of building a simple calculator, we need:</p><ul><li><p>Two registers to hold the numbers we want to add (let&#8217;s say R1 and R2)</p></li><li><p>One register to hold the result of the addition. However, we could also reuse one of the two input registers to store the output.</p></li></ul><p>But real-world processors have many registers, e.g., the 64-bit x86 architecture (x86-64) has 16 general-purpose registers. These registers are combined into a register file from which the data flows into the ALU. 
Let&#8217;s update the architecture diagram of our processor to see how it looks after the introduction of a register file consisting of 6 registers :</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!33MT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F319b3449-08c6-4307-8e44-464dba90299b_1191x814.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!33MT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F319b3449-08c6-4307-8e44-464dba90299b_1191x814.png 424w, https://substackcdn.com/image/fetch/$s_!33MT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F319b3449-08c6-4307-8e44-464dba90299b_1191x814.png 848w, https://substackcdn.com/image/fetch/$s_!33MT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F319b3449-08c6-4307-8e44-464dba90299b_1191x814.png 1272w, https://substackcdn.com/image/fetch/$s_!33MT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F319b3449-08c6-4307-8e44-464dba90299b_1191x814.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!33MT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F319b3449-08c6-4307-8e44-464dba90299b_1191x814.png" width="1191" height="814" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/319b3449-08c6-4307-8e44-464dba90299b_1191x814.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:814,&quot;width&quot;:1191,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:119088,&quot;alt&quot;:&quot;The architecture of the processor so far, consisting of a register file and an ALU&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/160249113?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F319b3449-08c6-4307-8e44-464dba90299b_1191x814.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The architecture of the processor so far, consisting of a register file and an ALU" title="The architecture of the processor so far, consisting of a register file and an ALU" srcset="https://substackcdn.com/image/fetch/$s_!33MT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F319b3449-08c6-4307-8e44-464dba90299b_1191x814.png 424w, https://substackcdn.com/image/fetch/$s_!33MT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F319b3449-08c6-4307-8e44-464dba90299b_1191x814.png 848w, https://substackcdn.com/image/fetch/$s_!33MT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F319b3449-08c6-4307-8e44-464dba90299b_1191x814.png 1272w, https://substackcdn.com/image/fetch/$s_!33MT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F319b3449-08c6-4307-8e44-464dba90299b_1191x814.png 
1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The architecture of the processor so far, consisting of a register file and an ALU</figcaption></figure></div><p>The diagram has started to show the typical organization of computer hardware at an abstract level. Typically, the ALU isn&#8217;t the only execution unit within the processor, so we have abstracted it inside an execution unit. The data flows from the register file into the execution unit to execute the instructions, and an output comes out of the execution units. 
As we go along, we will add more details to the diagram.</p><p>At this stage, our ALU is extremely limited; it can only perform addition. Let&#8217;s extend it.</p><h3>Adding Features and Control to Our Calculator</h3><p>A processor capable of just addition is not very useful. We need more features, such as subtraction, multiplication, and division. All of these require computational circuits similar to the adder. </p><p>For example, subtraction can be implemented using the adder itself by negating the second operand (in two&#8217;s complement, by inverting its bits and adding 1). But operations like multiplication need their own specialized circuits. We can implement these additional operations by adding a separate circuit for each one, resulting in a more capable ALU with multiple functional units.</p><blockquote><h5><em>Although we discussed the construction of the adder at the level of logic gates, we will not cover the remaining circuits at that level of detail. That knowledge is valuable, but as software engineers we do not need to understand how everything works at the circuit level.</em></h5></blockquote><p>With multiple functional units in the ALU and multiple registers, we need a way to control which operation is performed and which registers are involved. This is done by a component of the processor called the control unit.</p><h3>The Control Unit and Decoders</h3><p>The Control Unit is responsible for managing the flow of data between the registers and the ALU, and for selecting the appropriate functional unit within the ALU. 
It orchestrates the entire computation process by sending specific control signals to activate exactly the right components at the right time.</p><p>To do this, the Control Unit must know three things:</p><ol><li><p><strong>What operation needs to be performed.</strong> (e.g., addition, multiplication)</p></li><li><p><strong>Where the operands live.</strong> (i.e., which registers hold the operands)</p></li><li><p><strong>Where the result needs to be stored.</strong> (i.e., the destination register, or the memory address)</p></li></ol><p>This information is provided to the Control Unit in the form of a <strong>binary encoded instruction</strong>.</p><p>Yes, here we are talking about the programs that you and I write. Our programs get compiled down to a sequence of binary encoded instructions that the processor can decode and execute. We will talk about program execution later, but right now let&#8217;s focus on how the instructions are encoded and decoded.</p><h3>Instruction Encoding and Decoding</h3><p>Every hardware architecture usually has its own encoding format, and the details vary. So, instead of focusing on a specific implementation, we can try to get a general understanding of what this whole process involves. Let&#8217;s first understand what an encoded instruction may look like.</p><p>An instruction in our simple architecture needs to provide three pieces of information:</p><ul><li><p><strong>The opcode</strong>: which operation to perform</p></li><li><p><strong>The two operand registers</strong>: where the data for executing the operation comes from</p></li><li><p><strong>The destination register</strong>: where the result goes</p></li></ul><p>Now, let&#8217;s say our hardware currently supports four operations (+, -, /, *). We could encode these using just 2 bits, but to leave room for more operations in the future, we can make the opcode 3 bits wide and encode the operations as follows:</p><pre><code><code>000 = Addition  
001 = Multiplication
010 = Subtraction    
011 = Division</code></code></pre><p>Similarly, we have six registers which we can encode using 3 bits:</p><pre><code><code>000 = R1    
001 = R2    
010 = R3    
011 = R4    
100 = R5    
101 = R6</code></code></pre><p>For instance, an instruction to multiply the values in <code>R1</code> and <code>R2</code> and store the result in <code>R3</code> will be encoded as <code>001000001010</code>, i.e., <code>001</code> (multiplication), <code>000</code> (R1), <code>001</code> (R2), and <code>010</code> (R3).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qIUI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8aa78e7-a396-4d39-bfdc-a24b0f3fee93_958x273.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qIUI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8aa78e7-a396-4d39-bfdc-a24b0f3fee93_958x273.png 424w, https://substackcdn.com/image/fetch/$s_!qIUI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8aa78e7-a396-4d39-bfdc-a24b0f3fee93_958x273.png 848w, https://substackcdn.com/image/fetch/$s_!qIUI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8aa78e7-a396-4d39-bfdc-a24b0f3fee93_958x273.png 1272w, https://substackcdn.com/image/fetch/$s_!qIUI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8aa78e7-a396-4d39-bfdc-a24b0f3fee93_958x273.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qIUI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8aa78e7-a396-4d39-bfdc-a24b0f3fee93_958x273.png" width="958" height="273" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d8aa78e7-a396-4d39-bfdc-a24b0f3fee93_958x273.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:273,&quot;width&quot;:958,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:256178,&quot;alt&quot;:&quot;The binary encoding of the instruction to multiply the values in R1 and R2, and to store the output in R3.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/160249113?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8aa78e7-a396-4d39-bfdc-a24b0f3fee93_958x273.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The binary encoding of the instruction to multiply the values in R1 and R2, and to store the output in R3." title="The binary encoding of the instruction to multiply the values in R1 and R2, and to store the output in R3." 
srcset="https://substackcdn.com/image/fetch/$s_!qIUI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8aa78e7-a396-4d39-bfdc-a24b0f3fee93_958x273.png 424w, https://substackcdn.com/image/fetch/$s_!qIUI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8aa78e7-a396-4d39-bfdc-a24b0f3fee93_958x273.png 848w, https://substackcdn.com/image/fetch/$s_!qIUI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8aa78e7-a396-4d39-bfdc-a24b0f3fee93_958x273.png 1272w, https://substackcdn.com/image/fetch/$s_!qIUI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8aa78e7-a396-4d39-bfdc-a24b0f3fee93_958x273.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The binary encoding of the instruction to multiply the values in R1 and R2, and to store the output in R3.</figcaption></figure></div><p>To decode the instruction, the control unit contains a <a href="https://en.wikipedia.org/wiki/Binary_decoder">decoder</a>, a logic circuit (similar to the adder we saw previously) that maps the opcode and register bits to the right functional unit and registers.</p><p>Based on the decoded opcode and register names, the control unit sends control signals to the register file and the ALU, which use those signals internally to select the appropriate functional unit (e.g., the adder) and the right registers (e.g., R1).</p><p>The ALU uses the control signal to switch on the required functional unit. 
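</p><p>As a rough software analogy for the decoding step, here is a minimal Python sketch. It assumes the 12-bit layout implied by the multiplication example (opcode, first source register, second source register, destination register, 3 bits each). The <code>decode</code> function and the mnemonics <code>ADD</code>, <code>MUL</code>, <code>SUB</code>, and <code>DIV</code> are hypothetical names used only for illustration, and a real decoder is a combinational circuit rather than a table lookup in software.</p><pre><code># Opcode and register tables taken from the encoding tables in the article.
# The mnemonics (ADD, MUL, SUB, DIV) are hypothetical labels for illustration.
OPCODES = {"000": "ADD", "001": "MUL", "010": "SUB", "011": "DIV"}
REGISTERS = {
    "000": "R1", "001": "R2", "010": "R3",
    "011": "R4", "100": "R5", "101": "R6",
}


def decode(instruction):
    """Split a 12-bit instruction string into its four 3-bit fields.

    Assumed field order: opcode, source 1, source 2, destination.
    """
    if len(instruction) != 12 or set(instruction) - {"0", "1"}:
        raise ValueError("expected a string of 12 binary digits")
    opcode, src1, src2, dest = (instruction[i:i + 3] for i in range(0, 12, 3))
    # Unused bit patterns (e.g. opcode 111) raise KeyError in this sketch.
    return OPCODES[opcode], REGISTERS[src1], REGISTERS[src2], REGISTERS[dest]


# The example instruction from the text: multiply R1 and R2, store in R3.
print(decode("001000001010"))  # prints ('MUL', 'R1', 'R2', 'R3')
</code></pre><p>The returned names play the role of the control signals: the first selects a functional unit in the ALU, and the other three select registers in the register file.</p><p>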
In the case of the register file, a <a href="https://en.wikipedia.org/wiki/Multiplexer">multiplexer</a> is involved which selects the right register in the file, and its data flows into the ALU.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!W8Eb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18fa8e62-3962-4466-ae7b-33c5a94bfa13_1324x994.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!W8Eb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18fa8e62-3962-4466-ae7b-33c5a94bfa13_1324x994.png 424w, https://substackcdn.com/image/fetch/$s_!W8Eb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18fa8e62-3962-4466-ae7b-33c5a94bfa13_1324x994.png 848w, https://substackcdn.com/image/fetch/$s_!W8Eb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18fa8e62-3962-4466-ae7b-33c5a94bfa13_1324x994.png 1272w, https://substackcdn.com/image/fetch/$s_!W8Eb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18fa8e62-3962-4466-ae7b-33c5a94bfa13_1324x994.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!W8Eb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18fa8e62-3962-4466-ae7b-33c5a94bfa13_1324x994.png" width="1324" height="994" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/18fa8e62-3962-4466-ae7b-33c5a94bfa13_1324x994.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c291670a-2f2e-4f40-b686-d450a5996aab_1324x994.png&quot;,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:994,&quot;width&quot;:1324,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:149618,&quot;alt&quot;:&quot;The control unit decodes the instruction to identify the operation and the source and destination registers for executing that operation. It sends control signals to select the specific registers and functional units within the register file and the ALU to execute the instruction.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/160249113?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc291670a-2f2e-4f40-b686-d450a5996aab_1324x994.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The control unit decodes the instruction to identify the operation and the source and destination registers for executing that operation. It sends control signals to select the specific registers and functional units within the register file and the ALU to execute the instruction." title="The control unit decodes the instruction to identify the operation and the source and destination registers for executing that operation. It sends control signals to select the specific registers and functional units within the register file and the ALU to execute the instruction." 
srcset="https://substackcdn.com/image/fetch/$s_!W8Eb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18fa8e62-3962-4466-ae7b-33c5a94bfa13_1324x994.png 424w, https://substackcdn.com/image/fetch/$s_!W8Eb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18fa8e62-3962-4466-ae7b-33c5a94bfa13_1324x994.png 848w, https://substackcdn.com/image/fetch/$s_!W8Eb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18fa8e62-3962-4466-ae7b-33c5a94bfa13_1324x994.png 1272w, https://substackcdn.com/image/fetch/$s_!W8Eb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18fa8e62-3962-4466-ae7b-33c5a94bfa13_1324x994.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The control unit decodes the instruction to identify the operation and the source and destination registers for executing that operation. It sends control signals to select the specific registers and functional units within the register file and the ALU to execute the instruction.</figcaption></figure></div><h3>The Instruction Pointer and the Instruction Execution Cycle</h3><p>Up to this point, we have explained how the Control Unit decodes instructions and orchestrates data movement between the Register File and the ALU. But we haven&#8217;t discussed <strong>where the control unit gets these instructions from.</strong></p><p>In most hardware architectures, there is a special register called the <strong>instruction pointer</strong> (also called the program counter). This register holds the address of the next instruction of the program.</p><p>The control unit reads the address in the Instruction Pointer and fetches the instruction at that address from main memory. The fetched instruction is then stored in another special register called the <strong>instruction register (IR)</strong>. 
Once the instruction register has a new instruction, the whole process of decoding and executing it takes place.</p><p>These registers enable what&#8217;s known as the <strong>instruction execution cycle</strong>, the fundamental process by which all programs run:</p><ol><li><p><strong>Fetch</strong>: The Control Unit reads the address stored in the Instruction Pointer and retrieves the instruction from that memory location, placing it in the Instruction Register.</p></li><li><p><strong>Decode</strong>: The Control Unit decodes the instruction in the IR, determining which operation to perform and which registers to use.</p></li><li><p><strong>Execute</strong>: The appropriate ALU circuit is activated to perform the computation using data from the specified registers.</p></li><li><p><strong>Store</strong>: The result is written back to the destination register (or memory, as needed).</p></li><li><p><strong>Update IP</strong>: The Instruction Pointer is incremented to point to the next instruction in memory.</p></li></ol><p>This cycle explains how a program&#8217;s instructions are executed one after another. But how is the Instruction Pointer register initialized? That is the job of the operating system kernel.</p><p>When a program first starts, the operating system loads the program&#8217;s instructions into memory and initializes the Instruction Pointer to the address of the first instruction. From that point on, the hardware takes over, executing instructions one by one through this cycle.</p><div><hr></div><h2>But what about Memory?</h2><p>So far we have covered everything needed to construct bare-bones hardware capable of doing simple arithmetic. But we have not talked about memory, even though we referred to it a few times. Since most of the people reading my blog are experienced programmers, I expect you already know how memory works. 
But let&#8217;s touch upon it briefly to complete the picture.</p><h3>The Main Memory</h3><p>So far, we have only talked about registers as memory. Registers are fast, fixed-size memory units, but they have two significant limitations:</p><ul><li><p><strong>Limited Number of Registers</strong>:</p><ul><li><p>Registers are small and fast, but only a limited number of them are available. A typical program uses far more variables than the hardware has registers. This creates the need for a larger memory that can be used when the registers run out.</p></li></ul></li><li><p><strong>Lack of Address-based Access</strong>:</p><ul><li><p>Registers can only hold primitive data types, e.g. integers and floats. But we need more powerful types, such as arrays and structs, to build higher-level programming abstractions. These data types need address-based access. </p></li><li><p>For example, if an array <code>A</code> starts at address <code>0x04</code> and each element occupies 4 bytes, then <code>A[2]</code> lives at address <code>0x0c</code> (that is, <code>0x04 + 2 * 4</code>). This kind of computed access isn&#8217;t possible with registers.</p></li></ul></li></ul><p>This leads to the requirement of a larger memory with an address-based access scheme, which is what the main memory provides. While we don&#8217;t have space to go deeper into the physical construction of main memory, let&#8217;s discuss it briefly.</p><h3>Physical Structure of Main Memory</h3><p>Physically, the main memory is implemented as an array of memory cells, each storing one bit of information. These cells are arranged on silicon chips in a grid of rows and columns for manufacturing reasons. However, the memory is presented to the programmer as a linear sequence of bytes, each with a unique address. There are two main types of RAM, distinguished by their physical construction:</p><ul><li><p><strong>Dynamic RAM (DRAM)</strong>: The most common type of main memory. 
Each cell is composed of a capacitor and a transistor. The capacitor holds a charge to represent a 1 or 0, but the charge leaks over time, so the cells must be periodically refreshed. This makes DRAM slower than registers.</p></li><li><p><strong>Static RAM (SRAM)</strong>: Used primarily for CPU caches. Each cell is made from flip-flops instead of capacitors, making it faster and more reliable but also more expensive and power-hungry.</p></li></ul><h3>Communication Between CPU and Memory</h3><p>Interacting with the memory is not straightforward. Accessing it requires several pieces of information, such as:</p><ul><li><p>Are we reading from memory or writing to it?</p></li><li><p>What is the address where the operation is to be performed?</p></li></ul><p>To solve this communication challenge, there are multiple buses between the processor and the memory. These are:</p><ul><li><p><strong>Control Bus</strong>: Specifies whether the operation is read or write.</p></li><li><p><strong>Address Bus</strong>: Carries the memory address specifying where to read from or write to.</p></li><li><p><strong>Data Bus</strong>: Transfers the actual data between the processor and the memory.</p></li></ul><p>For instance, when the control unit needs to fetch an instruction, here are the things it needs to do:</p><ul><li><p>Read the instruction address from the Instruction Pointer and place it on the address bus</p></li><li><p>Update the control bus to indicate the read operation</p></li><li><p>Read the returned instruction from the data bus</p></li></ul><p>Apart from the processor using memory to fetch program instructions, the instructions themselves can also read or write memory. 
Let us discuss how that is made possible.</p><h3>The Load/Store Units</h3><p>We already discussed that the control unit fetches the instructions from main memory and puts them in the instruction register for decoding and execution.</p><p>Apart from that, the program instructions can themselves read or write memory. For instance, a program that iterates through an array of integers to sum them up has to bring each element of the array from memory into one of the registers to update the sum (which itself will be kept in a register). </p><p>Most hardware architectures have instructions that can move data between registers and memory, in both directions. For example, the <code>mov</code> instruction in x86 can be used to move data between the registers, and also between a register and the memory. At the hardware level, the CPU contains special execution units called the <strong>Load and Store units</strong> that are activated to execute such instructions.</p><p>The following diagram shows the processor architecture after the introduction of memory.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IAja!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3500a7e-22a2-48d9-9bb3-3cadbecef348_1329x1217.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IAja!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3500a7e-22a2-48d9-9bb3-3cadbecef348_1329x1217.png 424w, https://substackcdn.com/image/fetch/$s_!IAja!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3500a7e-22a2-48d9-9bb3-3cadbecef348_1329x1217.png 848w, 
https://substackcdn.com/image/fetch/$s_!IAja!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3500a7e-22a2-48d9-9bb3-3cadbecef348_1329x1217.png 1272w, https://substackcdn.com/image/fetch/$s_!IAja!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3500a7e-22a2-48d9-9bb3-3cadbecef348_1329x1217.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IAja!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3500a7e-22a2-48d9-9bb3-3cadbecef348_1329x1217.png" width="1329" height="1217" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c3500a7e-22a2-48d9-9bb3-3cadbecef348_1329x1217.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ffc9d8b3-a2e6-4e84-bb4e-d2115a208a90_1329x1217.png&quot;,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1217,&quot;width&quot;:1329,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:217168,&quot;alt&quot;:&quot;The organization of the computer hardware showing register file, execution units, control unit, memory and memory buses&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.codingconfessions.com/i/160249113?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffc9d8b3-a2e6-4e84-bb4e-d2115a208a90_1329x1217.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The organization of the computer hardware showing register file, execution units, control unit, memory and memory buses" title="The organization of the computer hardware showing register file, 
execution units, control unit, memory and memory buses" srcset="https://substackcdn.com/image/fetch/$s_!IAja!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3500a7e-22a2-48d9-9bb3-3cadbecef348_1329x1217.png 424w, https://substackcdn.com/image/fetch/$s_!IAja!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3500a7e-22a2-48d9-9bb3-3cadbecef348_1329x1217.png 848w, https://substackcdn.com/image/fetch/$s_!IAja!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3500a7e-22a2-48d9-9bb3-3cadbecef348_1329x1217.png 1272w, https://substackcdn.com/image/fetch/$s_!IAja!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3500a7e-22a2-48d9-9bb3-3cadbecef348_1329x1217.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The organization of the computer hardware showing register file, execution units, control unit, memory and memory buses</figcaption></figure></div><div><hr></div><h2>There is No Spoon</h2><p>This is a great place to mention an interesting observation that you will appreciate.</p><p>As far as the processor is concerned, everything is just bits. The hardware doesn't see integers, floating-point numbers, strings, or even code; it only processes binary data. The distinction between these abstractions exists <strong>only in our minds</strong> or at higher layers of abstraction, such as programming languages and compilers.</p><p>The hardware doesn't know whether a value in memory is a <code>float</code> or an <code>int</code>. It merely moves bits around according to the instructions it's given. For example, most processors have separate sets of registers for integer and floating-point values. When you instruct the processor to move data from memory to a register, it performs that operation without ever verifying the nature of the data.</p><p>If you mistakenly load a 32-bit floating-point value into an integer register, the hardware will simply treat those bits as an integer. It doesn't validate, interpret, or question; it only executes. The compiler enforces type safety in higher-level languages, but when you're working with assembly, you are responsible for ensuring that the right instructions operate on the right data. There&#8217;s no safety net.</p><p>Even the distinction between code and data is just another layer of abstraction. 
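</p><p>The float-versus-integer case described above is easy to reproduce from C. Here is a minimal sketch, assuming <code>float</code> is a 32-bit IEEE 754 value (true on mainstream platforms, though the C standard does not require it); the function names are hypothetical.</p><pre><code>#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

/* Assumption of this sketch: float occupies exactly 32 bits. */
_Static_assert(sizeof(float) == sizeof(uint32_t), "sketch assumes a 32-bit float");

/* Reinterpret the bits of a float as an unsigned 32-bit integer.
 * memcpy copies the bytes unchanged; no numeric conversion happens. */
uint32_t float_bits_as_u32(float value) {
    uint32_t bits;
    memcpy(&amp;bits, &amp;value, sizeof bits);
    return bits;
}

/* The reverse direction: treat a 32-bit pattern as a float. */
float u32_bits_as_float(uint32_t bits) {
    float value;
    memcpy(&amp;value, &amp;bits, sizeof value);
    return value;
}</code></pre><p>Under that assumption, <code>float_bits_as_u32(1.0f)</code> returns <code>0x3F800000</code>: the same 32 bits, now read as an integer rather than as the number 1.0. Code is treated the same way. 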
When the control unit fetches an instruction from the address in the Instruction Pointer, the processor simply decodes the bits as an instruction. If you somehow update the Instruction Pointer with the address of a piece of data, such as an integer, the control unit will still read it and attempt to decode it as an instruction. If the bit pattern is invalid as an instruction, the hardware will raise an exception and the program will crash. But it doesn&#8217;t do this because it realizes you provided data instead of code; it fails only because the bits didn&#8217;t correspond to a valid instruction.</p><p>In the end, <strong>the processor does not recognize or care about your abstractions</strong>. It only sees bits. This realization is similar to the moment in <em>The Matrix</em> when the child tells Neo: <strong>&#8220;There is no spoon.&#8221;</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NTsN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4df4ea9a-6182-4b9f-8d43-64dc93119d99_1286x539.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NTsN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4df4ea9a-6182-4b9f-8d43-64dc93119d99_1286x539.jpeg 424w, https://substackcdn.com/image/fetch/$s_!NTsN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4df4ea9a-6182-4b9f-8d43-64dc93119d99_1286x539.jpeg 848w, https://substackcdn.com/image/fetch/$s_!NTsN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4df4ea9a-6182-4b9f-8d43-64dc93119d99_1286x539.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!NTsN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4df4ea9a-6182-4b9f-8d43-64dc93119d99_1286x539.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NTsN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4df4ea9a-6182-4b9f-8d43-64dc93119d99_1286x539.jpeg" width="1286" height="539" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4df4ea9a-6182-4b9f-8d43-64dc93119d99_1286x539.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:539,&quot;width&quot;:1286,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The Spoon in the Matrix - WHY There is No Spoon - Matrix4Humans&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Spoon in the Matrix - WHY There is No Spoon - Matrix4Humans" title="The Spoon in the Matrix - WHY There is No Spoon - Matrix4Humans" srcset="https://substackcdn.com/image/fetch/$s_!NTsN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4df4ea9a-6182-4b9f-8d43-64dc93119d99_1286x539.jpeg 424w, https://substackcdn.com/image/fetch/$s_!NTsN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4df4ea9a-6182-4b9f-8d43-64dc93119d99_1286x539.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!NTsN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4df4ea9a-6182-4b9f-8d43-64dc93119d99_1286x539.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!NTsN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4df4ea9a-6182-4b9f-8d43-64dc93119d99_1286x539.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The scene from The Matrix: There is no spoon. 
</figcaption></figure></div><div><hr></div><h2>From High-Level Code to Hardware-Level Execution</h2><p>We have covered the fundamentals of how the hardware is organized. Now let&#8217;s connect everything from top to bottom and see how a high-level program is translated to machine code and executed.</p><h3>Compilation of High-Level Code to Machine Code</h3><p>As we have seen, the hardware understands only instructions encoded in binary form. But writing programs directly in binary is cumbersome and error-prone, which is why hardware designers define mnemonic instructions: cryptic but human-readable representations of these binary instructions. Together, these mnemonics form the assembly language.</p><p>For example, in x86 assembly, the following instruction adds the values in the registers <code>rax</code> and <code>rdx</code>, and stores the result back into <code>rdx</code>.</p><pre><code>add %rax, %rdx</code></pre><blockquote><p><code>rax</code><em> and </em><code>rdx</code><em> are register names in the x86-64 architecture, and the assembler syntax used here requires prefixing register names with </em><code>%</code><em>. Don&#8217;t worry about these details; we will cover them during the course.</em></p></blockquote><p>Even writing assembly by hand is cumbersome, so we use high-level languages which are compiled down to assembly or directly to machine code for execution. Consider the following high-level C code:</p><pre><code>long a = 10;
long b = 20;
long sum = a + b;</code></pre><p>The C compiler will compile this into assembly code which may look like this:</p><pre><code>movq $10, %rax # Store 10 in register rax
movq $20, %rdx # Store 20 in register rdx
addq %rax, %rdx # rdx = rax + rdx</code></pre><blockquote><h6>The # character starts a comment in the GNU assembler&#8217;s x86 syntax; some other assemblers, such as NASM, use ; instead.</h6></blockquote><p>Here&#8217;s what is happening in the code:</p><ul><li><p>The <code>movq</code> instruction is used in x86-64 assembly to write a long (64-bit integer) value into a register. So, the two <code>movq</code> instructions write the values 10 and 20 into <code>rax</code> and <code>rdx</code>, respectively.</p></li><li><p>The <code>addq</code> instruction adds the values of the two registers. The second register is used as the destination, so this is equivalent to <code>rdx = rax + rdx</code>.</p></li></ul><p>But this assembly code isn&#8217;t what the hardware understands. So, there is another translation step where this assembly code is converted into machine code by a tool called the <strong>assembler</strong>.</p><p>The assembler translates one assembly source file at a time and creates a binary file containing the machine code. This file is called the object file. If there are multiple assembly source files, then the assembler creates one object file for each of them.</p><p>Finally, to produce the executable file, a third tool, called the <strong>linker</strong>, stitches these object files together into a single file. The role of the linker is to resolve the addresses of the symbols and functions across multiple files.</p><p>For instance, if your C program consists of two source files, main.c and math.c, where main.c calls functions defined in math.c, then the compiler and assembler will produce two object files: main.o and math.o. Afterwards, the linker will generate the final executable such that it contains the code for all the functions called from main.</p><p>The output of the linker is the final executable binary file. 
Even though it is a binary file, its format must be one the operating system understands, because ultimately it is the OS that loads it into memory for execution.</p><h3>The Executable Binary Format</h3><p>To execute a program, the operating system needs to load it into memory and update the Instruction Pointer (IP) register with the address of the first instruction. For this, the OS must understand the format of the executable binary file.</p><p>Operating systems standardize the format of binaries. For example, Linux and BSD systems use the <a href="https://en.wikipedia.org/wiki/Executable_and_Linkable_Format">Executable and Linkable Format (ELF)</a>.</p><p>In ELF, the binary file is organized as a sequence of sections, where each section stores a particular kind of data. For example:</p><ul><li><p>The <em>text</em> section holds the binary-encoded instructions (the machine code).</p></li><li><p>The <em>data</em> section holds the statically declared program data (e.g., your initialized global variables).</p></li><li><p>The <em>bss</em> section holds uninitialized global data.</p></li></ul><blockquote><p><em><strong>Note:</strong> There are many more kinds of sections and the overall ELF format is more detailed, but we don&#8217;t need to know all that right now. Knowledge of the sections and why they exist will be useful when reading/writing assembly code.</em></p></blockquote><p>The organization into sections serves two purposes:</p><ul><li><p><strong>Contiguous Instruction Storage</strong>: The program instructions need to be stored contiguously in memory so that the hardware can simply increment the Instruction Pointer (IP) by the size of the instruction to get the next instruction address.</p></li><li><p><strong>Permission Management</strong>: Different kinds of data require different permissions. 
For example:</p><ul><li><p>Program code should be readable and executable, but not writable.</p></li><li><p>Data may be readable and writable, but not executable.</p></li><li><p>By segregating different kinds of data, the OS can set appropriate permissions to enhance security and stability.</p></li></ul></li></ul><p>This well-defined format enables the operating system to parse the file and load the different sections into different regions of memory with appropriate permission flags.</p><h3>Loading the Program into Memory</h3><p>When we execute a program, the operating system loads all the code and data into memory before execution begins. Because the operating system understands the binary format, this process is straightforward. The OS allocates memory pages in different regions of memory for different sections of the binary:</p><ul><li><p>The text section is placed in executable memory.</p></li><li><p>The data section is placed in read/write memory.</p></li></ul><p>After setting up the pages and loading all the data, the operating system updates the Instruction Pointer with the address of the first instruction of the program. 
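</p><p>In an ELF file, the address of that first instruction is recorded in the file header as the entry point. The following is a minimal sketch of reading it, assuming a Linux system whose C library provides <code>&lt;elf.h&gt;</code> and a 64-bit ELF file in the host&#8217;s byte order; <code>read_elf_entry</code> is a hypothetical helper name, not part of any library.</p><pre><code>#include &lt;elf.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

/* Reads the entry-point field (e_entry) from the header of a 64-bit ELF file.
 * Returns 0 on success, -1 if the file cannot be read or is not 64-bit ELF. */
int read_elf_entry(const char *path, uint64_t *entry) {
    FILE *f = fopen(path, "rb");
    if (f == NULL) return -1;

    /* The ELF header sits at the very start of the file. */
    Elf64_Ehdr header;
    size_t n = fread(&amp;header, sizeof header, 1, f);
    fclose(f);
    if (n != 1) return -1;

    /* Every ELF file begins with the magic bytes 0x7f 'E' 'L' 'F'. */
    if (memcmp(header.e_ident, ELFMAG, SELFMAG) != 0) return -1;
    /* This sketch only handles the 64-bit variant of the format. */
    if (header.e_ident[EI_CLASS] != ELFCLASS64) return -1;

    *entry = header.e_entry;
    return 0;
}</code></pre><p>For a position-independent executable, the stored value is an offset, and the OS adds the address at which it loaded the program to obtain the real entry point. 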
This is possible because the OS knows exactly where the code was loaded in memory.</p><p>Once this setup is done, control is transferred to the hardware to begin executing the program.</p><h3>The Instruction Fetch-Decode-Execute Cycle</h3><p>Now that the program is loaded into memory, the Control Unit takes over and begins executing instructions by repeating the Fetch-Decode-Execute Cycle:</p><ul><li><p><strong>Fetch</strong>: The Control Unit fetches the next instruction from memory using the address stored in the Instruction Pointer (IP).</p></li><li><p><strong>Decode</strong>: The Control Unit decodes the instruction to determine the operation, source operands, and destination register.</p></li><li><p><strong>Execute</strong>: The ALU performs the specified operation.</p></li><li><p><strong>Store</strong>: The result is written back to the appropriate register or memory location.</p></li><li><p><strong>Increment IP</strong>: The Instruction Pointer (IP) is incremented to point to the next instruction, and the cycle repeats until the program ends.</p></li></ul><p>This cycle is fundamental to how all processors operate. Each instruction is processed through these stages in a continuous loop until the program completes.</p><div><hr></div><h2>Conclusion</h2><p>We began this article by asking a simple question: What does it take to build a processor for a simple computer that can add two numbers? 
From this modest starting point, we&#8217;ve uncovered the fundamental architecture that underpins all modern computing.</p><p>Our simple processor led us to discover key components that exist in every computer:</p><ul><li><p><strong>Transistors and Logic Gates</strong>: The basic building blocks that implement Boolean operations and enable computation at the physical level.</p></li><li><p><strong>ALU</strong>: The computational core that began as a simple adder in our processor but expands to handle diverse operations in real processors.</p></li><li><p><strong>Registers</strong>: Fast, accessible storage locations that hold the data being actively processed.</p></li><li><p><strong>Control Unit</strong>: The orchestrator that decodes instructions and coordinates all components, determining which operations to perform and on what data.</p></li><li><p><strong>Instruction Execution Cycle</strong>: The fundamental fetch-decode-execute cycle that drives all program execution.</p></li><li><p><strong>Memory System</strong>: The larger storage hierarchy that holds both instructions and data beyond what fits in registers.</p></li></ul><p>While our simple processor serves as a clean conceptual model, real architectures like x86 and ARM are much more sophisticated. They incorporate advanced features such as pipelining, branch prediction, and cache hierarchies.</p><p>Yet despite this complexity, these advanced processors still follow the same fundamental organization and principles we&#8217;ve explored. They still fetch instructions from memory, decode them to determine operations, execute using an ALU, store results, and move to the next instruction.</p><p>By understanding this bottom layer, that is, how hardware actually implements computation, you now have a solid foundation for learning assembly programming. 
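</p><p>To make the recap concrete, the cycle summarized above can be sketched as a toy interpreter in C. This is only an illustration under stated assumptions: the three-byte instruction format, the opcodes <code>OP_LOADI</code>, <code>OP_ADD</code> and <code>OP_HALT</code>, and the function <code>run_toy_cpu</code> are all made up for this sketch and do not correspond to any real instruction set.</p><pre><code>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* Hypothetical instruction format: 3 bytes = opcode, operand A, operand B. */
enum { OP_HALT = 0, OP_LOADI = 1, OP_ADD = 2 };
enum { NUM_REGS = 4, INSN_SIZE = 3 };

/* Executes the program stored in `memory`, updating `regs` in place.
 * Returns 0 when a HALT instruction is reached, -1 on an invalid
 * instruction or when execution runs past the end of memory. */
int run_toy_cpu(const uint8_t *memory, size_t size, int64_t regs[NUM_REGS]) {
    size_t ip = 0; /* the instruction pointer */

    while (ip + INSN_SIZE &lt;= size) {
        /* Fetch: copy the instruction at IP into the "instruction register". */
        uint8_t ir[INSN_SIZE] = { memory[ip], memory[ip + 1], memory[ip + 2] };

        /* Decode: split the instruction into an opcode and two operands. */
        uint8_t opcode = ir[0], a = ir[1], b = ir[2];

        /* Execute, then store the result in the destination register. */
        switch (opcode) {
        case OP_HALT:
            return 0;
        case OP_LOADI: /* regs[a] = immediate value b */
            if (a &gt;= NUM_REGS) return -1;
            regs[a] = b;
            break;
        case OP_ADD: /* regs[b] = regs[a] + regs[b] */
            if (a &gt;= NUM_REGS || b &gt;= NUM_REGS) return -1;
            regs[b] += regs[a];
            break;
        default: /* the bits do not form a valid instruction */
            return -1;
        }

        /* Update IP: advance to the next instruction. */
        ip += INSN_SIZE;
    }
    return -1; /* ran off the end of memory without a HALT */
}</code></pre><p>Running it on the twelve bytes <code>{OP_LOADI, 0, 10, OP_LOADI, 1, 20, OP_ADD, 0, 1, OP_HALT, 0, 0}</code> mirrors the <code>movq</code>/<code>movq</code>/<code>addq</code> sequence from earlier and leaves 30 in register 1. An unknown opcode makes the function return -1, a rough stand-in for the invalid-instruction exception described earlier.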
</p><p>In our upcoming x86 assembly course, we&#8217;ll build on this foundation by exploring the specific registers, instructions, and memory addressing modes of the x86 architecture. We&#8217;ll connect high-level programming constructs to their low-level implementation, allowing you to see through the abstractions to the true nature of computation - just as Neo finally saw the Matrix for what it really was.</p><div><hr></div><h2>What to Read Next</h2><p>Real-world processors have many more advanced features for delivering high performance, such as instruction pipelining, caches, and branch prediction. 
If you are curious, then my recent article is the perfect stepping stone to learn about them.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;5418f375-7040-4984-b849-6ead9f866db1&quot;,&quot;caption&quot;:&quot;Even the most elegant algorithms can run painfully slow when they fight against your computer's underlying hardware. The difference between mediocre and exceptional performance often comes down to whether your code works with&#8212;or against&#8212;the CPU's architecture.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Hardware-Aware Coding: CPU Architecture Concepts Every Developer Should Know&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:14520974,&quot;name&quot;:&quot;Abhinav Upadhyay&quot;,&quot;bio&quot;:&quot;I'm a systems programmer, compiler enthusiast, and performance nerd. I explore CPUs, interpreters, and OS internals, breaking down complex topics. 
Through Confessions of a Code Addict, I share deep dives to help developers go beyond abstractions.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36855010-6fa5-4dc6-bd10-680bf316d237_757x757.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-03-21T11:11:05.104Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e1f511d0-2519-4282-bdfd-21af1c5b744d_1472x832.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://blog.codingconfessions.com/p/hardware-aware-coding&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:158157210,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:75,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Confessions of a Code Addict&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe440a724-cff0-437a-8361-d7699406ac22_500x500.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div>]]></content:encoded></item></channel></rss>