My Top 5 Favourite Features in Python 3.14
Exploring the concurrency, debugging, and performance upgrades that make Python 3.14 special.
The Pi release of Python (so named because its version number, 3.14, matches the first digits of π) is finally here. You can go through the full list of new features and major changes yourself in the release notes. In this post, I want to walk through my top 5 favorite features of this release that I find exciting as a Python programmer and as an engineer who loves studying system internals.
Free-Threaded Python
In practical terms, the free‑threaded build allows Python programs to take advantage of multiple CPU cores concurrently, enabling true parallel execution of threads for compute‑intensive workloads.
Until Python 3.13, it was not possible to run multiple threads in parallel in Python because of the global interpreter lock (GIL), a global mutex inside the Python interpreter. A thread must acquire this lock before it can execute Python code. This meant that even on a large multicore machine, your Python process was effectively using a single core. Solutions like multiprocessing were created as a workaround for this limitation.
Prior to the 3.13 release, PEP-703 was proposed to make the GIL optional. The PEP laid out a plan to introduce changes so that it would be possible to build a version of Python without the GIL by specifying a build flag.
The PEP was accepted, an experimental free-threaded build shipped with Python 3.13, and as of 3.14 it is officially supported. As a result, this release of Python comes in two builds: one with the GIL still in place, and one without it. If you use uv, you can install the two builds using these commands:
uv python install cpython-3.14.0   # with the GIL
uv python install cpython-3.14.0t  # free-threaded, without the GIL
Note that the free-threaded build of Python breaks the ABI, and all third-party packages that use the C API of CPython need to be recompiled, so not all scientific computing packages may be immediately available for it.
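If you want to check which build you are running, a quick sanity check looks like this. This is a minimal sketch: sysconfig.get_config_var("Py_GIL_DISABLED") reports whether the build was compiled without the GIL, and sys._is_gil_enabled() (an underscore-prefixed, implementation-specific API available since 3.13) reports whether the GIL is active at runtime. The CPU-burning threads are just an illustration for comparing the two builds.
import sys
import sysconfig
import threading
import time

# True if this build was compiled without the GIL (free-threaded build)
print("free-threaded build:", bool(sysconfig.get_config_var("Py_GIL_DISABLED")))

# True if the GIL is actually active in this process (CPython 3.13+)
print("GIL enabled at runtime:", sys._is_gil_enabled())

def burn(n):
    # Pure-Python CPU-bound loop: under the GIL these threads serialize,
    # on the free-threaded build they can run on separate cores.
    total = 0
    for i in range(n):
        total += i * i
    return total

start = time.perf_counter()
threads = [threading.Thread(target=burn, args=(5_000_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"elapsed: {time.perf_counter() - start:.2f}s")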
Reference Reading
PEP-703, which describes the work behind removing the GIL, is a great read to understand the challenges involved and how this work was done.
Concurrent Interpreters
A very exciting new feature in the 3.14 release is the introduction of the concurrent.interpreters module in the standard library. It allows you to run multiple Python interpreters in parallel within the same Python process. It enables yet another kind of parallelism in Python despite the GIL.
The actual implementation details behind this are tricky to explain, so I will do that in another post. But if you have read my article on CPython runtime bootstrapping, you might be able to put the pieces together. Here is the executive summary.
By default, the Python process has one main interpreter and one main thread. But now, you have the ability to create multiple interpreters on demand at runtime using the concurrent.interpreters module. These additional interpreters are also referred to as subinterpreters. Creating a subinterpreter is as easy as calling the create() function of concurrent.interpreters.
import concurrent.interpreters
interp1 = concurrent.interpreters.create()
After the above call, the Python process has two interpreters inside it. Internally, the runtime tracks these using a linked list of interpreter state objects. An interpreter state represents the internal execution state of an interpreter. By providing each interpreter its own interpreter state, the runtime isolates them at the Python code execution level.
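To make that isolation concrete, here is a small sketch using the exec() method of an interpreter, which runs a string of code inside it (per PEP 734). Each interpreter gets its own copy of every module, including __main__, so a global set in the main interpreter is invisible in the subinterpreter.
import concurrent.interpreters

interp = concurrent.interpreters.create()

# This global lives in the main interpreter's __main__ module only.
GREETING = "hello from main"

# The subinterpreter has its own __main__, so the name is not defined there.
interp.exec("print('GREETING' in globals())")  # prints: False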
To execute code on this new interpreter, we can invoke its call() method. For example:
>>> def sum(a, b):
...     return a + b
...
>>> interp1.call(sum, 10, 2)
12
However, this isn’t parallel execution because there is only one thread running in the Python process. So, the runtime simply switches the thread from executing the code inside the main interpreter to executing code inside the subinterpreter.
To execute code on the interpreter in its own thread, we can use the call_in_thread() method. Internally, this creates a new thread that executes the code in its own context. This is a non-blocking call, so we cannot get the result back directly. To communicate data between interpreters, we have to create a queue using the concurrent.interpreters.create_queue() function. Here is an example that puts all of this together.
>>> def add(q, a, b):
...     q.put(a + b)
...
>>> interp1 = concurrent.interpreters.create()
>>> queue = concurrent.interpreters.create_queue()
>>> t = interp1.call_in_thread(add, queue, 10, 2)
>>> result = queue.get()
>>> print(result)
12
Here, we have created a queue and passed it to the add function, which puts the result in it. In the main interpreter, we retrieve the result using the queue's get() method, which blocks until there is some data in the queue.
If you are curious about how all of this works under the hood, let me know and we can cover the internals in a future post.
Reference Reading
If you want to learn more about the runtime data structures behind this, I recommend the following article:
Remote Debugging Support
Beyond concurrency, Python 3.14 also introduces major improvements in tooling.
Debugging a running Python process has always been a pain. To debug it with a debugger such as pdb, you need to manually add breakpoints in the code, restart the process, and wait for them to be hit. In production systems, this can be infeasible.
The motivation for the new feature is to simplify this experience: with Python 3.14, you can attach to a running process using python -m pdb -p <pid>, eliminating the need to restart it.
Technically, the CPython interpreter already had provisions that allow external processes to connect to it and navigate its runtime state. This is how remote profilers, such as scalene, py-spy, and others, work. As part of PEP-768, this framework has been extended to allow debuggers to connect and debug the Python interpreter.
A debugger can now attach to a Python process and update specific fields in its runtime data structures to signal that it wants to begin debugging. When the interpreter detects this, it provides a debug prompt where you can set breakpoints and debug as usual.
While pdb has already been updated to support remote debugging, this framework also exposes an API, sys.remote_exec, so external debuggers can leverage this functionality without needing low-level C integration.
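As a minimal sketch of that API: sys.remote_exec() takes a process ID and the path to a Python source file, and the target interpreter executes that file at the next safe opportunity. The PID and the script path below are placeholders.
import sys

# Write a script for the target process to run; here it just dumps the
# stacks of all its threads so we can see what the process is doing.
with open("/tmp/inspect.py", "w") as f:
    f.write("import faulthandler; faulthandler.dump_traceback()\n")

# 12345 is a placeholder PID of a running Python 3.14 process.
sys.remote_exec(12345, "/tmp/inspect.py")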
Reference Video
In a past live session, I talked about how remote profilers work, which is essentially how the remote debugger has been implemented as well. So, if you are curious, give it a watch.
Incremental Garbage Collection
Complementing the concurrency and debugging improvements discussed earlier, this feature enhances runtime stability and responsiveness by addressing garbage collection performance.
In a past article, I explained in detail the cost of a full heap scan by the garbage collector in CPython. Needless to say, it is expensive, and it also introduces unpredictable latency spikes in your applications, because while the GC is running, the interpreter does not execute any Python code. Incremental garbage collection makes the GC overhead predictable, resulting in smoother performance for latency-sensitive workloads.
Let’s first understand how the GC used to work before this change. There were three collectable generations: the young generation, the old generation, and the oldest generation. Each generation had a configurable threshold that defined when the GC would scan it. For example, the young generation would be scanned once the number of objects allocated since the last collection exceeded its threshold.
Any object that survives a scan of the young generation gets promoted to the first old generation. The first old generation gets scanned when the young generation has been scanned a configured number of times, such as 10 times. When that happens, the GC scans both the young generation and the first old generation. Any object that survives a scan of the first old generation gets promoted to the second old generation (also known as the oldest generation).
The oldest generation is scanned when the first old generation has been scanned a configured number of times. When that threshold is reached, the GC performs a full heap scan, i.e., it scans all three generations. Naturally, this gets expensive.
Incremental garbage collection improves this. It reduces the number of GC generations to just two: young and old. On each GC cycle, the collector scans the young generation and a fraction of the old generation. This way, the amount of work that the GC does on each cycle becomes consistent and it eliminates those long pauses and latency spikes that were there due to a full heap scan.
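You can inspect and tune these knobs through the gc module. Here is a quick sketch; note that with the incremental collector in 3.14, the meaning of the second and third threshold values has changed, so treat the exact semantics as version-specific.
import gc

# The current collection thresholds, one value per generation.
print(gc.get_threshold())

# How many allocations/collections each generation has accumulated so far.
print(gc.get_count())

# Force a collection of just the youngest generation.
gc.collect(0)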
Reference Reading
If you want to read more about CPython’s garbage collector, I recommend the following articles:
Tail Calling Interpreter
Finally, my favorite change as part of this release is the tail calling interpreter. It is a rewrite of the bytecode dispatch loop in the CPython virtual machine and improves performance of Python code execution by ~5%.
The bytecode dispatch loop is the heart of the interpreter, where the bytecode instructions of your compiled Python program are evaluated. The faster this loop runs, the faster your Python program executes, so performance improvements in this area are always very exciting to understand. I have already written a very detailed article on the design and implementation of the dispatch loop in CPython, and I have another article in progress to explain the tail calling interpreter. So, I will be brief here.
Your Python program gets compiled to a sequence of bytecode instructions. For example, the following snippet shows the bytecode instructions for a single line of code: a + b. The bytecode dispatch loop iterates over these instructions one by one and executes them.
>>> import dis
>>> dis.dis("a + b")
  0           0 RESUME                   0

  1           2 LOAD_NAME                0 (a)
              4 LOAD_NAME                1 (b)
              6 BINARY_OP                0 (+)
             10 RETURN_VALUE
The most obvious way of writing this loop is using a switch case. The problem is that Python has hundreds of bytecode instructions, so this switch case becomes huge. Optimizing such large functions is hard for compilers; for example, the compiler cannot allocate registers optimally, and some of the key variables can get spilled onto the stack, resulting in poor performance.
CPython also has a computed goto based implementation of the dispatch loop but that also suffers from the same problem. If you are not familiar with computed goto based dispatch loop, read my article on the design and implementation of the CPython dispatch loop.
The tail calling interpreter solves this by separating the implementation of each bytecode instruction into an individual function. For example, there is one function for handling LOAD_NAME, another for BINARY_OP, and so on.
This implementation is called the tail calling interpreter because of the way these functions are written. At their end, instead of returning, these functions call the function for the next bytecode instruction. They do this by looking up a function pointer table using the next bytecode instruction as an index. The signature and return value of each of these functions is identical, and because these calls occur at the end of the function, they are tail calls. The compiler can optimize these tail calls and convert them into jumps, which avoids the overhead of function calls.
This implementation improves performance for one fundamental reason: it results in small functions for handling each bytecode instruction, which the compiler can optimize much better, including doing optimal register allocation.
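To make that shape concrete, here is a toy illustration in Python of per-instruction handler functions chained through a dispatch table. This is only a sketch of the structure; the real implementation is in C, and its benefit depends on the C compiler turning each handler's final call into a jump, a guarantee Python itself does not make.
# A toy "bytecode" program: (opcode, argument) pairs computing 10 + 2.
PROGRAM = [("PUSH", 10), ("PUSH", 2), ("ADD", None), ("RETURN", None)]

def op_push(stack, pc, arg):
    stack.append(arg)
    return dispatch(stack, pc + 1)   # "tail call" into the next handler

def op_add(stack, pc, arg):
    b, a = stack.pop(), stack.pop()
    stack.append(a + b)
    return dispatch(stack, pc + 1)

def op_return(stack, pc, arg):
    return stack.pop()

# One small function per opcode, looked up in a table. This is the shape
# the tail-calling interpreter gives the C code.
HANDLERS = {"PUSH": op_push, "ADD": op_add, "RETURN": op_return}

def dispatch(stack, pc):
    opcode, arg = PROGRAM[pc]
    return HANDLERS[opcode](stack, pc, arg)

print(dispatch([], 0))  # prints: 12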
Overall, this has shown improvement over the previous switch case and computed goto based implementations. However, it requires compiler support for performing tail call optimization which is not present in all compilers. As a result, right now the feature is opt-in only and you need to build CPython from source using a supported compiler, such as clang 19.
Reference Reading
If you want to understand the internals of the CPython bytecode interpreter and the dispatch loop, read the following article:
Wrapping Up
Although there are many other new features and improvements in this release of Python, I picked these because of my interest in Python internals and performance. Changes such as the remote debugger and the removal of the GIL are also very exciting to understand from an engineering point of view. Studying them can give you insights that help you improve as an engineer.
I have plans to write about some of these in future posts. But if you would like me to cover something specific, let me know.