7 Comments
hexheadtn

I have written two compilers as projects for my CS degrees and a few small ones for utilities. This was in the late 90s. It's nice to have an update on the current state of the art. The work you describe is a yeoman's gem. It's hard to believe, considering the many hours I spent developing C++ code to run on Mac/Win/Linux. The groups I did research with now have an automatic pipeline to run typical workflows. What you describe is an order of magnitude more. Rock on with your bad self(ves)! :-)

https://github.com/EpistasisLab/tpot

Yusuf Olokoba

Thank you!

Terri Burns

so cool to look under the hood in this way - thanks for sharing, yusuf + abhinav

RDM

Thanks for this. Hard-ish read (for me), so thanks in advance for your patience with what may well be two naive and/or frankly ignorant questions:

1) Why is Cython (as embedded in various Python libraries) insufficient? Because it somehow doesn't optimize for the processor target? What additional *performance* benefit does this confer, or is it just about "you can keep your efficient C++-like code, but run it across many platforms"?

2) Also, the "running multiple copies and picking the best from telemetry" approach. Very cool, but... designing code for the underlying chip is really designing for default microcode behavior, right? What if a user/designer were to set different, non-default chip behavior flags (say, hypothetically, changing the default branch prediction depth, or other ISA-specific behaviors...)?

Yusuf Olokoba

Cython is insufficient because it is not standalone. You can compile an extension with Cython, but it still needs to run within the Python interpreter. A major design goal for our compiler was to generate code that would run without any interpreter, as if the developer started from scratch and wrote pure C++/Rust.
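
To make that concrete, here is a minimal sketch (the module name is illustrative): even after Cython compiles an extension, the result is a shared library that only the interpreter can load.

```python
# setup.py: compile a hypothetical Cython source file into an extension module
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("fast_module.pyx"))
```

Building this produces a platform-specific shared library (a `.so`/`.pyd` file), but it isn't a standalone program: it can only be used via `import fast_module` inside a running Python interpreter, which is exactly the dependency our compiler is designed to remove.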

On your second question, I wouldn't necessarily say that designing code for a processor boils down to designing around one behaviour. Instead, there are multiple axes of variation and our goal is to find the best-performing program along all those axes. For example, we can choose to run a convolution on the CPU, GPU, or NPU. And even if we focused on one processor, there could be data-dependent variations in performance (e.g. some convolution algorithms are faster for smaller images, whereas others are faster for larger images). This amount of variation goes far beyond the microcode behaviour of a given processor.
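
As a rough illustration of the "run multiple candidates and pick the best from telemetry" idea (a simplified sketch, not our actual implementation; the candidate functions are placeholder stubs):

```python
import time

# Placeholder stubs for different implementations of the same convolution.
# In practice these could target the CPU, GPU, or NPU, or use different algorithms.
def conv_cpu_direct(image): ...
def conv_cpu_winograd(image): ...
def conv_gpu(image): ...

CANDIDATES = [conv_cpu_direct, conv_cpu_winograd, conv_gpu]

def pick_fastest(sample_input, trials=10):
    """Time each candidate on a representative input and return the winner."""
    timings = {}
    for fn in CANDIDATES:
        start = time.perf_counter()
        for _ in range(trials):
            fn(sample_input)
        timings[fn] = (time.perf_counter() - start) / trials
    return min(timings, key=timings.get)
```

Because the winner can change with the input (e.g. image size) and with the device, the selection has to be driven by measurements rather than by assumptions about any one processor's microcode behaviour.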

Frank Kurka

I asked Perplexity what it thought of the article and this is what it told me:

**Many of the objections to containers presented in the article stem from a misunderstanding of their purpose and best practices, likely compounded by inexperience in preparing efficient container images.** Here are the specific issues and misconceptions identified:

***

### **1. Equating Containers with Bloated Full OS Images**

- The article claims containers are “self-contained Linux filesystems with runtime isolation,” and then implies they are always heavy snapshots of a full OS, including “unnecessary Python packages, the Python interpreter itself, or any of the other unnecessary cruft that gets bundled into the container.”

- **Reality:** Containers can—and should—be built using minimal base images (e.g., `alpine`, `distroless`), sometimes containing only the compiled app binary and required libraries. Modern containerization best practice is “minimal, not maximal.”

***

### **2. Assuming Containers are Only Suitable for Linux Servers**

- The author suggests containers “run only on Linux servers” and that to distribute universally, you must move to standalone binaries.

- **Reality:** Containers are cross-platform. Docker runs on macOS, Windows, and in the cloud. Multi-arch builds (using Docker Buildx) make images for ARM, x86, and more. You can distribute containers as easily as binaries and often with greater portability.

***

### **3. Overlooking Layering and Optimization**

- The complaint about “months of wrangling with Dockerfiles” hints at poor experience with multi-stage builds, caching, and stripping layers.

- **Reality:** A well-crafted Dockerfile builds production images with only the needed runtime/binary, not the dev environment, producing small, secure artifacts.

***

### **4. Misunderstanding Distribution Paradigm**

- The suggestion is that standalone executables are preferable because containers are bloated and slow, not because of a real limitation in container tech.

- **Reality:** Containers excel at lifecycle management, updating, orchestration (Kubernetes), versioning, and isolation. They are the backbone of industry-wide AI deployment, CI/CD, and microservices.

***

### **5. Missing the Point of Containerization**

- The author never addresses how containers solve dependency hell, OS incompatibilities, GPU driver management, and network/service integration—crucial for real AI deployment.

- **Reality:** Without containers, you'd need installers, scripts, and manual dependency management—potentially causing version conflicts, poor reproducibility, and difficult scaling.

***

### **6. Assuming Binary == Universal**

- The author claims that binaries are universally portable, but in practice, OS and hardware differences, missing libraries, and security policies still create headaches.

***

### **In Summary:**

- The article’s objections are based mostly on early, suboptimal Docker experiences and misunderstanding efficient container workflows. With modern techniques, containers are the gold standard for distributing AI workloads.

- **Distributing minimal containers (with a Muna binary inside) is efficient, secure, scalable, and universal.** The argument against containers is not well-founded and does not align with industry norms or technical best practice.

[1](https://blog.codingconfessions.com/p/compiling-python-to-run-anywhere)

Yusuf Olokoba

Hey! I think there's a fundamental misunderstanding in Perplexity's response. We could have expanded upon the problem with containers in our intro:

The key insight is that most developers who want to use an AI model in their application have to do so in two steps: (1) build and deploy a container microservice which runs the model; and (2) send a request to that containerized microservice from their application.

What our technology enables is a total paradigm shift: developers can 'compile' their AI model to run anywhere (we handle things like dependencies, platform differences, etc); and at runtime, they can download and execute the compiled binary directly within their application's process.

A very simple example to get the point across: today, desktop applications can't ship a Docker container within their app in order to run an AI model; they have to either send a request to a web microservice, or write C++ code to run the model within their application's process. Our platform fully automates this process.
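
To sketch the difference (the endpoint, tag, and loader below are purely illustrative, not the actual Muna API):

```python
import requests

# Today: the model runs in a containerized microservice,
# and the desktop app talks to it over the network.
def caption_via_microservice(image_bytes: bytes) -> str:
    response = requests.post(
        "https://models.example.com/caption",  # hypothetical endpoint
        files={"image": image_bytes},
        timeout=30,
    )
    return response.json()["caption"]

# With a compiled model: a platform-specific binary is downloaded once,
# then invoked directly inside the application's own process.
def load_compiled_model(tag: str):
    """Illustrative stub for fetching and loading a compiled predictor."""
    raise NotImplementedError("placeholder for the compiled-binary loader")

def caption_in_process(image_bytes: bytes) -> str:
    predictor = load_compiled_model("@example/caption")  # hypothetical tag
    return predictor(image_bytes)
```

The second path is what compilation enables: no interpreter, no container, and no network hop between the application and the model.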

And one final note regarding containers: our technology actually improves the process of building and deploying AI models inside containers. In fact, we support running compiled AI models on cloud GPUs by spinning up a bare container and running Muna inside it: https://docs.muna.ai/predictions/remote . This saves engineering teams from having to build and manage containers and compute infrastructure.

Perplexity's summary says as much: "Distributing minimal containers (with a Muna binary inside) is efficient, secure, scalable, and universal."
