Mojo: The Future of AI Programming 🔥
Bridging the Gap Between Python's Simplicity and High-Performance AI Code
Introduction
"The limits of my language means the limits of my world." - Ludwig Wittgenstein, philosopher and logician.
AI has made significant advancements in various domains in recent months. Breakthroughs like ChatGPT and Midjourney have been propelled by progress in research, but there is an unsung hero behind them as well — Python. Over the last two decades, Python has solidified its position as the predominant language in the AI domain due to its simple syntax and extensive ecosystem. However, Python's poor performance is a significant drawback, particularly in compute-intensive domains like AI. Numerous workarounds and hacks attempt to overcome these performance limitations, but they are often insufficient and come with their own problems.
Recently, Modular.com introduced Mojo as part of their launch announcement. The startup aims to unify the way we train, deploy, and run AI models, and Mojo is part of its effort to bridge the gap between research and production for AI code. Researchers love Python, but production engineers often struggle to get Python code production-ready because of its performance issues.
Led by Chris Lattner, who has been behind successful compiler and programming-language projects such as Clang, LLVM, Apple's Swift, and the MLIR framework, Modular is a company to watch. In this post, we will briefly describe some of Python's performance issues, introduce Mojo, and showcase some of its key features.
Why Is Python Slow?
Python is interpreted — Interpreted languages are inherently slower because an interpreter executes the code rather than the hardware running it directly. In contrast, compiled languages generate machine code that the hardware executes directly, which can make them orders of magnitude faster.
Dynamic Typing — Python's dynamic typing means that variable data types are determined at runtime, introducing overhead from type checking and handling, which contributes to the language's slowness.
Memory Management — Python uses reference counting to manage memory. While convenient for the programmer, it means the interpreter must track references and clean up unused objects, which incurs overhead.
Lack of true multithreading — Modern CPUs have multiple cores, and maximizing performance requires utilizing multiple threads of execution. Although Python supports multithreading, the global interpreter lock (GIL) prevents multiple threads from executing Python bytecode simultaneously, leaving substantial performance gains untapped. Work is underway to reduce the GIL's scope, though: Python 3.12 introduces a per-interpreter GIL, so individual sub-interpreters no longer share a single lock. Read this article for more details. (A short demonstration of the GIL's effect follows this list.)
Lack of Optimizations for Certain Operations — Some operations, such as numerical computations, can be slow in pure Python compared to compiled languages like C or Fortran.
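To make the GIL's impact concrete, here is a small, self-contained experiment (exact timings are machine-dependent): a CPU-bound countdown takes roughly the same wall-clock time whether it runs on one thread or is split across two, because CPython lets only one thread execute bytecode at a time.

```python
import time
from threading import Thread

def count_down(n):
    # Pure CPU-bound work; no I/O, so threads cannot overlap under the GIL.
    while n > 0:
        n -= 1

N = 50_000_000

# Single-threaded baseline.
start = time.perf_counter()
count_down(N)
print(f"one thread:  {time.perf_counter() - start:.2f}s")

# The same work split across two threads: on CPython this takes about as long
# (sometimes longer, due to lock contention) because the threads run serially.
t1 = Thread(target=count_down, args=(N // 2,))
t2 = Thread(target=count_down, args=(N // 2,))
start = time.perf_counter()
t1.start(); t2.start()
t1.join(); t2.join()
print(f"two threads: {time.perf_counter() - start:.2f}s")
```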
These shortcomings are design trade-offs in Python, which prioritizes productivity and simple syntax. While Python may be well-suited for most regular software engineering work, it might not be the best choice for a compute-intensive domain like AI. (I should also note that all these performance issues are in reference to the CPython implementation of Python.)
There are workarounds for some of these issues. For example, performance-critical code can be written in compiled languages like C or C++ and called from Python. Many ML libraries, such as Numpy, Scipy, TensorFlow, and PyTorch, do exactly this. Because these libraries are implemented in C/C++, they can also leverage SIMD instructions to accelerate numerical computations.
However, not everyone is an expert in writing C. Even those who can write some C rarely have the skill to tune it for maximum performance on a given piece of hardware. Tools like Cython and Numba have been developed to make things easier, but they cannot guarantee optimizations in all cases and cannot target the variety of hardware used today for deploying AI applications, such as CPUs, GPUs, TPUs, and custom FPGAs.
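As a taste of what these tools look like, here is a minimal Numba sketch (the function and array size are illustrative): a single decorator JIT-compiles a numeric loop to machine code via LLVM, often yielding large speedups, but only for the subset of Python that Numba can compile.

```python
import numpy as np
from numba import njit  # pip install numba

@njit  # compiles this function to native code on its first call
def sum_loop(a):
    total = 0.0
    for i in range(a.shape[0]):  # this loop runs as compiled code, not bytecode
        total += a[i]
    return total

data = np.random.rand(10_000_000)
sum_loop(data)           # first call: includes compilation time
total = sum_loop(data)   # subsequent calls run at native speed
```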
A solution that allows writing code in Python (or Python-like syntax) while achieving maximum throughput from the hardware and supporting different hardware targets is needed. Mojo fits that description perfectly.
Introducing Mojo 🔥
Mojo is designed as a systems programming language from the ground up. Its syntax is a superset of Python's, which means it supports all Python syntax and adds more features on top. Mojo comes with an ahead-of-time (AOT) compiler powered by LLVM, a JIT compiler for interactive workflows such as the REPL and for kernel fusion, and an interpreter for metaprogramming. Moreover, it leverages the MLIR framework to support various hardware targets such as CPUs, GPUs, TPUs, and other custom hardware accelerators.
Features of Mojo
Performance: Mojo is designed to be as fast as C or Rust while offering the ease of use of Python. This balance is achieved through a combination of compiler optimizations and language features that make it easy to write efficient code. For example, it provides first-class support for writing SIMD code in Mojo, concurrency support to exploit multiple cores, and support for defining structs, which allows control over data layout in memory for maximum efficiency. Mojo is statically typed, which enables the compiler to optimize code further and perform error checks at compile time. It also supports powerful compile-time metaprogramming, allowing generic code to be compiled into multiple specialized versions parameterized by types, similar to generics in other languages but more potent.
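To give a flavor of these features, here is a small sketch in early Mojo Playground syntax. Mojo's syntax and standard library have changed across releases (e.g. `Float32` vs. `F32`, the `inout` keyword), so treat the exact spellings as assumptions rather than a definitive reference: a `struct` gives precise control over in-memory layout, and `fn` declarations are statically typed and checked at compile time.

```mojo
struct Point:
    var x: Float32  # fields have fixed types and a fixed in-memory layout
    var y: Float32

    fn __init__(inout self, x: Float32, y: Float32):
        self.x = x
        self.y = y

# `fn` functions are statically typed: argument and return types are
# checked at compile time, which lets the compiler optimize aggressively.
fn scale(p: Point, factor: Float32) -> Point:
    return Point(p.x * factor, p.y * factor)
```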
Memory Management: Mojo uses an ownership and borrowing system to manage its memory, eliminating the need for a garbage collector and ensuring consistent performance at runtime. Mojo's compiler tracks the lifetime of a variable using static analysis and destroys data as soon as it is no longer in use.
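A sketch of how this looks in code, again in early-Mojo syntax (the argument-convention keywords have been renamed in later releases, so consider them assumptions): arguments are immutable borrows by default, `inout` permits in-place mutation, and `owned` together with the `^` transfer operator moves a value so the compiler can destroy it deterministically after its last use.

```mojo
fn consume(owned s: String):
    print(s)  # `s` is owned here and destroyed after its last use; no GC involved

fn increment(inout x: Int):
    x += 1    # mutates the caller's value in place

fn main():
    let message: String = "hello"
    consume(message^)  # `^` transfers ownership; `message` is unusable afterwards
    var counter: Int = 0
    increment(counter)
```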
Portability: Built on top of the MLIR framework, Mojo is portable to a wide range of hardware devices, including CPUs, GPUs, and TPUs. This compatibility allows the same code to be compiled for different devices without modifications, making Mojo an excellent choice for building AI applications.
Interoperability with Python: Mojo not only supports Python's syntax but also allows running all available Python libraries. This means popular ML libraries like Numpy, Scipy, scikit-learn, TensorFlow, and PyTorch can be imported and used. This feature is significant because past attempts to replace Python as the standard language for AI have failed due to the lack of a comparable ecosystem.
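Here is a sketch of the interop. The import path differed across versions (the early Playground used `from PythonInterface import Python`; later releases use `from python import Python`), so treat the exact names as assumptions: Mojo loads the CPython interpreter in-process and hands back wrapped Python objects.

```mojo
from python import Python

fn use_numpy() raises:
    # Imports CPython's numpy; the returned value is a wrapped Python module.
    let np = Python.import_module("numpy")
    let a = np.arange(6).reshape(2, 3)
    print(a.sum())  # dispatches to numpy's own compiled implementation
```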
Performance Comparison Between Mojo and Python
During Modular's launch announcement, Jeremy Howard demonstrated a performance comparison between Python and Mojo using matrix multiplication, a fundamental operation in deep learning models. According to their benchmark, the most optimized Mojo implementation of matrix multiplication was 14,050 times faster than the Python implementation. They built it up incrementally, adding optimizations such as SIMD vectorization, execution across multiple cores, loop unrolling, and cache-friendly memory access to arrive at that roughly 14,000x speedup.
They also compared Python and Mojo's performance in the Mandelbrot set problem, showing that Mojo code was approximately 35,000 times faster than Python, a significant improvement.
Instead of reproducing all of that code and explanation here, I recommend checking their notebooks for a detailed comparison — you can find them here and here.
Mojo Code Example
While exploring Mojo's syntax and delving deep into code examples is beyond the scope of this post, I will provide a simple example to demonstrate its power. Let's consider the addition of two vectors.
A vector is an ordered collection of numbers, with the number of elements in a vector being its dimension. Adding two n-dimensional vectors involves adding each of their corresponding elements, resulting in another n-dimensional vector.
Mathematically it is expressed as follows:
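c = a + b, where cᵢ = aᵢ + bᵢ for i = 1, 2, …, n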
If we implement this in Python, it will look like the following:
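A minimal sketch of such an implementation, with a simple throughput harness (the harness details, such as the iteration count, are my assumptions):

```python
import time

def add_naive(a, b):
    # Element-wise addition over plain Python lists.
    c = [0.0] * len(a)
    for i in range(len(a)):
        c[i] = a[i] + b[i]
    return c

n = 1000
a = [float(i) for i in range(n)]
b = [float(i) for i in range(n)]

iters = 10_000
start = time.perf_counter()
for _ in range(iters):
    add_naive(a, b)
elapsed = time.perf_counter() - start
# Each call performs n floating-point additions.
print(f"{n * iters / elapsed / 1e9:.6f} GFLOP/s")
```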
This naive Python implementation is based on list iteration. For adding two 1000-dimensional vectors, it achieves a throughput of about 0.014778 GFLOP/s.
An improvement over this implementation involves using Numpy:
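A sketch of the Numpy version (the same assumed harness applies, omitted here for brevity); the loop now happens inside Numpy's compiled C code:

```python
import numpy as np

def add_numpy(a, b):
    # Numpy dispatches the element-wise add to compiled, vectorized C code.
    return a + b

n = 1000
a = np.arange(n, dtype=np.float64)
b = np.arange(n, dtype=np.float64)
c = add_numpy(a, b)
```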
For the addition of two 1000-dimensional vectors, this results in a throughput of 1.463256 GFLOP/s, about 99 times faster than the naive Python code.
If we run the same naive Python implementation we wrote initially, unchanged, in Mojo, we get a throughput of 0.176491 GFLOP/s, a straight 12x improvement without any additional work.
Next, we will implement the naive version in Mojo. Mojo provides support for lower-level primitives, such as pointers. We will allocate memory for our vectors using a pointer and add those vectors element-wise using a for loop, just like the naive Python implementation. Here's how it looks:
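A sketch of what this looked like in the early Mojo Playground (module paths such as `DType`, `Pointer`, and `Random` were renamed in later Mojo releases, so treat the imports as assumptions):

```mojo
from DType import DType
from Pointer import DTypePointer
from Random import rand

fn add_naive(c: DTypePointer[DType.f32],
             a: DTypePointer[DType.f32],
             b: DTypePointer[DType.f32], size: Int):
    # Same element-wise loop as the naive Python version, but over raw,
    # manually managed float32 buffers.
    for i in range(size):
        c.store(i, a.load(i) + b.load(i))

fn main():
    let n = 1000
    let a = DTypePointer[DType.f32].alloc(n)
    let b = DTypePointer[DType.f32].alloc(n)
    let c = DTypePointer[DType.f32].alloc(n)
    rand(a, n)  # fill the inputs with random values
    rand(b, n)
    add_naive(c, a, b, n)
    a.free(); b.free(); c.free()  # no GC: we free the buffers ourselves
```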
This code achieves a throughput of 2.155172 GFLOP/s, a 145x speedup over the naive Python implementation and a 1.47x speedup over the Numpy version. However, as the vector size increases, the Numpy version ultimately outperforms the naive Mojo version, which is understandable.
Lastly, let's implement a vectorized version in Mojo:
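Again a sketch in early-Playground syntax (`Functional.vectorize` and `TargetInfo.dtype_simd_width` are the early names; later releases moved and renamed these, so treat them as assumptions):

```mojo
from DType import DType
from Pointer import DTypePointer
from Functional import vectorize
from TargetInfo import dtype_simd_width

# Number of float32 lanes one SIMD instruction can process on this CPU.
alias nelts = dtype_simd_width[DType.f32]()

fn add_vectorized(c: DTypePointer[DType.f32],
                  a: DTypePointer[DType.f32],
                  b: DTypePointer[DType.f32], size: Int):
    @parameter
    fn add[width: Int](i: Int):
        # Load `width` elements from each input, add them in a single
        # SIMD operation, and store the result.
        c.simd_store[width](i, a.simd_load[width](i) + b.simd_load[width](i))
    # `vectorize` steps through the buffer `nelts` elements at a time and
    # handles any leftover tail elements with a smaller width.
    vectorize[nelts, add](size)
```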
The main idea behind vectorizing code is to eliminate explicit loops, which is exactly what we have done by using the `vectorize` function here. Explaining this code in detail is beyond the scope of this article, which is already long, but I plan to cover these advanced features of Mojo in a future post.
This implementation achieves a throughput of 13.333333 GFLOP/s, 902x faster than the naive Python implementation and 9.11x faster than the Numpy version (for 1000-dimensional vectors).
I conducted a larger benchmark with vectors of varying sizes to observe broader trends. The vectorized Mojo version outperforms all others by a large margin, with the Numpy version coming in second place. It's interesting to note that for smaller vector sizes, the naive Mojo version beats Numpy, but Numpy eventually catches up.
Challenges in Mojo’s Adoption
Mojo has great potential, and the team at Modular is capable of delivering on their promise. However, a potential challenge in adopting Mojo among Python programmers is the learning curve. While Mojo's syntax extends Python's, allowing code writing from day one, fully taking advantage of Mojo's powerful capabilities requires learning the rest of the language. Seasoned programmers with experience in other systems programming languages like C, C++, Rust, etc., will find it easier to pick up Mojo. However, less experienced programmers who have not worked with those languages may face a steeper learning curve. That being said, Mojo appears to be relatively simple compared to C++ or Rust, and a well-written guide could help bridge this gap.
Resources for Learning Mojo
Mojo Programming Manual: If you are curious about learning Mojo, you should check out the Mojo Programming Manual. It covers all the Mojo language features with code examples.
Mojo Playground: Currently, Mojo is not publicly available as it is still in its infancy and under development. However, you can apply for access to the Mojo Playground, where you can experiment with Mojo code inside Jupyter notebooks. Apply for access at this link.
Mojo Discussions: The Modular team and the Mojo community are very active on GitHub. If you have any questions, you can ask in the discussion forums on Mojo's GitHub repo.
Discord: Finally, if you want to interact with the Mojo dev team and hang out with other members of the community, you can join Modular’s discord server.
Conclusion: The Future of AI and Mojo
In conclusion, Mojo is an exciting new language that has the potential to revolutionize the AI domain. By offering a syntax that extends Python's, Mojo enables developers to leverage their existing knowledge while taking advantage of the performance improvements and hardware support that the language provides. Moreover, Mojo's interoperability with Python libraries ensures that the vast ecosystem of AI tools and frameworks remains accessible.
However, the adoption of Mojo may face challenges, particularly in terms of the learning curve for less experienced programmers. Providing comprehensive guides and resources to ease this learning process will be crucial for the widespread adoption of Mojo in the AI community.
As the Mojo project continues to evolve, it could disrupt the way AI models are built and deployed. By cutting the cost of running AI in production, it could make deploying custom models feasible for smaller companies that don't have millions of dollars in funding.
It will be exciting to see how Mojo's capabilities unfold and how it shapes the AI landscape in the years to come. By addressing the performance limitations of Python while maintaining its simplicity and extensive ecosystem, Mojo has the potential to become a game-changer in the world of AI development and deployment.
Addendum
A couple of corrections to the article:
Mojo comes with an ahead-of-time compiler, as well as a JIT for the REPL and kernel-fusion work, and an interpreter for the metaprogramming parts.
Even though Python has the GIL today, there is work in progress to reduce its scope. In Python 3.12, each Python sub-interpreter will have its own GIL rather than sharing one common GIL. See this article for more details.
Added a note that the performance limitations of Python discussed in the article are in reference to the CPython implementation.