A Deep Dive into the Underlying Architecture of Groq's LPU
What powers the groundbreaking performance of Groq's Language Processing Unit?
We’ve been talking about systems performance a lot lately. A few days ago, Groq made news for breaking LLM inference benchmarks with their Language Processing Unit (LPU) hardware. The LPU's success is due to the innovative design of its hardware architecture, combined with a powerful compiler. Given our recent focus on performance, understanding what lies underneath the LPU hardware and compiler is very relevant to our discussion.
Groq has not published any papers on the LPU itself, but in previous years they published two papers (see this 2020 paper and this 2022 paper) explaining the design and implementation of their tensor streaming processor (TSP), and how they built a distributed inference engine using TSPs. Although they have not officially stated this, the LPU is most likely based on this distributed system or an extension of its design.
In this article, we are going to break down the architecture of a TSP and its compiler, and then see how Groq built a reliable, high-throughput distributed AI inference engine using these TSPs.
The architecture of the TSP is very different from a conventional CPU or GPU chip, mostly to make the TSP hardware more deterministic. Let’s first talk about what causes non-determinism in a CPU or GPU.
Understanding Non-Determinism in CPU and GPU Microarchitectures
The microarchitecture of CPUs and GPUs has many features which make the execution of instructions on these devices non-deterministic, i.e., you cannot guarantee when a particular instruction will execute, how long it will take to finish, or when its results will be available.
For instance, the modern CPU has:
Superscalar architecture: capable of issuing multiple instructions per cycle
Out-of-order execution: it can execute instructions in a different order than the program specifies, as soon as their operands are ready
Speculative execution: when it encounters a branch, the CPU guesses whether the condition will be true or false and speculatively executes the predicted path ahead of time to improve throughput. If the guess is wrong, it discards that work and goes back to execute the other path of the branch.
Instruction pipelining: it divides instruction execution into multiple stages and executes them in a pipelined manner, so that as one instruction moves into the second stage of the pipeline, the CPU can feed a new instruction into the first stage. This again improves instruction throughput.
Multiple levels of caches: CPUs have 2-3 levels of caches to cut down the latency of instructions that need to load data from memory.
All of these make the order and timing of instruction execution in the CPU non-deterministic and hard to reason about. GPUs also have several sources of non-determinism, such as caches, shared and global memory, and dynamic resource partitioning.
The problem with this inherent non-determinism is that it becomes hard to reason about a program's performance and hard to guarantee a worst-case performance bound.
As a result, Groq came up with a radically new design for their tensor streaming processor (TSP), which offers a ton of parallelism but zero non-deterministic behavior. They eliminated a lot of complexity from the hardware to give the compiler more power and control over the exact scheduling and execution of instructions, which makes it possible to guarantee a bound on the performance of a program. Let's understand what the architecture of a TSP looks like from the inside.
The Architecture of the Tensor Streaming Processor (TSP)
The hardware design of the TSP is in stark contrast to that of a CPU or a GPU. A conventional multicore chip has a tiled architecture where each tile represents one processing core. Internally, each tile consists of a collection of functional units responsible for performing different kinds of computations: arithmetic ops, memory ops, logical ops, instruction control, and so on.
The designers of the TSP flipped this conventional design inside out. They moved the functional units outside of the cores and arranged them in a 2D grid. Each column of this grid holds functional units of a specific type and is called a slice. The following diagram shows the tiled design of a conventional multicore chip and a TSP side by side.
A TSP has the following functional slices:
MXM: For performing matrix ops
SXM: For shift and rotate operations on vectors
MEM: Memory read/write ops
VXM: Arithmetic ops on the vectors
ICU: Instruction Control Unit – unlike other slices, this is organized horizontally and it is responsible for fetching and dispatching instructions for execution on other slices.
With an understanding of the TSP's architecture, let’s shift our focus to its core operational mode - instruction execution. This is where Groq's engineering prowess truly shines, demonstrating how the hardware's design is directly informed by its execution capabilities.
Instruction Execution in a TSP
The TSP executes instructions in SIMD (single instruction multiple data) fashion. Each functional slice consists of 20 tiles and each tile is capable of producing 16 elements of a vector. Thus, a full slice can work with and produce vectors of max size 320 elements.
These slices interact with each other in a producer-consumer fashion, where one slice produces its results as a stream and another slice receives that stream as its input and works on it.
When a vector is read from memory, it is assigned a stream ID (between 0 and 31) and a direction of flow (East or West). The stream flows from one slice to the next. Each slice is free to process the stream and produce a new resultant stream, or it can let the stream flow unchanged to the next adjacent slice.
As each tile in the TSP processes 16 elements of a vector stream, instruction execution is pipelined to process a full vector efficiently. At timestamp t1, the ICU issues an instruction to the bottom-most tile of one of the functional slices. At the next timestamp, that instruction propagates northward to the tile above it, which operates on the next 16 elements of the vector, while the ICU issues a new instruction to the bottom-most tile. This results in staggered execution of the instructions, as shown in the diagram below:
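To make that staggered timing concrete alongside the diagram, here is a rough sketch in Python (my own simplification, not a model of the actual hardware) of how the 20 tiles of one slice each handle their 16 lanes of a 320-element vector, one timestep apart:

```python
# A rough sketch (my own simplification, not a hardware model) of staggered
# execution: an instruction issued to the bottom-most tile at `issue_time`
# propagates northward one tile per cycle, so tile i of a slice processes
# its 16 lanes of the 320-element vector at issue_time + i.

TILE_LANES = 16          # SIMD lanes per tile
TILES_PER_SLICE = 20     # tiles stacked vertically in one functional slice

def staggered_schedule(issue_time: int):
    """Return (tile, start_cycle, lane_range) for one 320-element vector op."""
    plan = []
    for tile in range(TILES_PER_SLICE):
        lanes = (tile * TILE_LANES, (tile + 1) * TILE_LANES)
        plan.append((tile, issue_time + tile, lanes))
    return plan

for tile, cycle, (lo, hi) in staggered_schedule(issue_time=0)[:3]:
    print(f"tile {tile:2d} starts at cycle {cycle}: lanes [{lo}, {hi})")
```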
The precision with which the TSP executes instructions is not just a product of its hardware but also of its synergistic relationship with the compiler. Examining the compiler and the instruction set architecture reveals how software and hardware coalesce to optimize performance.
The TSP’s Compiler and ISA
The designers of TSP have simplified the hardware and pushed many complexities associated with instruction execution into the compiler. The compiler needs to precisely schedule the instructions and data flow to correctly execute the given program and do it in the most efficient way possible.
The compiler has access to the following architecturally visible state of the TSP hardware:
320-lane programming abstraction: Each tile in the TSP chip is capable of operating on 16 elements of a vector (16 lanes) in SIMD manner. A vertical slice consists of 20 such tiles, thus there are a total of 320 SIMD lanes which are available for execution.
144 Independent Instruction Queues: There are 144 instruction queues on the chip, each capable of issuing one or more instructions per cycle. The compiler has full control over the program order in each of these queues.
64 logical streams per lane: Each lane has access to 64 logical streams which can be used to move operands or results. These are divided between the East and West directions, i.e. 32 of them can be used for moving data eastward while the other 32 move data westward.
220 MiB of globally shared SRAM
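To keep these numbers in one place, here is a small illustrative sketch; the figures come from the papers, but the class and field names are purely my own:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TSPVisibleState:
    """Architecturally visible resources the compiler schedules against.
    Numbers are from Groq's TSP papers; the class itself is illustrative."""
    simd_lanes: int = 320          # 20 tiles x 16 lanes per vertical slice
    instruction_queues: int = 144  # independently scheduled by the compiler
    streams_per_lane: int = 64     # 32 eastward + 32 westward logical streams
    sram_mib: int = 220            # globally shared on-chip SRAM

state = TSPVisibleState()
assert state.streams_per_lane == 32 + 32
print(state)
```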
As there is no non-deterministic behavior in the TSP hardware, the compiler has accurate knowledge about the latency of each instruction. Also, the compiler has full knowledge about the flow of data in the program being compiled (e.g. the computation graph of a deep neural net).
The compiler identifies the dependencies between the computation tasks and based on that assigns them for parallel execution on the available functional units on the TSP.
It schedules the instructions in such a way that the stream arrives at a slice just when the slice is ready to process it. The compiler can do this because it has precise knowledge of instruction latencies and data flow directions.
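Here is a toy example of that idea (not Groq's compiler, and the hop latency is an assumed number): because every latency is known statically, the cycle at which a consumer slice should start can be computed exactly so that its operand stream arrives just in time.

```python
# A toy model of static scheduling (not Groq's compiler). SLICE_HOP_LATENCY
# is an assumed number of cycles for a stream to move to an adjacent slice.

SLICE_HOP_LATENCY = 1

def consumer_issue_cycle(producer_issue: int, producer_latency: int,
                         hops_between_slices: int) -> int:
    """Cycle at which the consuming slice should start so that its operand
    stream arrives exactly when it is ready to process it."""
    return producer_issue + producer_latency + hops_between_slices * SLICE_HOP_LATENCY

# e.g. a memory read issued at cycle 10 with a 5-cycle latency, consumed by
# a slice 3 hops away:
print(consumer_issue_cycle(producer_issue=10, producer_latency=5,
                           hops_between_slices=3))  # 18
```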
Conceptually, the compiler is solving a two-dimensional scheduling problem, placing instructions and data in both time and space. The TSP programming model relies on two critical elements:
Deterministic data path in the hardware
Exposing temporal information about an instruction’s latency through the ISA
Because of these, the compiler’s back-end can track the position and time-of-use of any stream on-chip. The paper calls this “software-defined hardware”.
Scaling from TSP to LPU
So far we have looked at the architecture of the TSP and learned that it is massively parallel at the hardware level but devoid of any non-deterministic features. The compiler does all the heavy lifting of scheduling the instructions and data flow such that different functional slices operate on the vector data in the form of streams in SIMD manner.
It turns out the TSP is the foundational unit of the language processing unit (LPU). Many TSPs are combined in the form of a rack and many such racks are connected together to form a distributed system capable of delivering massive amounts of throughput. Let’s understand how Groq has built a distributed system out of the TSPs and how it works.
Designing a Multi-TSP System: Physical and Distributed Aspects
Just like the TSP, the design goals of a distributed multi-TSP system center around deterministic data flow and instruction execution, and low-latency communication between the nodes.
The design of a distributed TSP system starts from a node. A node is formed from 8 TSP devices enclosed within a chassis. Each of these devices has 11 pins: 7 pins are used to connect the TSP device to the other 7 TSP devices in the node, while the remaining 4 pins are used to form global links.
As each device in a node has 4 global links, a node has a total of 32 global links, which together act as a high-radix router with 32 virtual ports.
A high-radix router supports a large number of connections with high bandwidth and high performance, which is exactly what a high-performance distributed system needs.
Nine such TSP nodes with eight TSPs each are combined to form a rack. As each node in the rack has 32 ports, a rack overall has 32 x 9 = 288 global ports. 144 of these ports are used locally to provide full connectivity between the nodes for fast data flow within the rack, and the remaining 144 ports are used to connect to other racks.
A maximally configured system can support 145 such racks interconnected with each other, for a total of 10,440 TSPs, with at most 5 hops between any two TSPs in the system.
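A quick arithmetic check of these numbers (purely illustrative):

```python
# Quick arithmetic check (illustrative only) of the topology described above.
TSPS_PER_NODE = 8
GLOBAL_LINKS_PER_TSP = 4
NODES_PER_RACK = 9
MAX_RACKS = 145

node_ports = TSPS_PER_NODE * GLOBAL_LINKS_PER_TSP       # 32-port virtual router per node
rack_ports = node_ports * NODES_PER_RACK                # 288 global ports per rack
intra_rack, inter_rack = rack_ports // 2, rack_ports // 2   # 144 local + 144 outward
total_tsps = MAX_RACKS * NODES_PER_RACK * TSPS_PER_NODE

print(node_ports, rack_ports, intra_rack, inter_rack, total_tsps)  # 32 288 144 144 10440
```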
Realizing Determinism in TSP-Based Distributed Systems
In this scaled-up distributed system regime, the functional units of a single TSP act as a single processing core of a massively parallel processor. The computation model of the TSP is based on deterministic hardware, and the same behavior has to be extended to the distributed system regime as well.
There are a number of mechanisms to achieve this determinism and to synchronize the individual TSP units in the massively distributed system. Let’s see what they are.
Synchronizing Clocks of TSPs using Hardware Aligned Counters
Each TSP device contains a hardware counter called the hardware aligned counter (HAC) with an overflow period of 256 cycles. The TSPs use this to synchronize with each other using the following steps:
When two TSPs are interconnected, one of them transfers its HAC value to its peer. Upon receiving the value, the peer sends that value back to the sender. The sender observes the difference between its current HAC value and the value it received back. This difference is the latency of the link between the two devices. This process is repeated a number of times to establish a mean link latency between the two TSPs.
After this, the two devices are arranged in a parent-child relationship. The parent T0 periodically sends its current HAC value (HAC0) to the child T1. The child adds the mean link latency to HAC0 and compares the result to its own HAC value. The difference between the two values represents the initial misalignment due to continual clock drift. T1 adjusts its HAC value to reduce this difference. After repeating this process several times, the HAC values of the two TSPs converge within a small neighborhood, which represents the jitter of the link latency.
This protocol allows two TSPs to synchronize with each other, and it can be extended to a multi-hop network of TSPs by establishing a spanning tree of parent/child TSPs in the network.
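The following is a simplified, purely illustrative simulation of the alignment idea; in the real protocol the counters run in hardware and the parent's HAC keeps advancing, while here it is held fixed for brevity. The point is only that repeated small corrections shrink the misalignment down to the link jitter.

```python
# A simplified simulation of HAC alignment. In reality the parent's counter
# keeps advancing and everything runs in hardware; here the parent value is
# held fixed to show only the convergence of the child's counter.

HAC_PERIOD = 256  # the hardware-aligned counter overflows every 256 cycles

def align_child(child_hac: int, parent_hac: int, mean_link_latency: int,
                rounds: int = 10, gain: float = 0.5) -> int:
    """Iteratively nudge the child's HAC toward (parent HAC + link latency)."""
    for _ in range(rounds):
        expected = (parent_hac + mean_link_latency) % HAC_PERIOD
        drift = (expected - child_hac) % HAC_PERIOD
        if drift >= HAC_PERIOD // 2:   # interpret drift as a signed offset
            drift -= HAC_PERIOD
        child_hac = (child_hac + round(gain * drift)) % HAC_PERIOD
    return child_hac

# The child's counter converges toward parent + latency = 212:
print(align_child(child_hac=40, parent_hac=200, mean_link_latency=12))
```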
Initial Program Alignment
Before invoking a program for execution on a multi-TSP system, all the TSPs need to be aligned to correctly schedule the data flow and instruction execution across the system. This involves the following mechanisms.
At the level of an individual TSP, there are several independent functional units and 144 independent instruction queues. First of all, these need to be synchronized. To do this, the TSP supports the SYNC and NOTIFY instructions for barrier synchronization. In this scheme, the SYNC instruction puts all the instruction queues in a parked state and one of the queues acts as the notifier. When the notifier issues the NOTIFY instruction, it is broadcast to all the queues on the chip, at which point they are synchronized and resume operations.
While the SYNC/NOTIFY mechanism works for a single TSP, a multi-TSP system requires more. Two TSPs use the HAC to sync with each other; in addition, each TSP supports the DESKEW instruction. When a functional unit executes a DESKEW instruction, it stops processing any following instructions until the TSP's HAC overflows (an epoch boundary). This ensures that the next instruction is executed at an epoch boundary. Two TSPs can align the start of the execution of a program using the following mechanism:
The child TSP puts itself in a polling loop where every epoch it checks for the arrival of a vector from the parent TSP.
The parent TSP starts the execution of a program by first performing the DESKEW instruction, which ensures that the first instruction of the program is executed at an epoch boundary.
After the DESKEW, the parent TSP transmits a vector to the child TSP at an epoch boundary. The child TSP, on receiving the vector, stores it in a buffer.
At the next epoch boundary, the child TSP exits its polling loop and executes the RECEIVE instruction to process the vector.
This scheme ensures that two TSPs are correctly aligned at the start of a program execution.
To scale to a multi-hop system, this scheme can be executed repeatedly on each hop of the spanning tree.
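As a schematic sketch of this start-of-program alignment (with an assumed link latency, and epochs modeled simply as multiples of 256 cycles), the timing works out as follows:

```python
# A schematic sketch of start-of-program alignment. EPOCH comes from the
# 256-cycle HAC period; LINK_LATENCY is an assumed value that would be
# measured via the HAC protocol in practice.

EPOCH = 256
LINK_LATENCY = 40

def next_epoch_boundary(t: int) -> int:
    return ((t // EPOCH) + 1) * EPOCH

def aligned_start(parent_ready_at: int):
    """Return (parent_start, child_start) cycle times."""
    parent_start = next_epoch_boundary(parent_ready_at)   # DESKEW to a boundary
    vector_arrives = parent_start + LINK_LATENCY          # vector sent at that boundary
    child_start = next_epoch_boundary(vector_arrives)     # child exits its polling loop
    return parent_start, child_start

print(aligned_start(parent_ready_at=1000))  # (1024, 1280)
```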
Runtime Resynchronization
While the TSPs do a one-time synchronization at the beginning of a program, they also need to be resynchronized during program execution because each TSP has its own independent clock source.
For this, the TSPs use a lighter-weight scheme. Apart from the HAC, each TSP has a software aligned counter (SAC), which has the same overflow period as the HAC.
However, the SACs are not synchronized between TSPs; a SAC simply counts the TSP's clock cycles. The HAC value represents the global time of the distributed system, while the SAC represents the local time. Therefore, the delta between the HAC and SAC values gives the accumulated drift.
To resync the local and global times, the TSP executes a RUNTIME_DESKEW instruction. This instruction takes a single parameter: the number of cycles to stall. Each TSP in the system executes the RUNTIME_DESKEW instruction at the same time and stalls for that many cycles to realign its local time with the global time based on the accumulated drift.
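A toy illustration of the idea (not the actual ISA semantics): the HAC tracks global time, the SAC tracks local cycles, and their difference tells each TSP how long to stall.

```python
# A toy illustration of the RUNTIME_DESKEW idea (not actual ISA semantics):
# the delta between the local SAC and the globally aligned HAC tells a TSP
# how many cycles it must stall to reabsorb its accumulated drift.

def runtime_deskew_stall(hac: int, sac: int, period: int = 256) -> int:
    """Number of cycles a TSP should stall so local time matches global time."""
    return (sac - hac) % period   # how far the local clock has run ahead

# A TSP whose local counter has drifted 3 cycles ahead stalls for 3 cycles:
print(runtime_deskew_stall(hac=100, sac=103))  # 3
```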
The Role of Compiler in Software-Scheduled Networking
As we have seen, the deterministic execution guarantees are extended from a single TSP to a multi-TSP distributed system. This allows the compiler to have cycle-accurate knowledge of data movement within a TSP as well as across the full network. The compiler knows the exact timing of injecting a vector at a source TSP and the exact timing of its arrival at a target TSP. This is called software-scheduled networking: instead of the hardware dynamically managing the flow of data, the compiler resolves everything statically at compile time.
Let’s discuss the characteristics of the distributed TSP system which enable the compiler to perform software-scheduled networking.
Known Traffic Patterns
For deep learning models, the compiler can infer the flow of data based on the static computation graph of the model. The compiler also automates the distribution of the computation tasks across the available TSP devices in the network. As a result, the compiler computes the precise execution time of each subtask and the precise timing of the exchange of activations between layers. This makes the parallel decomposition step explicit and under full control of the compiler.
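Here is a toy sketch of why a static computation graph makes traffic patterns known up front; the layer names, cycle counts, byte sizes, and round-robin placement are all made up for illustration:

```python
# A toy sketch of why a static computation graph makes traffic known up
# front. Layer names, cycle counts, byte sizes, and the round-robin
# placement are all made up for illustration.

layers = [               # (name, compute_cycles, activation_bytes)
    ("embed",  1000, 4096),
    ("block0", 3000, 4096),
    ("block1", 3000, 4096),
    ("head",    500, 2048),
]

def static_transfer_schedule(layers, devices=2):
    """Assign layers round-robin to devices and list every cross-device
    activation transfer with the cycle at which it becomes ready."""
    t, transfers = 0, []
    for i, (name, cycles, out_bytes) in enumerate(layers):
        t += cycles
        src, dst = i % devices, (i + 1) % devices
        if src != dst and i + 1 < len(layers):
            transfers.append((name, src, dst, t, out_bytes))
    return transfers

for name, src, dst, ready_at, nbytes in static_transfer_schedule(layers):
    print(f"{name}: device {src} -> {dst}, {nbytes} B ready at cycle {ready_at}")
```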
Scheduled Data Flow
In a conventional networked system, the flow of packets through the network is managed by the hardware and the hardware optimizes the routes when it senses back-pressure in the network. This reactive adjustment in the data flow increases latency and introduces non-determinism in the flow of data.
To avoid this, a distributed multi-TSP system explicitly schedules the flow of data through the network using its compiler. The compiler routes the data so that there is no back-pressure or congestion build-up in the network at any point in time.
Apart from this, compiler-scheduled data flow also improves network latency: instead of one device requesting a piece of data from another device, the compiler can schedule the second device to proactively push the data to the first device just when it needs to work with it.
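A back-of-the-envelope comparison of the two models, with purely illustrative numbers, shows why the push style helps:

```python
# A back-of-the-envelope comparison (illustrative numbers only) of a
# request/response "pull" versus a compiler-scheduled "push": the push
# removes the request leg entirely because the producer already knows when
# the consumer will need the data.

LINK_LATENCY = 5  # cycles per hop (assumed)

def pull_latency(hops: int) -> int:
    return 2 * hops * LINK_LATENCY   # request travels out, data travels back

def push_latency(hops: int) -> int:
    return hops * LINK_LATENCY       # data is already in flight when needed

print(pull_latency(3), push_latency(3))  # 30 15
```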
Deterministic Load-balancing
Another advantage of scheduling the flow of data at compile time is that it allows the compiler to effectively load-balance the flow across the available links. In a conventional network, the hardware performs the routing decisions on a per-packet basis depending on the congestion metrics available in the router.
However, in the case of a multi-TSP system, the compiler performs this scheduling optimally based on the volume of data (i.e., the product of the tensor's dimensions H x W x C) and selects the number of links to spread the traffic across. This uses the available bandwidth in the system efficiently and reduces the overall latency.
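A small sketch of this kind of compile-time load balancing (my own heuristic, not Groq's actual algorithm): given the tensor volume and an assumed per-link capacity, the compiler can pick the number of links and stripe the transfer evenly.

```python
# A small sketch of compile-time load balancing (my own heuristic, not
# Groq's algorithm): pick how many links to stripe a tensor across based on
# its volume H*W*C and an assumed per-link capacity, then split it evenly.

def plan_links(h: int, w: int, c: int, elems_per_link: int, max_links: int):
    volume = h * w * c
    links = min(max_links, max(1, -(-volume // elems_per_link)))  # ceil division
    per_link = -(-volume // links)
    return links, per_link

# e.g. a 14x14x512 activation striped over links that carry 40,000 elements each:
print(plan_links(14, 14, 512, elems_per_link=40_000, max_links=4))  # (3, 33451)
```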
So far we have discussed what makes the LPU hardware fast; let's conclude by discussing how the hardware handles errors.
Error Correction and System Reliability Strategy
In a conventional networked system, transmission errors are handled at the link layer by simply retransmitting the packets. This strategy is not viable for the multi-TSP system because it introduces non-deterministic delays. Instead, the multi-TSP system handles errors using forward error correction.
In this strategy, the packets are transmitted hop by hop and any errors are corrected in situ. Any critical errors are flagged so that the runtime can replay the inference. The replay strategy handles any transient errors, but persistent errors require manual intervention.
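The runtime-side replay can be pictured with a minimal sketch (assumed behavior, not Groq's implementation): errors that forward error correction cannot fix simply cause the inference to be re-run, and persistent failures are escalated.

```python
# A minimal sketch (assumed runtime behavior, not Groq's implementation) of
# the replay strategy: forward error correction fixes most link errors in
# hardware; if a critical error is still flagged, the runtime re-runs the
# inference, and persistent failures are escalated.

def run_with_replay(run_once, max_replays: int = 3):
    """`run_once` returns (result, critical_error_flagged)."""
    for _ in range(max_replays + 1):
        result, critical_error = run_once()
        if not critical_error:
            return result
    raise RuntimeError("persistent error: manual intervention required")

# Hypothetical usage: an inference that flags one transient error, then succeeds.
attempts = iter([(None, True), ("output tokens", False)])
print(run_with_replay(lambda: next(attempts)))  # output tokens
```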
As a reliability strategy, each rack contains a spare TSP node. The runtime monitors the health of all the nodes in a rack and fails over to the spare node when it detects unrecoverable errors on one of the nodes.
Using software replay of inference and a spare node allows the system to gracefully recover from any critical errors.
Key Takeaways
This concludes our discussion of the architecture of the hardware underlying Groq’s LPU. We covered quite a lot of details about the TSP and how Groq extended it in the form of a massively parallel distributed system. Let’s summarize the key takeaways quickly.
Innovative Design: Groq's LPU, built on the Tensor Streaming Processor (TSP), showcases a departure from traditional CPU/GPU architectures, focusing on deterministic and predictable execution to enhance performance for language model inference tasks.
Deterministic Execution: The TSP architecture achieves determinism via a simplified hardware design, allowing compilers to precisely schedule instructions and data flow, which is critical for predictable performance.
Software-Scheduled Networking: Groq's approach to creating a distributed multi-TSP system eliminates the non-determinism typically introduced by routing and scheduling in conventional networked systems. The compiler plays a crucial role in data movement, ensuring efficiency and predictability.
Massive Parallelism: Through a combination of hardware design and compiler intelligence, the TSP, and by extension the LPU, supports highly parallel operations, making it particularly suited for large language models (LLMs).
System Scalability and Reliability: The physical and logical design of the TSP network enables scalability, supporting a system of over 10,000 TSPs with minimal latency and high reliability, including error correction strategies and failover mechanisms.
Potential Impact: The deterministic nature and scalable design of LPU technology hold the promise to redefine performance expectations for computational tasks, particularly in AI and machine learning, setting a new benchmark for future processor designs.
I hope you found this article insightful. If you have any questions or comments, do let me know.
Recently, we did a live session discussing the design of TSP and LPU and analyzing how they work. Check out the recording here.