A Deep Dive into the Underlying Architecture of Groq's LPU
What powers the groundbreaking performance of Groq's Language Processing Unit?
We’ve been talking about systems performance a lot lately. A few days ago, Groq made news for breaking LLM inference benchmarks with their Language Processing Unit (LPU) hardware. The LPU's success is due to the innovative design of its hardware architecture, combined with a powerful compiler. Given our recent focus on performance, understanding what lies underneath the LPU hardware and compiler is very relevant to our discussion.
Groq has not published any papers on the LPU itself, but in previous years they published two papers (see this 2020 paper and this 2022 paper) explaining the design and implementation of their tensor streaming processor (TSP), and how they built a distributed inference engine using TSPs. Although they have not officially stated this, the LPU is most likely based on this distributed system or an extension of its design.
In this article, we are going to break down the architecture of a TSP and its compiler, and then see how Groq built a reliable, high-throughput distributed AI inference engine using these TSPs.
The architecture of the TSP is very different from a conventional CPU or GPU chip, mostly to make the TSP hardware more deterministic. Let’s first talk about what causes non-determinism in a CPU or GPU.
Understanding Non-Determinism in CPU and GPU Microarchitectures
The microarchitecture of CPUs and GPUs has many features which make the execution of instructions on these devices non-deterministic, i.e., you cannot guarantee when a particular instruction will execute, how long it will take to finish, or when its results will be available.
For instance, the modern CPU has:
Superscalar architecture: capable of issuing multiple instructions per cycle
Out-of-order execution: it can execute instructions in a different order than the program specifies, as soon as their operands are ready
Speculative execution: when it encounters a branch, the CPU guesses whether the condition will be true or false and speculatively executes the predicted path ahead of time to improve throughput. If the guess is wrong, it discards that work and goes back to execute the other path of the branch.
Instruction pipelining: it divides instruction execution into multiple stages and executes them in a pipelined manner, so that as one instruction moves into the second stage of the pipeline, the CPU can feed a new instruction into the first stage. This again improves instruction throughput.
Multiple levels of caches: CPUs have 2-3 levels of caches to cut down the latency of instructions that need to load data from memory.
All of these make the order and timing of instruction execution in the CPU non-deterministic and hard to reason about. GPUs also have several sources of non-determinism, such as caches, shared and global memory, and dynamic resource partitioning.
The problem with this inherent non-determinism is that it becomes hard to reason about a program's performance and hard to guarantee a worst-case performance bound.
As a result, Groq came up with a radically new design for their tensor streaming processor (TSP), which offers a ton of parallelism but zero non-deterministic behavior. They eliminated a lot of complexity from the hardware to give the compiler more power and control over the exact scheduling and execution of instructions, which makes it possible to guarantee a bound on the performance of a program. Let's understand what the architecture of a TSP looks like from the inside.
The Architecture of the Tensor Streaming Processor (TSP)
The hardware design of the TSP is in stark contrast to that of a CPU or a GPU. A conventional multicore chip has a tiled architecture where each tile represents one processing core. Internally, each tile consists of a collection of functional units responsible for performing different kinds of computations: arithmetic ops, memory ops, logical ops, instruction control, and so on.
The designers of the TSP flipped this conventional design inside out. They moved the functional units outside of the cores and arranged them in a 2D grid. Each column of this grid holds functional units of a specific type and is called a slice. The following diagram shows the tiled design of a conventional multicore chip and a TSP side by side.
A TSP has the following functional slices:
MXM: For performing matrix ops
SXM: For shift and rotate operations on vectors
MEM: Memory read/write ops
VXM: Arithmetic ops on the vectors
ICU: Instruction Control Unit – unlike other slices, this is organized horizontally and it is responsible for fetching and dispatching instructions for execution on other slices.
With an understanding of the TSP's architecture, let’s shift our focus to its core operational mode - instruction execution. This is where Groq's engineering prowess truly shines, demonstrating how the hardware's design is directly informed by its execution capabilities.
Instruction Execution in a TSP
The TSP executes instructions in SIMD (single instruction multiple data) fashion. Each functional slice consists of 20 tiles and each tile is capable of producing 16 elements of a vector. Thus, a full slice can work with and produce vectors of max size 320 elements.
These slices interact with each other in a producer-consumer fashion, where one slice produces its results as a stream and another slice receives that stream as its input and works on it.
When a vector is read from memory, it is assigned a stream ID (between 0 and 31) and a direction of flow (East or West). The stream flows from one slice to the next. Each slice is free to process the stream and produce a new resultant stream, or it can let the stream flow unchanged to the next adjacent slice.
As each tile in the TSP processes 16 elements of a vector stream, instruction execution is pipelined to process a full vector efficiently. At timestamp t1, the ICU issues an instruction to the bottom-most tile of one of the functional slices. At the next timestamp, that instruction propagates northward to the tile above it, which operates on the next 16 elements of the vector, while the ICU issues a new instruction to the bottom-most tile. This results in staggered execution of the instructions, as shown in the diagram below:
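To make that staggered timing concrete alongside the diagram, here is a rough sketch in Python (my own simplification, not a model of the actual hardware) of how the 20 tiles of one slice each handle their 16 lanes of a 320-element vector, one timestep apart:

```python
# A rough sketch (my own simplification, not a hardware model) of staggered
# execution: an instruction issued to the bottom-most tile at `issue_time`
# propagates northward one tile per cycle, so tile i of a slice processes
# its 16 lanes of the 320-element vector at issue_time + i.

TILE_LANES = 16          # SIMD lanes per tile
TILES_PER_SLICE = 20     # tiles stacked vertically in one functional slice

def staggered_schedule(issue_time: int):
    """Return (tile, start_cycle, lane_range) for one 320-element vector op."""
    plan = []
    for tile in range(TILES_PER_SLICE):
        lanes = (tile * TILE_LANES, (tile + 1) * TILE_LANES)
        plan.append((tile, issue_time + tile, lanes))
    return plan

for tile, cycle, (lo, hi) in staggered_schedule(issue_time=0)[:3]:
    print(f"tile {tile:2d} starts at cycle {cycle}: lanes [{lo}, {hi})")
```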
The precision with which the TSP executes instructions is not just a product of its hardware but also of its synergistic relationship with the compiler. Examining the compiler and the instruction set architecture reveals how software and hardware coalesce to optimize performance.
The TSP’s Compiler and ISA
The designers of TSP have simplified the hardware and pushed many complexities associated with instruction execution into the compiler. The compiler needs to precisely schedule the instructions and data flow to correctly execute the given program and do it in the most efficient way possible.
The compiler has access to the following architecturally visible state of the TSP hardware:
320-lane programming abstraction: Each tile in the TSP chip is capable of operating on 16 elements of a vector (16 lanes) in SIMD manner. A vertical slice consists of 20 such tiles, thus there are a total of 320 SIMD lanes which are available for execution.
144 Independent Instruction Queues: There are 144 instruction queues on the chip, each capable of issuing one or more instructions per cycle. The compiler has full control over the program order in each of these queues.
64 logical streams per lane: Each lane has access to 64 logical streams which can be used to move operands or results. These are divided between the East and West directions, i.e. 32 of them can be used for moving data eastward while the other 32 move data westward.
220 MiB of globally shared SRAM
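To keep these numbers in one place, here is a small illustrative sketch; the figures come from the papers, but the class and field names are purely my own:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TSPVisibleState:
    """Architecturally visible resources the compiler schedules against.
    Numbers are from Groq's TSP papers; the class itself is illustrative."""
    simd_lanes: int = 320          # 20 tiles x 16 lanes per vertical slice
    instruction_queues: int = 144  # independently scheduled by the compiler
    streams_per_lane: int = 64     # 32 eastward + 32 westward logical streams
    sram_mib: int = 220            # globally shared on-chip SRAM

state = TSPVisibleState()
assert state.streams_per_lane == 32 + 32
print(state)
```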
As there is no non-deterministic behavior in the TSP hardware, the compiler has accurate knowledge about the latency of each instruction. Also, the compiler has full knowledge about the flow of data in the program being compiled (e.g. the computation graph of a deep neural net).
The compiler identifies the dependencies between the computation tasks and based on that assigns them for parallel execution on the available functional units on the TSP.
It schedules the instructions in such a way that the stream arrives at a slice just when the slice is ready to process it. The compiler can do this because it has precise knowledge of instruction latencies and data flow directions.
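Here is a toy example of that idea (not Groq's compiler, and the hop latency is an assumed number): because every latency is known statically, the cycle at which a consumer slice should start can be computed exactly so that its operand stream arrives just in time.

```python
# A toy model of static scheduling (not Groq's compiler). SLICE_HOP_LATENCY
# is an assumed number of cycles for a stream to move to an adjacent slice.

SLICE_HOP_LATENCY = 1

def consumer_issue_cycle(producer_issue: int, producer_latency: int,
                         hops_between_slices: int) -> int:
    """Cycle at which the consuming slice should start so that its operand
    stream arrives exactly when it is ready to process it."""
    return producer_issue + producer_latency + hops_between_slices * SLICE_HOP_LATENCY

# e.g. a memory read issued at cycle 10 with a 5-cycle latency, consumed by
# a slice 3 hops away:
print(consumer_issue_cycle(producer_issue=10, producer_latency=5,
                           hops_between_slices=3))  # 18
```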
Conceptually, the compiler is solving a two-dimensional scheduling problem, placing instructions and data in both time and space. The TSP programming model relies on two critical elements:
Deterministic data path in the hardware
Exposing temporal information about an instruction’s latency through the ISA
Because of these, the compiler’s back-end can track the position and time-of-use of any stream on-chip. The paper calls this “software-defined hardware”.
Scaling from TSP to LPU
So far we have looked at the architecture of the TSP and learned that it is massively parallel at the hardware level but devoid of any non-deterministic features. The compiler does all the heavy lifting of scheduling the instructions and data flow such that different functional slices operate on the vector data in the form of streams in SIMD manner.
It turns out the TSP is the foundational unit of the language processing unit (LPU). Many TSPs are combined in the form of a rack and many such racks are connected together to form a distributed system capable of delivering massive amounts of throughput. Let’s understand how Groq has built a distributed system out of the TSPs and how it works.
Designing a Multi-TSP System: Physical and Distributed Aspects
Just like the TSP, the design goals of a distributed multi-TSP system center around deterministic data flow and instruction execution, and low-latency communication between the nodes.
The design of a distributed TSP system starts from a node. A node is formed from 8 TSP devices enclosed within a chassis. Each of these devices has 11 pins: 7 pins are used to connect the TSP device to the other 7 TSP devices in the node, while the remaining 4 pins are used to form global links.
As each device in a node has 4 global links, a node has a total of 32 global links, which together act as a high-radix router with 32 virtual ports.
A high-radix router supports a large number of connections with high bandwidth and high performance, which is exactly what a high-performance distributed system needs.
Nine such TSP nodes with eight TSPs each are combined to form a rack. As each node in the rack has 32 ports, a rack overall has 32 x 9 = 288 global ports. 144 of these ports are used locally to provide full connectivity between the nodes for fast data flow within the rack, and the remaining 144 ports are used to connect to other racks.
A maximally configured system can support 145 such racks interconnected with each other, for a total of 10,440 TSPs, with at most 5 hops between any two TSPs in the system.
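A quick arithmetic check of these numbers (purely illustrative):

```python
# Quick arithmetic check (illustrative only) of the topology described above.
TSPS_PER_NODE = 8
GLOBAL_LINKS_PER_TSP = 4
NODES_PER_RACK = 9
MAX_RACKS = 145

node_ports = TSPS_PER_NODE * GLOBAL_LINKS_PER_TSP       # 32-port virtual router per node
rack_ports = node_ports * NODES_PER_RACK                # 288 global ports per rack
intra_rack, inter_rack = rack_ports // 2, rack_ports // 2   # 144 local + 144 outward
total_tsps = MAX_RACKS * NODES_PER_RACK * TSPS_PER_NODE

print(node_ports, rack_ports, intra_rack, inter_rack, total_tsps)  # 32 288 144 144 10440
```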
Realizing Determinism in TSP-Based Distributed Systems
In this scaled-up distributed system regime, the functional units of a single TSP act as a single processing core of a massively parallel processor. The computation model of the TSP is based on deterministic hardware, and the same behavior has to be extended to the distributed system regime as well.
There are a number of mechanisms to achieve this determinism and to synchronize the individual TSP units in the massively distributed system. Let’s see what they are.
Synchronizing Clocks of TSPs using Hardware Aligned Counters
Each TSP device contains a hardware counter called the hardware aligned counter (HAC) with an overflow period of 256 cycles. The TSPs use this to synchronize with each other using the following steps:
When two TSPs are interconnected, one of them transfers its HAC value to its peer. Upon receiving the value, the peer sends that value back to the sender. The sender observes the difference between its current HAC value and the value it received back. This difference is the latency of the link between the two devices. This process is repeated a number of times to establish a mean link latency between the two TSPs.
After this, the two devices are arranged in a parent-child relationship. The parent T0 periodically sends its current HAC value (HAC0) to the child T1. The child adds the mean link latency to HAC0 and compares the result to its own HAC value. The difference between the two values represents the initial misalignment due to continual clock drift. T1 adjusts its HAC value to reduce this difference. After repeating this process several times, the HAC values of the two TSPs converge within a small neighborhood, which represents the jitter of the link latency.
This protocol allows two TSPs to synchronize with each other, and it can be extended to a multi-hop network of TSPs by establishing a spanning tree of parent/child TSPs in the network.
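The following is a simplified, purely illustrative simulation of the alignment idea; in the real protocol the counters run in hardware and the parent's HAC keeps advancing, while here it is held fixed for brevity. The point is only that repeated small corrections shrink the misalignment down to the link jitter.

```python
# A simplified simulation of HAC alignment. In reality the parent's counter
# keeps advancing and everything runs in hardware; here the parent value is
# held fixed to show only the convergence of the child's counter.

HAC_PERIOD = 256  # the hardware-aligned counter overflows every 256 cycles

def align_child(child_hac: int, parent_hac: int, mean_link_latency: int,
                rounds: int = 10, gain: float = 0.5) -> int:
    """Iteratively nudge the child's HAC toward (parent HAC + link latency)."""
    for _ in range(rounds):
        expected = (parent_hac + mean_link_latency) % HAC_PERIOD
        drift = (expected - child_hac) % HAC_PERIOD
        if drift >= HAC_PERIOD // 2:   # interpret drift as a signed offset
            drift -= HAC_PERIOD
        child_hac = (child_hac + round(gain * drift)) % HAC_PERIOD
    return child_hac

# The child's counter converges toward parent + latency = 212:
print(align_child(child_hac=40, parent_hac=200, mean_link_latency=12))
```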
Initial Program Alignment
Before invoking a program for execution on a multi-TSP system, all the TSPs need to be aligned to correctly schedule the data flow and instruction execution across the system. This involves the following mechanisms.
At the level of an individual TSP, there are several independent functional units and 144 independent instruction queues. First of all, these need to be synchronized. To do this, the TSP supports the SYNC and NOTIFY instructions for barrier synchronization. In this scheme, the SYNC instruction puts all the instruction queues in a parked state and one of the queues acts as the notifier. When the notifier issues the NOTIFY instruction, it is broadcast to all the queues on the chip, at which point they are synchronized and resume operations.
While the SYNC/NOTIFY mechanism works for a single TSP, a multi-TSP system requires more. Two TSPs use the HAC to sync with each other; in addition, each TSP supports the DESKEW instruction. When a functional unit executes a DESKEW instruction, it stops processing any following instructions until the TSP's HAC overflows (an epoch boundary). This ensures that the next instruction is executed at an epoch boundary. Two TSPs can align the start of the execution of a program using the following mechanism:
The child TSP puts itself in a polling loop where every epoch it checks for the arrival of a vector from the parent TSP.
The parent TSP starts the execution of a program by first performing the DESKEW instruction, which ensures that the first instruction of the program is executed at an epoch boundary.
After the DESKEW, the parent TSP transmits a vector to the child TSP at an epoch boundary. The child TSP, on receiving the vector, stores it in a buffer.
At the next epoch boundary, the child TSP exits its polling loop and executes the RECEIVE instruction to process the vector.
This scheme ensures that two TSPs are correctly aligned at the start of a program execution.
To scale to a multi-hop system, this scheme can be executed repeatedly on each hop of the spanning tree.
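As a schematic sketch of this start-of-program alignment (with an assumed link latency, and epochs modeled simply as multiples of 256 cycles), the timing works out as follows:

```python
# A schematic sketch of start-of-program alignment. EPOCH comes from the
# 256-cycle HAC period; LINK_LATENCY is an assumed value that would be
# measured via the HAC protocol in practice.

EPOCH = 256
LINK_LATENCY = 40

def next_epoch_boundary(t: int) -> int:
    return ((t // EPOCH) + 1) * EPOCH

def aligned_start(parent_ready_at: int):
    """Return (parent_start, child_start) cycle times."""
    parent_start = next_epoch_boundary(parent_ready_at)   # DESKEW to a boundary
    vector_arrives = parent_start + LINK_LATENCY          # vector sent at that boundary
    child_start = next_epoch_boundary(vector_arrives)     # child exits its polling loop
    return parent_start, child_start

print(aligned_start(parent_ready_at=1000))  # (1024, 1280)
```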
Runtime Resynchronization
While the TSPs do a one-time synchronization at the beginning of a program, they also need to be resynchronized during program execution because each TSP has its own independent clock source.
For this, the TSPs use a lighter-weight scheme. Apart from the HAC, each TSP has a software aligned counter (SAC), which has the same overflow period as the HAC.
However, the SACs are not synchronized between TSPs; a SAC simply counts the TSP's clock cycles. The HAC value represents the global time of the distributed system, while the SAC represents the local time. Therefore, the delta between the HAC and SAC values gives the accumulated drift.
To resync the local and global times, the TSP executes a RUNTIME_DESKEW instruction. This instruction takes a single parameter: the number of cycles to stall. Each TSP in the system executes the RUNTIME_DESKEW instruction at the same time and stalls for that many cycles to realign its local time with the global time based on the accumulated drift.
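A toy illustration of the idea (not the actual ISA semantics): the HAC tracks global time, the SAC tracks local cycles, and their difference tells each TSP how long to stall.

```python
# A toy illustration of the RUNTIME_DESKEW idea (not actual ISA semantics):
# the delta between the local SAC and the globally aligned HAC tells a TSP
# how many cycles it must stall to reabsorb its accumulated drift.

def runtime_deskew_stall(hac: int, sac: int, period: int = 256) -> int:
    """Number of cycles a TSP should stall so local time matches global time."""
    return (sac - hac) % period   # how far the local clock has run ahead

# A TSP whose local counter has drifted 3 cycles ahead stalls for 3 cycles:
print(runtime_deskew_stall(hac=100, sac=103))  # 3
```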
The Role of Compiler in Software-Scheduled Networking
As we have seen, the deterministic execution guarantees are extended from a single TSP to a multi-TSP distributed system. This allows the compiler to have cycle-accurate knowledge of data movement within a TSP as well as across the full network. The compiler knows the exact timing of injecting a vector at a source TSP and the exact timing of its arrival at a target TSP. This is called software-scheduled networking: instead of the hardware dynamically managing the flow of data, the compiler resolves everything statically at compile time.
Let’s discuss the characteristics of the distributed TSP system which enable the compiler to perform software-scheduled networking.
Known Traffic Patterns
For deep learning models, the compiler can infer the flow of data based on the static computation graph of the model. The compiler also automates the distribution of the computation tasks across the available TSP devices in the network. As a result, the compiler computes the precise execution time of each subtask and the precise timing of the exchange of activations between layers. This makes the parallel decomposition step explicit and under full control of the compiler.
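Here is a toy sketch of why a static computation graph makes traffic patterns known up front; the layer names, cycle counts, byte sizes, and round-robin placement are all made up for illustration:

```python
# A toy sketch of why a static computation graph makes traffic known up
# front. Layer names, cycle counts, byte sizes, and the round-robin
# placement are all made up for illustration.

layers = [               # (name, compute_cycles, activation_bytes)
    ("embed",  1000, 4096),
    ("block0", 3000, 4096),
    ("block1", 3000, 4096),
    ("head",    500, 2048),
]

def static_transfer_schedule(layers, devices=2):
    """Assign layers round-robin to devices and list every cross-device
    activation transfer with the cycle at which it becomes ready."""
    t, transfers = 0, []
    for i, (name, cycles, out_bytes) in enumerate(layers):
        t += cycles
        src, dst = i % devices, (i + 1) % devices
        if src != dst and i + 1 < len(layers):
            transfers.append((name, src, dst, t, out_bytes))
    return transfers

for name, src, dst, ready_at, nbytes in static_transfer_schedule(layers):
    print(f"{name}: device {src} -> {dst}, {nbytes} B ready at cycle {ready_at}")
```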
Scheduled Data Flow
In a conventional networked system, the flow of packets through the network is managed by the hardware and the hardware optimizes the routes when it senses back-pressure in the network. This reactive adjustment in the data flow increases latency and introduces non-determinism in the flow of data.
To avoid this, a distributed multi-TSP system explicitly schedules the flow of data through the network using its compiler. The compiler routes the data so that there is no back-pressure or congestion build-up in the network at any point in time.
Apart from this, compiler-scheduled data flow also improves network latency: instead of one device requesting a piece of data from another device, the compiler can schedule the second device to proactively push the data to the first device just when it needs to work with it.
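A back-of-the-envelope comparison of the two models, with purely illustrative numbers, shows why the push style helps:

```python
# A back-of-the-envelope comparison (illustrative numbers only) of a
# request/response "pull" versus a compiler-scheduled "push": the push
# removes the request leg entirely because the producer already knows when
# the consumer will need the data.

LINK_LATENCY = 5  # cycles per hop (assumed)

def pull_latency(hops: int) -> int:
    return 2 * hops * LINK_LATENCY   # request travels out, data travels back

def push_latency(hops: int) -> int:
    return hops * LINK_LATENCY       # data is already in flight when needed

print(pull_latency(3), push_latency(3))  # 30 15
```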
Deterministic Load-balancing
Another advantage of scheduling the flow of data at compile time is that it allows the compiler to effectively load-balance the flow across the available links. In a conventional network, the hardware performs the routing decisions on a per-packet basis depending on the congestion metrics available in the router.
However, in the case of a multi-TSP system, the compiler performs this scheduling optimally based on the volume of data (i.e., the product of the tensor's dimensions H x W x C) and selects the number of links to spread the traffic across. This uses the available bandwidth in the system efficiently and reduces the overall latency.
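A small sketch of this kind of compile-time load balancing (my own heuristic, not Groq's actual algorithm): given the tensor volume and an assumed per-link capacity, the compiler can pick the number of links and stripe the transfer evenly.

```python
# A small sketch of compile-time load balancing (my own heuristic, not
# Groq's algorithm): pick how many links to stripe a tensor across based on
# its volume H*W*C and an assumed per-link capacity, then split it evenly.

def plan_links(h: int, w: int, c: int, elems_per_link: int, max_links: int):
    volume = h * w * c
    links = min(max_links, max(1, -(-volume // elems_per_link)))  # ceil division
    per_link = -(-volume // links)
    return links, per_link

# e.g. a 14x14x512 activation striped over links that carry 40,000 elements each:
print(plan_links(14, 14, 512, elems_per_link=40_000, max_links=4))  # (3, 33451)
```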
So far we have discussed what makes the LPU hardware fast; let's conclude by discussing how the hardware handles errors.
Error Correction and System Reliability Strategy
In a conventional networked system, transmission errors are handled at the link layer by simply retransmitting the packets. This strategy is not viable for the multi-TSP system because it introduces non-deterministic delays. Instead, the multi-TSP system handles errors using forward error correction.
In this strategy, the packets are transmitted hop by hop and any errors are corrected in situ. Any critical errors are flagged so that the runtime can replay the inference. The replay strategy handles any transient errors, but persistent errors require manual intervention.
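The runtime-side replay can be pictured with a minimal sketch (assumed behavior, not Groq's implementation): errors that forward error correction cannot fix simply cause the inference to be re-run, and persistent failures are escalated.

```python
# A minimal sketch (assumed runtime behavior, not Groq's implementation) of
# the replay strategy: forward error correction fixes most link errors in
# hardware; if a critical error is still flagged, the runtime re-runs the
# inference, and persistent failures are escalated.

def run_with_replay(run_once, max_replays: int = 3):
    """`run_once` returns (result, critical_error_flagged)."""
    for _ in range(max_replays + 1):
        result, critical_error = run_once()
        if not critical_error:
            return result
    raise RuntimeError("persistent error: manual intervention required")

# Hypothetical usage: an inference that flags one transient error, then succeeds.
attempts = iter([(None, True), ("output tokens", False)])
print(run_with_replay(lambda: next(attempts)))  # output tokens
```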
As a reliability strategy, each rack contains a spare TSP node. The runtime monitors the health of all the nodes in a rack and fails over to the spare node when it detects unrecoverable errors on one of the nodes.
Using software replay of inference and a spare node allows the system to gracefully recover from any critical errors.
Key Takeaways
This concludes our discussion of the architecture of the hardware underlying Groq’s LPU. We covered quite a lot of details about the TSP and how Groq extended it in the form of a massively parallel distributed system. Let’s summarize the key takeaways quickly.
Innovative Design: Groq's LPU, built on the Tensor Streaming Processor (TSP), showcases a departure from traditional CPU/GPU architectures, focusing on deterministic and predictable execution to enhance performance for language model inference tasks.
Deterministic Execution: The TSP architecture achieves determinism via a simplified hardware design, allowing compilers to precisely schedule instructions and data flow, which is critical for predictable performance.
Software-Scheduled Networking: Groq's approach to creating a distributed multi-TSP system eliminates the non-determinism typically introduced by routing and scheduling in conventional networked systems. The compiler plays a crucial role in data movement, ensuring efficiency and predictability.
Massive Parallelism: Through a combination of hardware design and compiler intelligence, the TSP, and by extension the LPU, supports highly parallel operations, making it particularly suited for large language models (LLMs).
System Scalability and Reliability: The physical and logical design of the TSP network enables scalability, supporting a system of over 10,000 TSPs with minimal latency and high reliability, including error correction strategies and failover mechanisms.
Potential Impact: The deterministic nature and scalable design of LPU technology hold the promise to redefine performance expectations for computational tasks, particularly in AI and machine learning, setting a new benchmark for future processor designs.
I hope you found this article insightful. If you have any questions or comments, do let me know.
Recently, we did a live session discussing the design of TSP and LPU and analyzing how they work. Check out the recording here.