
Why SambaNova's SN40L Chip is The Best for Inference

Posted by Mark Gottscho & Raghu Prabhakar on September 10, 2024

The AI hardware landscape is rapidly evolving, with several innovative companies now providing compelling solutions for datacenter inference workloads. At SambaNova, we're proud of our achievements in building high-performance AI hardware and software, but we also acknowledge the impressive work being done by our competitors. In this blog post, we'll compare the end-user inference performance of SambaNova's technology against that of Groq and Cerebras, and we congratulate them on their phenomenal engineering achievements. We'll also explain why SambaNova's solution is superior.

Benchmarking Inference Performance

The following table shows the end-user throughput for each provider on three Llama 3.1 model variants. The results are impressive, with all three custom-accelerator providers demonstrating significantly better performance than any GPU-based offering. Unlike most of the competition, SambaNova achieves strong results without using any quantization techniques, meaning there is no loss of accuracy.

Custom AI accelerators: SambaNova, Cerebras, Groq. GPU providers: Fireworks, OctoAI.

Model                     SambaNova (16-bit)   Cerebras (16-bit) [1]   Groq (8-bit) [2]   Fireworks (8-bit) [3]   OctoAI (8-bit) [4]
Llama 3.1 Instruct 405B   129                  n/a                     n/a                72                      30
Llama 3.1 Instruct 70B    457                  445                     250                71                      49
Llama 3.1 Instruct 8B     1042                 1837                    751                288                     164

Table 1. User throughput in output tokens/sec (higher is better). Performance numbers for all platforms are obtained from benchmarking results published by artificialanalysis.ai.

On the largest 405B model, SambaNova demonstrates world record performance at 129 output tokens/second/user. Neither Groq nor Cerebras currently serve it.

On the 70B model, SambaNova outperforms Cerebras, Groq, and GPU providers. Figure 1 shows benchmark results from artificialanalysis.ai at the time of writing.

Figure 1. Output tokens per second from all providers for Llama 3.1 70B Instruct, from artificialanalysis.ai at the time of writing.

On the smallest 8B model, SambaNova outperforms the best GPU offering by 5.2X. Cerebras is the fastest of all providers on this model, with an impressive 1837 tokens/second.

But these results don’t tell the whole story.

A Closer Look at Hardware Architectures

To achieve these results, Cerebras uses multiple wafers of compute, each comprising dozens of chips. Groq uses a cluster of hundreds of chips. SambaNova uses just 16 chips.

What's striking is that despite their wildly different hardware architectures, all three competitors end up in the same ballpark in terms of end-user performance.

An examination of Cerebras' WSE3 architecture reveals that they achieve their inference performance through pipeline parallelism within and across wafers, as well as by hosting all of the model weights in SRAM. This approach comes with significant scalability challenges. Running Llama 3.1 70B on WSE3 at 445 tokens/sec requires four wafers comprising 336 chips, with a total silicon area of 184,900 mm². This configuration offers a peak compute roofline of 500 fp16 PFLOPS spread across four racks, with one wafer per rack.

Groq's LPU is also an SRAM-only architecture, where each chip has a silicon area of 725 mm². The workload is spread across many networked chips rather than across a few large wafers. Running Llama 3.1 70B on the LPU at 250 tokens/sec requires hundreds of chips, because each LPU has only 230 MiB of on-chip memory. We assume they are using 576 chips for the 70B model; the compute roofline of that configuration is 432 int8 POPs, which fits into nine racks.
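To see why an SRAM-only design needs so many chips, here is a back-of-the-envelope capacity calculation using only the figures above (230 MiB per LPU, an assumed 576-chip deployment, and 8-bit weights). It is a rough sketch, not Groq's actual provisioning.

```python
# Rough capacity check: how many 230 MiB LPUs does an all-SRAM deployment
# of Llama 3.1 70B imply? Figures come from the discussion above; the
# KV-cache/activation headroom is whatever SRAM is left over.

MIB = 1 << 20
GIB = 1 << 30

params = 70e9                    # Llama 3.1 70B parameter count
bytes_per_weight = 1             # int8-quantized weights
weight_bytes = params * bytes_per_weight

lpu_sram = 230 * MIB             # on-chip memory per Groq LPU (from the text)
chips_for_weights = weight_bytes / lpu_sram

print(f"Weights alone: {weight_bytes / GIB:.1f} GiB "
      f"-> at least {chips_for_weights:.0f} chips just to hold them")

# The assumed 576-chip cluster provides:
total_sram = 576 * lpu_sram
print(f"576 chips provide {total_sram / GIB:.1f} GiB of SRAM, leaving "
      f"{(total_sram - weight_bytes) / GIB:.1f} GiB for KV cache, activations, and program state")
```

Even with aggressive quantization, the weights alone dictate hundreds of sockets before any KV cache or activation storage is considered.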

In contrast, SambaNova's current 70B inference configuration uses just 16 chips of SN40L, with a combination of tensor parallelism across chips and pipeline parallelism within each chip. Each SN40L chip consists of two logic dies, HBM, and direct-attached DDR DRAM. The 16 chips are interconnected with a peer-to-peer network. They offer a compute roofline of 10.2 bf16 PFLOPS.

Figure 2 illustrates the huge difference in datacenter footprint among the three offerings needed to serve the same Llama 3.1 70B model with bleeding-edge performance.

Figure 2. Datacenter footprint comparison for Llama 3.1 70B inference.

Despite having 10X more dies, a 49X higher compute roofline, and holding all the weights in SRAM, Cerebras achieves performance on Llama 3.1 70B similar to SambaNova's, with SambaNova slightly ahead. Meanwhile, Groq needs 9X the rack space and 36X the chips, yet still runs 46% slower than SambaNova on 70B.
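These ratios fall straight out of the configuration numbers already given above; the short script below reproduces them (chip and die counts, rooflines, and rack counts are taken from the preceding paragraphs, including the assumed 576-chip Groq cluster).

```python
# Reproduce the comparison ratios from the configurations described above.

# SambaNova: 16 SN40L chips (2 dies each), 10.2 bf16 PFLOPS, 1 rack, 457 tok/s
sn_dies, sn_pflops, sn_racks, sn_tps = 16 * 2, 10.2, 1, 457

# Cerebras: 4 wafers comprising 336 dies, 500 fp16 PFLOPS, 4 racks, 445 tok/s
cs_dies, cs_pflops, cs_racks, cs_tps = 336, 500, 4, 445

# Groq: assumed 576 LPUs, 9 racks, 250 tok/s
gq_chips, gq_racks, gq_tps = 576, 9, 250

print(f"Cerebras vs SambaNova: {cs_dies / sn_dies:.1f}X the dies, "
      f"{cs_pflops / sn_pflops:.0f}X the compute roofline, "
      f"{cs_racks / sn_racks:.0f}X the racks, at {cs_tps} vs {sn_tps} tok/s")
print(f"Groq vs SambaNova: {gq_chips / 16:.0f}X the chips, "
      f"{gq_racks / sn_racks:.0f}X the rack space, "
      f"{1 - gq_tps / sn_tps:.0%} lower 70B throughput")   # ~46% as quoted above, modulo rounding
```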

Figure 3 makes SambaNova’s advantage clear.

Figure 3. Performance per mm² for each model, in output tokens per second.

Our solution is far more compact and better utilizes its compute, with up to 40X better performance/area than Groq and 10X better than Cerebras on Llama 3.1 70B.
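Using only the silicon areas and throughputs given above, one can also back out what those performance-per-area ratios imply about the SN40L's total silicon footprint. This is a derived estimate for illustration, not a published die-area figure.

```python
# Performance per unit silicon area for Llama 3.1 70B, from the figures above.

cerebras_tps, cerebras_area = 445, 184_900      # four WSE3 wafers, total mm^2
groq_tps, groq_area = 250, 576 * 725            # assumed 576 LPUs at 725 mm^2 each
sambanova_tps = 457

cerebras_ppa = cerebras_tps / cerebras_area      # tokens/sec per mm^2
groq_ppa = groq_tps / groq_area
print(f"Cerebras: {cerebras_ppa:.2e} tok/s/mm^2, Groq: {groq_ppa:.2e} tok/s/mm^2")

# Back-solve the total SN40L silicon area implied by the ~10X (vs Cerebras)
# and ~40X (vs Groq) perf/area advantages cited above.
implied_area_vs_cerebras = sambanova_tps / (10 * cerebras_ppa)
implied_area_vs_groq = sambanova_tps / (40 * groq_ppa)
print(f"Implied total SN40L silicon across 16 chips: "
      f"~{implied_area_vs_cerebras:,.0f} mm^2 (from the 10X claim) or "
      f"~{implied_area_vs_groq:,.0f} mm^2 (from the 40X claim)")
```

The two back-solved estimates agree closely, which serves as an internal consistency check on the perf/area comparison in Figure 3.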

Cerebras and Groq do not currently offer any 405B solution. And because both host their weights entirely in SRAM, we believe their footprints would have to grow even larger than they are for 70B.

SambaNova is serving 405B today on just 16 chips, at over 100 tokens/sec with full accuracy using bf16/fp32 mixed precision.

The SN40L significantly enhances energy efficiency by requiring only 16 chips in a single rack, eliminating the need for a large number of chips and their associated system-component power consumption. The SN40L rack uses a standard 19-inch form factor with air cooling and delivers full utilization of its 10.2 PFLOPS of compute, making it an ideal solution for organizations seeking a cost-effective, high-performance inference solution for their data centers.

SambaNova’s Hardware Differentiation

How do we achieve such high performance in the smallest footprint? Enter the SN40L, our latest generation of Reconfigurable Dataflow Unit (RDU). Models are lowered to a dataflow architecture consisting of distributed compute and memory units to execute tensor operations. Figure 4 shows the SN40L chip.

Figure 4. A single SN40L RDU. It contains HBM and two reconfigurable dataflow dies. The SN40L also has I/O for DDR channels, peer-to-peer network links, and a host PCIe link.

The RDU's flexibility allows us to map each model for optimal performance using nested combinations of data, tensor, and pipeline parallelism, both within each chip and across chips.

Crucially, on-chip pipeline parallelism enables higher operational intensity for activations – the key to overcoming the memory wall. And unlike GPUs, we don’t need to write custom kernels to achieve high performance – our optimizing compiler automatically fuses operations into large kernels and maps them to the hardware.
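To make the operational-intensity point concrete, here is a simplified comparison of off-chip activation traffic for one decoder layer during decode, contrasting an unfused execution (each op writes its output to off-chip memory and the next op reads it back) with a fused pipeline that keeps intermediates on-chip. The layer dimensions are the public Llama 3.1 70B shapes; the batch size and op granularity are assumptions for illustration, and weight/KV-cache traffic (the same in both cases) is excluded.

```python
# Rough off-chip activation traffic for one Llama 3.1 70B decoder layer at
# decode time, with B concurrent users generating one token each.

B = 32                             # assumed concurrent decode batch
H, KV, FFN = 8192, 1024, 28672     # hidden size, K/V projection width (GQA), MLP width
BYTES = 2                          # bf16 activations

# Intermediate activations (values per token) produced between ops if every
# op runs as a separate kernel:
intermediates = [
    H,              # pre-attention RMSNorm output
    H + 2 * KV,     # fused QKV projection output (GQA: narrow K and V)
    H,              # attention context output (before o_proj)
    H,              # o_proj output / residual add
    H,              # pre-MLP RMSNorm output
    2 * FFN,        # gate and up projections
    FFN,            # SwiGLU output (before down projection)
]

unfused_bytes = 2 * sum(intermediates) * B * BYTES   # each intermediate written, then re-read
fused_bytes = 2 * H * B * BYTES                      # only the layer input and output cross the chip boundary

print(f"Off-chip activation traffic per decoder layer: "
      f"~{unfused_bytes / 1e6:.1f} MB unfused vs ~{fused_bytes / 1e6:.2f} MB fused "
      f"({unfused_bytes / fused_bytes:.0f}X reduction)")
```

Keeping intermediate activations on-chip means far more arithmetic is performed per byte moved off-chip, which is exactly the operational-intensity improvement described above.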

We can fuse an entire Llama 3.1 8B decoder into a single dataflow kernel and then call it in a loop without launch overheads.
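Conceptually, generation then looks like the sketch below: compile the decoder once into a single fused graph, then invoke it once per output token with no per-operator launch overhead. The `compile_fused_decoder`, `prefill`, and `decode_step` names are hypothetical placeholders for illustration, not SambaNova's actual API.

```python
# Conceptual sketch only: what "one fused decoder kernel called in a loop"
# means for token generation. compile_fused_decoder() stands in for the
# compiler step described above and is NOT a real SambaNova API.

def generate(prompt_ids, model_weights, max_new_tokens, eos_id):
    # One-time compilation: the whole Llama 3.1 8B decoder (attention, MLP,
    # normalization, sampling) becomes a single spatial dataflow kernel.
    fused_decoder = compile_fused_decoder(model_weights)   # hypothetical

    kv_cache = fused_decoder.prefill(prompt_ids)            # hypothetical prefill entry point
    token = prompt_ids[-1]
    output = []

    for _ in range(max_new_tokens):
        # One kernel invocation per token: no per-operator launches and no
        # host round-trips between attention, MLP, and sampling.
        token, kv_cache = fused_decoder.decode_step(token, kv_cache)
        output.append(token)
        if token == eos_id:
            break
    return output
```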

The RDU is also the only AI accelerator with a tightly coupled three-tier memory system comprising SRAM, HBM, and DDR DRAM. This unique memory system offers several advantages (see the model-swap sketch after the list):

  • DDR enables large capacity for hosting hundreds of heterogeneous models and/or checkpoints on a single socket
    • Enables trillion-parameter Composition of Experts and other agentic AI workloads
    • Quickly swap between models without being bottlenecked by host PCIe bandwidth
  • HBM holds the currently running model and caches others
  • Large distributed on-die SRAM enables high operational intensity through spatial kernel fusion and bank-level parallelism
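A key practical benefit is the swap path: a checkpoint already resident in direct-attached DDR can be staged into HBM without being re-fetched from the host over PCIe. The bandwidth figures below are illustrative round-number assumptions, not SN40L specifications, chosen only to show why the DDR tier removes the host-link bottleneck.

```python
# Illustrative model-swap timing: staging a checkpoint into HBM from
# direct-attached DDR vs. pulling it from the host over PCIe.
# Bandwidths are assumed round numbers, NOT SN40L specifications.

GB = 1e9

model_bytes = 140 * GB        # e.g. a 70B model in bf16 (70e9 params * 2 bytes)
pcie_bw = 64 * GB             # assumed host link, roughly PCIe Gen5 x16
ddr_bw = 200 * GB             # assumed aggregate direct-attached DDR bandwidth

print(f"Swap over host PCIe: ~{model_bytes / pcie_bw:.1f} s")
print(f"Swap from local DDR: ~{model_bytes / ddr_bw:.1f} s")
```

Because DDR can hold many checkpoints on the socket, swaps also avoid contending with other traffic on the host link entirely, which matters when rotating among the hundreds of models mentioned above.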

The RDU is not just an inference machine – it’s also designed for pre-training and fine-tuning. It has native hardware for bf16 and fp32 compute, with advanced compiler support for automatic mixed precision. We deliver competitive performance in a small footprint without any loss of accuracy.

Try our blazing fast inference performance for yourself for free at cloud.sambanova.ai.


[1] Cerebras announced that they use 16-bit weights for inference.

[2] We believe Groq is using int8 quantization, based on their blog post about running Llama 2 70B on 576 sockets with int8.

[3] Fireworks has documented their approach to quantization in their blog and public announcements.

[4] OctoAI reports that their inference stack uses fp8 on Nvidia H100s.


Mark Gottscho & Raghu Prabhakar