Balancing speed (measured in tokens/second per user) and throughput (total tokens/second of an AI server) is one of the many challenges enterprises face when deploying AI agents in production in a cost-efficient, scalable manner. While GPUs enabled the first wave of AI, they end up hitting the "Agentic Wall": GPUs cannot sustain the per-request token speeds that complex reasoning loops require for near real-time agentic use cases, especially on larger models like DeepSeek.
While general chat might feel "fast" at 20 t/s, just above human reading speed, AI agents require much higher velocities. This is because agents often operate in "test-time compute" paradigms, involving reasoning chains, tool-use loops, and multi-step reflection before generating an answer. To meet enterprise usability requirements, infrastructure must deliver sustained speeds of at least 200 tokens per second (t/s) per request on larger models like DeepSeek for agentic planning. Every AI deployment therefore faces a key infrastructure decision: balancing the number of requests served from the AI hardware against the per-request speed agents require.
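To make that requirement concrete, consider a hypothetical agent that chains several LLM calls per task. The step count and token budget below are illustrative assumptions, not measured numbers, but they show why per-request speed dominates end-to-end agent latency:

```python
# Back-of-the-envelope agent latency (illustrative assumptions only).
# An agent that chains N LLM calls (reasoning, tool selection, reflection)
# waits on generation at every step, so latency scales with tokens / speed.

STEPS = 8              # hypothetical reasoning/tool-use steps per task
TOKENS_PER_STEP = 600  # hypothetical tokens generated at each step

def agent_latency_seconds(tokens_per_second: float) -> float:
    """End-to-end generation time for one agent task (ignores network and tool time)."""
    return STEPS * TOKENS_PER_STEP / tokens_per_second

for speed in (20, 100, 250):
    print(f"{speed:>4} t/s per request -> {agent_latency_seconds(speed):6.1f} s per task")

# Output:
#   20 t/s per request ->  240.0 s per task  (4 minutes: unusable in near real time)
#  100 t/s per request ->   48.0 s per task
#  250 t/s per request ->   19.2 s per task  (approaches interactive responsiveness)
```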
Architecture Designed for Agentic AI
Today, independent benchmarking by Artificial Analysis confirms that SambaRack, powered by our Reconfigurable Dataflow Unit (RDU) chip, is uniquely equipped for the agentic era. Running the DeepSeek model, SambaNova delivers per-user speeds exceeding 250 tokens per second, as we highlighted earlier this year.
More importantly, the same system can be configured for high throughput, supporting over 4,500 tokens per second with 256 concurrent users, a nearly 30% increase in throughput over H200 GPUs. Higher throughput means enterprises generate much cheaper tokens when serving many users. With SambaStack, enterprises and service providers have the flexibility to configure each dedicated SambaRack for either a high-speed setting, optimized for agentic AI, or a high-throughput setting that supports a wide range of users. This control is critical to ensuring consistent performance and cost for agents, unlike typical cloud inference services, whose speeds vary.
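A quick sketch makes the speed/throughput tradeoff explicit. The 4,500 t/s at 256 concurrent users figure comes from the benchmark above; the high-speed row is a hypothetical configuration for illustration:

```python
# Speed vs. throughput: the same hardware can be pointed at either goal.
# Per-user speed is simply aggregate throughput divided by concurrency.

configs = {
    "high-throughput": {"aggregate_tps": 4500, "concurrent_users": 256},  # benchmarked
    "high-speed":      {"aggregate_tps": 1000, "concurrent_users": 4},    # hypothetical
}

for name, c in configs.items():
    per_user = c["aggregate_tps"] / c["concurrent_users"]
    print(f"{name:>15}: {c['aggregate_tps']:>5} t/s total, "
          f"{c['concurrent_users']:>3} users, {per_user:6.1f} t/s per user")

# high-throughput:  4500 t/s total, 256 users,   17.6 t/s per user
#      high-speed:  1000 t/s total,   4 users,  250.0 t/s per user
```

At high concurrency, each user sees well under agentic speeds; at low concurrency, each request gets agent-class velocity but the system serves far fewer users. Being able to choose the operating point per rack is the control described above.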

The technical foundation of this performance lies in the fourth generation of our Reconfigurable Dataflow Unit (RDU), the SN40L. Unlike the memory hierarchies found in GPUs, which were originally designed for parallel graphics and AI training, the RDU is purpose-built for the high-velocity demands of generative and agentic inference. The RDU features a unique three-tiered memory design and a dataflow execution model that minimize data movement within the hardware.
More importantly, this architecture allows the SambaRack to hot-swap between models in milliseconds and to manage many models on the same infrastructure, such as serving all the latest variants of the 671B-parameter DeepSeek model from the same hardware, without the latency spikes common in shared cloud queues and GPU clusters. Enterprises can therefore power their AI agents with less hardware, configuring each system with the right balance of throughput and speed to reduce total cost of ownership while still meeting their agents' latency requirements. Read more about how RDUs on SambaStack compare against vLLM on model switching times.
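One rough way to observe switching overhead from the client side is to alternate requests between two hosted model variants and compare time-to-first-token (TTFT). The sketch below assumes an OpenAI-compatible endpoint and uses an illustrative base URL and model names; substitute the values for your own deployment:

```python
import time
from openai import OpenAI  # pip install openai

# Assumes an OpenAI-compatible endpoint; base_url, api_key, and model
# names are illustrative placeholders for your deployment.
client = OpenAI(base_url="https://api.sambanova.ai/v1", api_key="YOUR_KEY")

MODELS = ["DeepSeek-R1", "DeepSeek-V3-0324"]  # hypothetical variant names

def time_to_first_token(model: str) -> float:
    """Seconds until the first streamed chunk arrives for one request."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Say hi."}],
        stream=True,
        max_tokens=8,
    )
    for _ in stream:  # first chunk ends the measurement
        return time.perf_counter() - start
    return float("nan")

# Alternating models forces a switch on every request; on hardware that
# hot-swaps in milliseconds, TTFT should stay flat across the loop.
for i in range(6):
    model = MODELS[i % 2]
    print(f"{model:>20}: TTFT {time_to_first_token(model):.3f} s")
```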
The Shift from GPUs to Inference-Specific Hardware
From coding to customer service to financial trading and beyond, real-time AI agents are set to become ubiquitous. These agents will require an architecture that better balances speed and throughput, which is where inference-first AI hardware comes in. As Artificial Analysis highlighted, SambaNova delivers not only the fastest speed for DeepSeek but also high throughput to support many requests simultaneously. With SambaNova’s SambaStack, enterprises can fully configure the system to find the right balance for their AI agents and scale their solutions to production. 2026 will be the year enterprises can embrace agents with far better dedicated infrastructure.
Ready to experience the speed of SambaNova RDUs? Try running open-source models like DeepSeek on SambaCloud to see it for yourself today.
