The First Disaggregated Inference Demo for AI Agents Is Live

By SambaNova

June 3, 2026

TL;DR

SambaNova demonstrated live disaggregated inference at COMPUTEX using Nvidia B200 for prefill and SN40 RDU for decode.
Inference speed is 2x that of B200-only configurations, as verified by Artificial Analysis.
The architecture is live at Vector Core Compute's (VC2) data center.
Together.ai is the first commercial customer.
SN50, targeting 10x throughput at 500 tokens per second per user on MiniMax M2.7, is expected in the second half of 2026.

SambaNova demonstrates how GPUs and RDUs work together to deliver premium inference for agent workloads using the right chip for the right workload.

At COMPUTEX, SambaNova demonstrated what the next era of AI inference looks like: Premium inference for AI agents powered by GPUs and RDUs, running live for the first time in the newly announced VC2 data center.

Using Nvidia’s B200 GPU for prefill and SambaNova’s SN40 RDU for decode, the system delivers 2x the inference speed of B200-only configurations, as verified by Artificial Analysis.

This is running today out of Vector Core Compute's (VC2) data center, with Together.ai as the first commercial customer to use the inference capabilities from VC2.

This is a new operating model for inference providers. Coding agents are moving from novelty to daily developer workflow. OpenAI is talking about long-horizon Codex runs that can plan, edit, test, repair, and keep going for hours. Anthropic and other frontier AI companies are seeing the same demand curve: Developers want agents that can take on bigger chunks of real work, faster.

That demand creates a new infrastructure challenge. Inference providers need to serve intelligent, large models at interactive speed. They need capacity for the agent workloads coming next. And they need cost-to-serve that lets growth expand margins instead of compressing them.

That is premium inference now: Faster large-model experiences, more users served per unit of compute, and economics built for scale.

The AI Agent Inference Factory: GPUs for Prefill, RDUs for Decode

SambaNova and Intel introduced this blueprint in April: CPUs, GPUs, and RDUs working together so each part of the AI agent loop runs where it performs best. Today at COMPUTEX, SambaNova is turning that blueprint into a live demo of premium inference for AI agents.

Think of it as an AI factory.

Prefill is where the raw material enters: the prompt, the codebase, the files, the context. It is compute-heavy and highly parallel, the kind of work GPUs are well suited to today. Decode is the production line: tokens coming out one after another. If that line slows down, the entire factory backs up. Large-model decode depends on memory bandwidth, chip-to-chip communication, and the ability to keep the line moving at scale. SRAM helps, but SRAM alone is not enough if the architecture cannot scale to large models and long contexts. This is where SambaNova RDUs shine.

In short, don’t pile more machines into the wrong part of the factory. Design the line correctly. Put the right chip on the right workload.

Premium inference starts by separating prefill and decode, then running each phase on the chip best suited for that work.
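To make the split concrete, here is a minimal Python sketch of the handoff, under stated assumptions. It is a toy, not SambaNova's implementation: the names PrefillWorker, DecodeWorker, KVCache, and serve are hypothetical, the "model" just echoes placeholder tokens, and no real hardware is involved. It only shows the shape of the flow: one pool processes the whole prompt in a single pass, then passes its state to a second pool that emits tokens one at a time.

```python
from dataclasses import dataclass


@dataclass
class KVCache:
    """Stand-in for the attention state that prefill hands to decode."""
    tokens: list


@dataclass
class PrefillWorker:
    """Hypothetical prefill pool: handles the whole prompt in one parallel pass."""
    name: str = "prefill-pool"

    def run(self, prompt: str) -> KVCache:
        # A real system would compute attention keys/values for every prompt token here.
        return KVCache(tokens=prompt.split())


@dataclass
class DecodeWorker:
    """Hypothetical decode pool: emits output tokens one at a time."""
    name: str = "decode-pool"

    def run(self, cache: KVCache, max_new_tokens: int) -> list:
        output = []
        for step in range(max_new_tokens):
            # Each step depends on every earlier token, so this loop is sequential.
            token = f"tok{step}"  # placeholder token, not a model prediction
            cache.tokens.append(token)
            output.append(token)
        return output


def serve(prompt: str, max_new_tokens: int):
    """Route each phase to its own pool and record where it ran."""
    prefill, decode = PrefillWorker(), DecodeWorker()
    trace = []
    cache = prefill.run(prompt)
    trace.append(f"prefill on {prefill.name}: {len(cache.tokens)} prompt tokens")
    output = decode.run(cache, max_new_tokens)
    trace.append(f"decode on {decode.name}: {len(output)} new tokens")
    return output, trace


output, trace = serve("fix the failing unit test", max_new_tokens=3)
for line in trace:
    print(line)
```

In a production system the expensive part is moving the cached state between the two pools, which this sketch reduces to passing a Python object.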

Agents Need to Be Faster at Long Context Lengths

Coding agents are already showing where the market is headed. OpenAI reported an agent running for about 25 hours, using about 13M tokens, and producing about 30K lines of code. That is a glimpse of what is coming. Agents will run longer. They will touch more tools. They will consume more tokens. They will become part of everyday software development.

Artificial Analysis’ Coding Agent Benchmarks show that coding-agent tasks can consume substantial token volumes, including large input-token footprints across benchmark attempts. This is why agents need a prefill-optimized chip, which is where GPUs excel. Over multiple agentic turns, decode increasingly becomes the bottleneck, and that is the phase RDUs are built to run quickly and efficiently. Prompt caching significantly speeds up these workflows, but fast prefill remains critical to delivering a solution that is both premium and efficient.

That is also why simply adding more GPU scale does not solve this class of agent workflow. When decode dominates the experience, the fastest path forward is moving decode to hardware designed for it.
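A back-of-the-envelope model shows why decode can dominate a turn. The Python sketch below is illustrative only: the function name turn_latency_s is hypothetical, and every number is a placeholder chosen for round arithmetic, not a measurement of any GPU or RDU. It assumes prefill time scales with uncached input tokens and decode time scales with output tokens.

```python
def turn_latency_s(input_tokens, output_tokens, prefill_tps, decode_tps,
                   cached_fraction=0.0):
    """Return (prefill_seconds, decode_seconds) for one agent turn.

    prefill_tps and decode_tps are tokens per second for each phase;
    cached_fraction is the share of input tokens served from a prompt cache.
    """
    uncached_tokens = input_tokens * (1.0 - cached_fraction)
    return uncached_tokens / prefill_tps, output_tokens / decode_tps


# Placeholder workload and rates (assumptions, not benchmark results).
prefill_s, decode_s = turn_latency_s(
    input_tokens=40_000, output_tokens=2_000,
    prefill_tps=20_000, decode_tps=100,
    cached_fraction=0.75,
)
decode_share = decode_s / (prefill_s + decode_s)

# Same turn with the decode rate doubled and prefill left unchanged.
_, faster_decode_s = turn_latency_s(
    input_tokens=40_000, output_tokens=2_000,
    prefill_tps=20_000, decode_tps=200,
    cached_fraction=0.75,
)

print(f"prefill {prefill_s:.1f}s, decode {decode_s:.1f}s, decode share {decode_share:.0%}")
print(f"turn time with 2x decode: {prefill_s + faster_decode_s:.1f}s")
```

With these placeholder inputs, nearly all of the turn is spent in decode, so speeding up decode shortens the turn far more than adding prefill capacity would.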

That is the right chip for the right workload.

Scale the Experience, Not the Cost

For inference providers, this is the business story: Premium inference has to scale without running out of compute.

As agents become a daily workflow, usage grows fast. More users launch more agents. Those agents run longer. They consume more tokens. Demand spikes become normal. The winners will be the providers that can keep the experience fast while serving more users from the same infrastructure footprint.

That is where disaggregated inference with our upcoming SN50 changes the equation for inference providers. B300 handles prefill while SN50 handles decode, targeting 500 tokens per second per user on MiniMax M2.7 and 10x throughput across the system. The takeaway is simple: Faster experiences and lower cost-to-serve can reinforce each other.

Figure: B300 + SN50 inference margins. At 500 tokens/sec/user, B300 + SN50 lowers output cost-to-serve versus GPU-only decode, improving provider margin at the same API price.
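The margin argument comes down to simple arithmetic, sketched below in Python. The helper names are hypothetical, and the hourly costs, throughputs, and API price are invented placeholders picked for round numbers. They are not SambaNova, Nvidia, or customer figures. The only point is the relationship: at a fixed price, higher throughput per dollar of infrastructure lowers cost per token and raises margin.

```python
def cost_per_million_tokens(hourly_cost_usd, tokens_per_second):
    """Infrastructure cost to produce one million output tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd * 1_000_000 / tokens_per_hour


def gross_margin(price_per_million_usd, cost_per_million_usd):
    """Fraction of the API price left after infrastructure cost."""
    return (price_per_million_usd - cost_per_million_usd) / price_per_million_usd


API_PRICE = 20.0  # placeholder price per million output tokens (assumption)

# Placeholder baseline: a system costing $36/hour that sustains 1,000 tokens/sec.
baseline_cost = cost_per_million_tokens(hourly_cost_usd=36, tokens_per_second=1_000)

# Placeholder alternative: twice the hourly cost, ten times the throughput.
alternative_cost = cost_per_million_tokens(hourly_cost_usd=72, tokens_per_second=10_000)

print(f"baseline: ${baseline_cost:.2f}/M tokens, "
      f"margin {gross_margin(API_PRICE, baseline_cost):.0%}")
print(f"alternative: ${alternative_cost:.2f}/M tokens, "
      f"margin {gross_margin(API_PRICE, alternative_cost):.0%}")
```

Swapping in real hardware costs and measured throughput would turn this into an actual comparison. With the placeholders above it only illustrates the direction of the effect.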

Premium Inference Moving from Blueprint to Proof

SambaNova is delivering the disaggregated architecture to make that possible: GPUs for prefill, SambaNova RDUs for decode, and CPUs for orchestration. VC2 is the first customer environment demonstrating the architecture in practice. Together.ai will be the first to use this architecture commercially. SN40 demonstrates it today. SN50 raises the bar for speed and economics in the second half of 2026.

If you are building, deploying, or scaling AI agents and want premium inference without premium cost pressure, let’s talk. Our team can help you map the right chip to the right workload — and build an inference architecture designed for speed, scale, and margin.

FAQs

What is disaggregated inference?

Disaggregated inference splits the two stages of AI inference, prefill and decode, across different hardware. Prefill handles the input (prompt, context, files) and runs on GPUs. Decode handles sequential token output and runs on SambaNova RDUs. Each stage runs on the chip best suited to it.

Why isn't adding more GPUs enough for agent workloads?

Agent workloads consume large token volumes across multiple turns. Over time, decode becomes the bottleneck. Simply adding more GPUs does not solve this because GPUs are not optimized for the memory bandwidth and chip-to-chip communication that large-model decode requires at scale.

How much faster is the B200 and SN40 configuration?

Using Nvidia B200 for prefill and SambaNova SN40 for decode, the configuration delivers 2x the inference speed of B200-only setups. This has been verified by Artificial Analysis.

What is SN50?

SN50 is SambaNova's upcoming RDU. In a disaggregated setup with B300 handling prefill, SN50 handles decode and targets 500 tokens per second per user on MiniMax M2.7, generating 10x throughput across the system. It is expected in the second half of 2026.
