With the release of support for DeepSeek R1 671B running at 250 tokens per second per user, SambaNova continues to be the only provider offering high-speed inference on the largest open source models. GPU-based providers either serve these large models without comparable speed, or rely on model quantization, which shrinks the model and trades accuracy for performance. Others simply don't offer these large models at all. Only SambaNova offers the full versions of the latest very large open source models, including DeepSeek R1 671B, Llama 3 405B, and DeepSeek V3-0324, all with extremely high inference performance.
For example, Artificial Analysis, an independent AI benchmarking firm, found that SambaNova delivers an average of 250 tokens/second/user on DeepSeek R1 671B, compared to an average of only 19 tokens/second/user across GPU-based providers. On average, SambaNova is delivering inference performance more than 10X faster than systems powered by GPUs. Access to these models matters for several reasons. Very large models offer accuracy that cannot be achieved with smaller models: DeepSeek R1 671B, for example, has been shown to be competitive with the latest models from OpenAI on some benchmarks. Llama 3 405B and DeepSeek V3-0324 also offer exceptional accuracy and are fully open source.
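The comparison above can be checked with simple arithmetic, using the two benchmark figures already cited in the text:

```python
# Back-of-the-envelope check of the throughput comparison above.
# Both figures come from the Artificial Analysis benchmark cited in the text.
sambanova_tps = 250   # tokens/second/user on DeepSeek R1 671B
gpu_avg_tps = 19      # average across GPU-based providers

speedup = sambanova_tps / gpu_avg_tps
print(f"Speedup: {speedup:.1f}x")  # ~13.2x, i.e. "more than 10X"

# Per-token latency, the figure that matters most for agentic workflows:
print(f"SambaNova: {1000 / sambanova_tps:.1f} ms/token")  # 4.0 ms
print(f"GPU avg:   {1000 / gpu_avg_tps:.1f} ms/token")    # ~52.6 ms
```

For a multi-step agentic workflow generating thousands of tokens per step, that per-token gap compounds into the latency difference the quote below describes.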
According to Artificial Analysis:
"Artificial Analysis has independently benchmarked SambaNova's new DeepSeek R1 endpoint at an impressive 255 output tokens per second, the fastest amongst providers we benchmark. DeepSeek R1 is a popular model used across diverse AI applications. SambaNova's leading speed will support developers in harnessing DeepSeek R1's capabilities in latency-sensitive environments, including multi-step agentic workflows, such as coding agents, that were previously constrained by latency limitations."

George Cameron, Chief Product Officer, Artificial Analysis
The value of these large models lies in the accuracy they provide. When vendors use quantization to improve performance, they are by definition reducing the size of the model, which impacts accuracy. For example, at GTC 2025, Nvidia announced performance of over 200 tokens/second on its Blackwell platform. But the model it measured was an FP4 version of DeepSeek, meaning the model's memory footprint was cut in half, and the result required significantly more GPUs. In other words, Nvidia shrank the model, added significantly more very expensive compute resources, and claimed to be the fastest. Compare this to SambaNova, which runs the full version of DeepSeek R1 671B on a single system at 250 tokens/second.
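The "cut the model in half" point follows from simple weight-memory math. This sketch counts weight storage only (ignoring KV cache and activations), assuming the roughly 671B-parameter count and the model's released FP8 precision:

```python
# Rough weight-memory math behind the quantization point above.
# DeepSeek R1 has ~671B parameters and was released in FP8 (1 byte/param),
# so the weights alone take roughly 671 GB. Quantizing to FP4 halves that.
# Weights only; KV cache, activations, and runtime overhead are ignored.
PARAMS = 671e9

def weight_gb(params: float, bits_per_param: int) -> float:
    """Approximate weight memory in gigabytes at a given precision."""
    return params * bits_per_param / 8 / 1e9

print(f"FP8 weights: ~{weight_gb(PARAMS, 8):.0f} GB")  # ~671 GB
print(f"FP4 weights: ~{weight_gb(PARAMS, 4):.0f} GB")  # ~336 GB
```

Halving the stored precision halves the footprint, which is exactly why an FP4 result is not comparable to serving the full-precision model.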
The use of open source models offers significant benefits to customers. Open source models publicly provide their model weights, and often details of how they were trained, giving users a level of explainability that is not possible with closed models. This is critical for those in regulated industries, public companies, and anyone who needs to understand why a model gave a particular response to a prompt. Other benefits of open source models include the ability to customize and own the model, data control and security, freedom from vendor lock-in, and more.
The reason SambaNova can offer these models at high performance is its purpose-built architecture. The SambaNova platform is powered by our fourth-generation processor, the SN40L, which enables SambaNova to dramatically outperform both GPUs and other types of AI accelerators in two ways.
First, the SN40L was designed around a dataflow architecture. GPUs, which were originally designed to accelerate graphics processing rather than AI, use a legacy architecture that forces multiple redundant trips to memory. The dataflow architecture eliminates these inherent inefficiencies for AI inference, which is one reason the SambaNova platform delivers such high performance. Read more on the advantages of the SambaNova SN40L.
The second advantage the SN40L brings is a three-tiered memory architecture: very large memory (DRAM), high-bandwidth memory (HBM), and very fast on-chip memory (SRAM). This combination of memory tiers fundamentally differs from both GPU and other AI accelerator architectures.
By incorporating a very large DRAM tier, a single SambaNova system can hold terabytes of model weights. This enables SambaNova both to run very large models and to run many models simultaneously, all on a single system. Systems without this memory architecture either cannot run large models or must cluster multiple systems together to do so, at the cost of a large, inefficient, and potentially cost-prohibitive footprint.
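The capacity point can be made concrete with rough numbers. The model footprints below are approximate weight sizes at each model's released precision, and the DRAM tier size is a hypothetical round figure chosen for illustration, not a published SambaNova spec:

```python
# Illustrative sketch of why a terabyte-scale memory tier matters.
# Footprints are approximate weight sizes at released precision:
#   FP8  -> 1 byte/param, BF16 -> 2 bytes/param.
# The tier size below is hypothetical, for illustration only.
models_gb = {
    "DeepSeek R1 671B (FP8)": 671,
    "DeepSeek V3-0324 (FP8)": 671,
    "Llama 3 405B (BF16)": 810,
}
dram_tier_gb = 1536  # hypothetical multi-terabyte-class DRAM tier

for name, gb in models_gb.items():
    verdict = "fits" if gb <= dram_tier_gb else "needs sharding"
    print(f"{name}: ~{gb} GB -> {verdict} in a {dram_tier_gb} GB tier")
```

At HBM-only capacities of tens of gigabytes per accelerator, none of these models fit on a single device, which is why GPU deployments must shard them across many chips.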
The unique architecture of the SambaNova SN40L enables it to run the latest and largest models quickly and efficiently. The use of large open source models means that users can always have the latest models, the newest capabilities, and the highest accuracy. Only SambaNova delivers both the largest open source models and the highest performance.