General Compute x SambaNova: World’s Fastest Inference Cloud

General Compute is building the world's fastest inference cloud — designed from the ground up specifically for AI agents.

Unlike traditional systems that rely on GPUs originally designed for computer graphics, General Compute is betting on purpose-built AI accelerators to unlock the next stage of market evolution. The company believes that the future of AI lies in hardware designed specifically for the unique demands of AI workloads.

Learn more about General Compute’s launch on Product Hunt.

What they do:

General Compute has launched a high-performance inference cloud built specifically for AI agents, targeting the most demanding use cases, including AI coding and voice agents. Their platform delivers:

5X faster response times compared to traditional systems.
Higher per-user throughput for latency-sensitive workloads.
Seamless integration with an OpenAI-compatible API, allowing customers to simply swap their base URL and maintain existing workflows.

This purpose-built infrastructure enables customers to run real-time AI workloads faster and more efficiently than ever before.

Challenge:

Initially, General Compute assumed the primary bottleneck in delivering premium inference performance would be energy availability. But ultimately the real constraint was more specifically access to liquid-cooled data center infrastructure at scale.

Many high-performance AI systems require liquid cooling and facilities with that technology simply don't exist today at the required scale, making broad deployment slow, costly, and impractical.

The infrastructure gap posed significant hurdles:

Performance is existential. As AI agent adoption grows, inference demand is skyrocketing. Faster performance directly translates to competitive advantage for both General Compute and its customers.
Latency is critical. For AI coding, every second of end-to-end (E2E) latency reduces developer productivity. For voice agents, time to first token (TTFT) must be near instantaneous – even a three-second delay can disrupt the user experience
Converging use cases. General Compute envisions a future where users interact with agents verbally, while those agents dynamically write code, reason, and execute tasks in real time. Running these workloads on hardware not purpose-built for AI leads to inefficiencies that undermine their mission.

Solution:

After evaluating the full landscape of silicon providers, General Compute found that running other systems at scale and in production was neither practical nor profitable. SambaNova was the only platform that delivered the speed they needed alongside competitive unit economics. Here’s why SambaNova stood out:

Unmatched Performance: SambaNova is the performance leader for models such as MiniMax 2.7, which is ideal for coding use cases. This allows General Compute to deliver faster inference and lower costs to their customers.
Scalable Deployment: Unlike liquid-cooled systems that require costly new facilities, SambaStack is air-cooled and can be deployed in existing data centers. This eliminates the need for new construction, dramatically accelerating time to production and reducing capital expenditure.
Future-Proof Technology: General Compute launched their service with SambaStack SN40, which provides the low latency required for their customers’ needs. As they transition to the SambaStack SN50, they will gain the high throughput to serve customers at scale more efficiently.
Competitive Unit Economics: SambaNova’s platform enables General Compute to achieve profitability while delivering premium performance – a critical factor in their decision-making process.

The Results:

By partnering with SambaNova, General Compute has successfully overcome the infrastructure and performance challenges that threatened to slow their growth. The collaboration has enabled them to:

Deliver 5X faster responses for latency-sensitive workloads.
Scale their operations without the need for costly, liquid-cooled infrastructure.
Provide their customers with a seamless, OpenAI-compatible experience that enhances productivity and user satisfaction.

*https://artificialanalysis.ai/models/minimax-m2-7/providers

Challenge:

Hume specializes in building the most realistic voice AI models for developers and enterprises. These models are based on LLMs, so they understand both language and a person’s voice at the same time. Their mission is to bring empathy to AI and to align AI with human well-being. To that end, the speech-LLMs they develop are capable of understanding both the tone and meaning of the spoken word. Applications for this include audio chatbots, customer service, and more.

They recently launched the highest quality speech-LLMs for text-to-speech (Octave) and speech-to-speech (EVI 3). Much of the quality comes from the models’ ability to understand language and to adjust its tone of voice naturally in response to the input. This enables a more natural conversation, which can improve user perception.

Most voice systems today have separate text-to-speech, speech-to-text, transcription, and other models connected together because they were better at each individual task, but with the latest advances in speech-language models this is no longer the case. Moreover, each of these steps adds latency to the process. Conversational human latency is 200 ms and anything longer than 1 second will sound less human. Hume AI and SambaNova have worked together to develop a solution that delivers the highest performance at the lowest latency possible.

Solution:

Hume and SambaNova have worked together to deploy Hume’s speech-language models on SambaCloud, enabling the best speech-to-speech and text-to-speech models in the world to run at conversational latency without any reduction in quality. Together, Hume AI and SambaNova provide enterprises with access to text-to-speech and speech-to-speech APIs with response times on the order of 100 ms to 300 ms, marrying hyperrealistic quality with human-like conversation latency.

For many enterprises, it is critical to deploy in private environments. Hume and SambaNova are providing Hume’s text-to-speech and speech-to-speech models through private deployments to meet these needs.

FAQs

General Compute runs its inference cloud on SambaNova's SambaStack, powered by the SN40 RDU. They are transitioning to the SN50 to increase throughput at scale.

SambaNova's RDU architecture delivers more than 3x faster inference and 4x faster time-to-first-token than GPUs for models like MiniMax 2.7, using a purpose-built dataflow architecture rather than general-purpose graphics hardware.

SambaNova's platform enables near-instantaneous time-to-first-token, which is critical for voice agent use cases where delays of even three seconds disrupt the user experience.

Blog

Build Faster Coding Agents with SambaNova’s Responses API

May 11, 2026

Blog

MiniMax M2.7 Running Fastest on SambaCloud

May 5, 2026

SambaStack

July 7, 2025

General Compute builds the world’s fastest inference cloud with SambaNova

What they do:

Challenge:

Solution:

The Results:

Most models get the right outcome the first time

More than 3X faster than GPUs for MiniMax 2.7*

4X faster time-to-first-token (TTFT) than GPUs for MiniMax 2.7*

Challenge:

Solution:

Response time

Highest quality speech LLMs

“You can run bigger models. You can take your 2 trillion parameter open-source model that you fine-tuned and you could put it on a SambaNova rack and you get unbeatable speed and unbeatable intelligence. And that just seems like a no brainer.”

— Jason Goodison, CTO, General Compute

Data center energy bottlenecks

Why SambaNova

Inference for AI and agents

FAQs

Related resources

Build Faster Coding Agents with SambaNova’s Responses API

MiniMax M2.7 Running Fastest on SambaCloud

SambaStack

Time to start building