Test-Time Compute Available on SambaNova Cloud with Qwen QwQ-32B-Preview

Posted by Vasanth Mohan on December 18, 2024

Available today on SambaNova Cloud, developers have access to the best open source test-time compute model yet released, Alibaba's QwQ-32B-Preview. Test-time compute is a new paradigm for using large language models (LLMs) that was first made widely available by OpenAI through its o1 model. These models take time to think through an answer ("test time") before generating the final output. The primary advantage of this approach is that by working through the problem step by step before committing to an answer, LLMs are much more likely to produce an accurate response.

Qwen QwQ-32B-Preview flowchart

The challenge? These models need to produce many more tokens before delivering the final output. As a result, response times average around 30 seconds. While there are optimizations that can improve response speed, such as OpenAI's approach of reducing thinking time based on the complexity of the prompt, many prompts and use cases will still require a large number of generated tokens, and therefore time, before a response arrives.
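One practical way to soften that wait is to stream tokens as they are generated, so users can watch the model reason instead of staring at a blank screen. Here is a minimal sketch using the OpenAI-compatible Python client; the base URL and the QwQ-32B-Preview model name are assumptions on our part, so check the SambaNova Cloud documentation for the exact values.

```python
import os
from openai import OpenAI

# Assumed OpenAI-compatible endpoint and model name; verify both
# against the SambaNova Cloud documentation.
client = OpenAI(
    base_url="https://api.sambanova.ai/v1",
    api_key=os.environ["SAMBANOVA_API_KEY"],
)

stream = client.chat.completions.create(
    model="QwQ-32B-Preview",
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
    stream=True,  # surface the reasoning tokens as they are produced
)

for chunk in stream:
    # Each chunk carries an incremental piece of the response.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

Streaming does not shorten the total generation time, but it turns a 30-second silence into visible progress, which matters for interactive use cases.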

Because Alibaba has open sourced QwQ, the SambaNova team was able to optimize it to run on our RDU hardware, delivering a 3X speed improvement over GPU providers. Developers can start testing this model today, for free, in minutes on SambaNova Cloud.

Qwen QwQ-32B-Preview Output Speed

About QwQ

QwQ-32B is a 32 billion parameter model, which we are offering on SambaNova Cloud with an 8K context length. It has demonstrated superior performance on specific benchmarks, outperforming OpenAI's o1-preview and o1-mini models on tests such as AIME and MATH, which evaluate a model's mathematical reasoning and problem-solving abilities. While these preliminary results are impressive, the model is still in preview and intended primarily for research; it tends to fall short on other quality benchmarks.

Qwen QwQ-32B-Preview Performance Table
Source: https://ollama.com/library/qwq:32b

The Open Source Advantage

Because QwQ is an open source model, not only can we optimize it for our RDU hardware, but, equally important, we can see transparently how the model produces tokens at test time. Closed source alternatives intentionally hide their test-time output from view out of concern that those generations could be used to train a much better model.

On SambaNova Cloud, developers can see the test-time output and use it to build better fine-tuned models. We look forward to seeing the ecosystem use this opportunity to develop more powerful test-time compute models that run even faster thanks to SambaNova.
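As a rough illustration of that workflow, the sketch below collects QwQ's full test-time output, reasoning included, into a JSONL file of chat-style training examples. It reuses the assumed endpoint and model name from the earlier snippet, and the JSONL layout is just one common convention for fine-tuning data, not a prescribed format.

```python
import json
import os
from openai import OpenAI

# Same assumptions as before: OpenAI-compatible endpoint and model name.
client = OpenAI(
    base_url="https://api.sambanova.ai/v1",
    api_key=os.environ["SAMBANOVA_API_KEY"],
)

prompts = [
    "Prove that the sum of two even numbers is even.",
    "A train travels 120 km in 90 minutes. What is its average speed in km/h?",
]

with open("qwq_traces.jsonl", "w") as f:
    for prompt in prompts:
        response = client.chat.completions.create(
            model="QwQ-32B-Preview",
            messages=[{"role": "user", "content": prompt}],
        )
        # QwQ returns its step-by-step reasoning inline with the final
        # answer, so the full trace is available as training data.
        trace = response.choices[0].message.content
        f.write(json.dumps({
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": trace},
            ]
        }) + "\n")
```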

See it for yourself in our Hugging Face demo, developed using the SambaNova and Gradio integration.
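If you want to wire up a similar demo yourself, a few lines of Gradio are enough. The sketch below is not the official SambaNova-Gradio integration; it simply points a standard gr.ChatInterface at the assumed OpenAI-compatible endpoint and model name from the earlier snippets.

```python
import os
import gradio as gr
from openai import OpenAI

client = OpenAI(
    base_url="https://api.sambanova.ai/v1",  # assumed endpoint
    api_key=os.environ["SAMBANOVA_API_KEY"],
)

def chat(message, history):
    # With type="messages", history arrives as OpenAI-style role/content
    # dicts; rebuild it to strip any extra keys Gradio may attach.
    messages = [{"role": m["role"], "content": m["content"]} for m in history]
    messages.append({"role": "user", "content": message})
    response = client.chat.completions.create(
        model="QwQ-32B-Preview",  # assumed model name
        messages=messages,
    )
    return response.choices[0].message.content

gr.ChatInterface(chat, type="messages").launch()
```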

Qwen QwQ-32B-Preview Example

About SambaNova Cloud

SambaNova Cloud is available as a service for developers to easily integrate the best open source models at the fastest inference speeds. These speeds are powered by our state-of-the-art AI chip, the SN40L. Whether you are building AI agents or chatbots, fast inference is a must for your end users to have a seamless real-time experience. Get started in minutes with these models and more on SambaNova Cloud for free today.
