Test-Time Compute Available on SambaNova Cloud with Qwen QwQ-32B-Preview

Posted by Vasanth Mohan on December 18, 2024

Available today on SambaNova Cloud, developers have access to the best open source test-time compute model yet released, Alibaba's QwQ-32B-Preview. Test-time compute is a new paradigm for using large language models (LLMs) that was first made widely available by OpenAI through its o1 model. These models take time to think through an answer ("test time") before generating the final output. The primary advantage of this approach is that by working through the problem step by step before committing to an answer, LLMs are much more likely to produce an accurate response.

Qwen QwQ-32B-Preview flowchart

The challenge? These models need to produce many more tokens before delivering the final output. As a result, response times average around 30 seconds. While there are optimizations that can improve response speed, such as OpenAI's approach of reducing thinking time based on the complexity of the prompt, many prompts and use cases will still require a large number of generated tokens, and therefore time, before a response arrives.
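One practical way to soften that wait is to stream tokens as they are generated, so users can watch the model reason instead of staring at a blank screen. Here is a minimal sketch using the OpenAI-compatible Python client; the base URL and the QwQ-32B-Preview model name are assumptions on our part, so check the SambaNova Cloud documentation for the exact values.

```python
import os
from openai import OpenAI

# Assumed OpenAI-compatible endpoint and model name; verify both
# against the SambaNova Cloud documentation.
client = OpenAI(
    base_url="https://api.sambanova.ai/v1",
    api_key=os.environ["SAMBANOVA_API_KEY"],
)

stream = client.chat.completions.create(
    model="QwQ-32B-Preview",
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
    stream=True,  # surface the reasoning tokens as they are produced
)

for chunk in stream:
    # Each chunk carries an incremental piece of the response.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

Streaming does not shorten the total generation time, but it turns a 30-second silence into visible progress, which matters for interactive use cases.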

Because Alibaba has open sourced QwQ, the SambaNova team was able to optimize it to run on our RDU hardware, delivering a 3X speed improvement over GPU providers. Developers can start testing this model today, for free, in minutes on SambaNova Cloud.

Qwen QwQ-32B-Preview Output Speed

About QwQ

QwQ-32B is a 32 billion parameter model, which we are offering on SambaNova Cloud with an 8K context length. It has demonstrated superior performance on specific benchmarks, outperforming OpenAI's o1-preview and o1-mini models on tests such as AIME and MATH, which evaluate a model's mathematical reasoning and problem-solving abilities. While these preliminary results are impressive, the model is still in preview and intended primarily for research; it tends to fall short on other quality benchmarks.

Qwen QwQ-32B-Preview Performance Table
Source: https://ollama.com/library/qwq:32b

The Open Source Advantage

Because QwQ is an open source model, not only can we optimize it for our RDU hardware, but, equally important, we can see transparently how the model produces tokens at test time. Closed source alternatives intentionally hide their test-time output from view out of concern that those generations could be used to train a much better model.

On SambaNova Cloud, developers can see the test-time output and use it to build better fine-tuned models. We look forward to seeing the ecosystem use this opportunity to develop more powerful test-time compute models that run even faster thanks to SambaNova.
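As a rough illustration of that workflow, the sketch below collects QwQ's full test-time output, reasoning included, into a JSONL file of chat-style training examples. It reuses the assumed endpoint and model name from the earlier snippet, and the JSONL layout is just one common convention for fine-tuning data, not a prescribed format.

```python
import json
import os
from openai import OpenAI

# Same assumptions as before: OpenAI-compatible endpoint and model name.
client = OpenAI(
    base_url="https://api.sambanova.ai/v1",
    api_key=os.environ["SAMBANOVA_API_KEY"],
)

prompts = [
    "Prove that the sum of two even numbers is even.",
    "A train travels 120 km in 90 minutes. What is its average speed in km/h?",
]

with open("qwq_traces.jsonl", "w") as f:
    for prompt in prompts:
        response = client.chat.completions.create(
            model="QwQ-32B-Preview",
            messages=[{"role": "user", "content": prompt}],
        )
        # QwQ returns its step-by-step reasoning inline with the final
        # answer, so the full trace is available as training data.
        trace = response.choices[0].message.content
        f.write(json.dumps({
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": trace},
            ]
        }) + "\n")
```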

See it for yourself in our Hugging Face demo, developed using the SambaNova and Gradio integration.
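If you want to wire up a similar demo yourself, a few lines of Gradio are enough. The sketch below is not the official SambaNova-Gradio integration; it simply points a standard gr.ChatInterface at the assumed OpenAI-compatible endpoint and model name from the earlier snippets.

```python
import os
import gradio as gr
from openai import OpenAI

client = OpenAI(
    base_url="https://api.sambanova.ai/v1",  # assumed endpoint
    api_key=os.environ["SAMBANOVA_API_KEY"],
)

def chat(message, history):
    # With type="messages", history arrives as OpenAI-style role/content
    # dicts; rebuild it to strip any extra keys Gradio may attach.
    messages = [{"role": m["role"], "content": m["content"]} for m in history]
    messages.append({"role": "user", "content": message})
    response = client.chat.completions.create(
        model="QwQ-32B-Preview",  # assumed model name
        messages=messages,
    )
    return response.choices[0].message.content

gr.ChatInterface(chat, type="messages").launch()
```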

Qwen QwQ-32B-Preview Example

About SambaNova Cloud

SambaNova Cloud is available as a service for developers to easily integrate the best open source models at the fastest inference speeds. These speeds are powered by our state-of-the-art AI chip, the SN40L. Whether you are building AI agents or chatbots, fast inference is a must for your end users to have a seamless real-time experience. Get started in minutes with these models and more on SambaNova Cloud for free today.
