SambaNova and Groq both recently achieved 1000 tokens per second serving Meta’s LLaMa 3 8B Instruct (AI@Meta, 2024), but with very different inference systems. Among the many differences between the two, the most notable here is precision: SambaNova keeps both the model weights and activations in their original mixed 16-bit and 32-bit precision, while Groq uses a form of reduced precision known as TruePoint Technology (Groq, 2024). Recent work has highlighted that quantizing the LLaMa 3 models can lead to non-negligible degradation in model performance (Huang et al., 2024). At SambaNova, we therefore asked: to what extent does the difference between Groq’s and SambaNova’s precision affect model performance?
To answer this question, we compared the response quality of SambaNova’s and Groq’s inference systems across a variety of tasks: 15 general tasks, two coding tasks, and one chat-based task. The results show that LLaMa 3 8B Instruct served by SambaNova outperforms the same model served by Groq by an average of 3.16% on the general tasks, and in most cases (11 out of 15) the difference is statistically significant. To put that into perspective, on MMLU (Hendrycks et al., 2020) the degradation takes LLaMa 3 8B down to the performance level of Gemma 7B (Gemma Team, 2024).
To evaluate the accuracy of the two systems holistically, we chose a diverse set of benchmarks covering general knowledge, coding, and chat capabilities. For general knowledge, we picked benchmarks used in Databricks’ Eval Gauntlet (Mosaic ML, 2023), since they are both widely used and reproducible, and we added Arabic Exams (Huang et al., 2023) to evaluate the effect of Groq’s reduced precision on multilingual tasks.
Our general knowledge tasks are listed, together with the corresponding results, in Table 1.
For our coding tasks, we chose the widely used HumanEval Python (Chen et al., 2021) and Mostly Basic Programming Problems (MBPP) (Austin et al., 2021) benchmarks, which measure how well an instruction-tuned model answers coding questions. Finally, we evaluated both systems on the Length-Controlled version of Alpaca-Eval 2.0 (Dubois et al., 2024) to see how they differ in chatbot-style capabilities.
To keep our experiments generalizable and reproducible, we used EleutherAI’s widely adopted LM-Eval-Harness (Gao et al., 2023) and added both SambaNova’s and Groq’s systems as models in the harness. For many tasks we used the off-the-shelf evaluation code as-is. However, some tasks require log-probabilities, which Groq does not currently provide, so we mimicked MMLU’s generative evaluation setup and converted a number of multiple-choice tasks into generation tasks. Our fork of LM-Eval-Harness is available here: https://github.com/snova-jayr/lm-evaluation-harness-generations/tree/api_calls. It includes all the commands and a README so others can rerun our experiments. For all of these experiments we used 5-shot evaluation, since that is the few-shot setting used throughout the Eval Gauntlet and the standard setting for the generative version of MMLU, and we used greedy sampling to compare the two systems fairly.
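To illustrate the setup (the exact commands for our fork are documented in its README), a recent version of LM-Eval-Harness can be driven programmatically as sketched below. The endpoint, model identifier, and task name are placeholders rather than the arguments we actually used:

```python
# Minimal sketch: running lm-evaluation-harness (v0.4+) against an
# OpenAI-compatible inference API with 5-shot prompting and greedy decoding.
# base_url, the model id, and the task name are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="local-chat-completions",          # generic OpenAI-compatible backend
    model_args="model=llama3-8b-instruct,base_url=https://example.com/v1",
    tasks=["mmlu_generative"],               # placeholder for a generative MMLU-style task
    num_fewshot=5,                           # 5-shot, matching the Eval Gauntlet default
    gen_kwargs="temperature=0",              # greedy sampling
)
print(results["results"])                    # per-task metrics
```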
For the coding tasks, we used the evaluation code provided by DeepSeek (Guo et al., 2024) to maintain reproducibility, applying the same methodology to both SambaNova’s and Groq’s systems on the Python variants of HumanEval and MBPP. Our repository is available here: https://github.com/snova-guhae/SambaGroqEvalCode, along with a README on how to reproduce our results. We again used greedy sampling.
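For reference, both benchmarks are scored with the standard pass@k estimator of Chen et al. (2021); since we decode greedily, there is a single completion per problem and the reported number is pass@1, i.e. the fraction of problems whose completion passes all unit tests. A minimal sketch of the estimator:

```python
# Unbiased pass@k estimator from Chen et al. (2021).
# n: total completions sampled for a problem, c: completions that pass
# the unit tests, k: the k in pass@k. The benchmark score is the mean of
# this quantity over all problems.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# With greedy decoding, n = 1, so pass@1 reduces to pass/fail per problem.
print(pass_at_k(n=1, c=1, k=1))  # 1.0
print(pass_at_k(n=1, c=0, k=1))  # 0.0
```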
For Alpaca-Eval, we produced generations for all prompts using greedy sampling and used the alpaca-eval repository to calculate each system’s win rate against GPT-4 (Achiam et al., 2023). Our code for producing the generations for both systems, along with the generations themselves, is available here: https://github.com/snova-guhae/SambaGroqAlpacaEval.
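As a sketch of how such generations can be produced through an OpenAI-compatible endpoint and saved in the format the alpaca-eval tooling expects, see below; the base_url, API key, and model identifier are placeholders, not our actual configuration:

```python
# Sketch: greedy generations for the AlpacaEval set via an OpenAI-compatible
# API, saved as a JSON list of {"instruction", "output", "generator"} records
# that the alpaca-eval CLI accepts via --model_outputs. All endpoint details
# below are placeholders.
import json
import datasets
from openai import OpenAI

client = OpenAI(base_url="https://example.com/v1", api_key="PLACEHOLDER")
eval_set = datasets.load_dataset(
    "tatsu-lab/alpaca_eval", "alpaca_eval", trust_remote_code=True
)["eval"]

outputs = []
for example in eval_set:
    reply = client.chat.completions.create(
        model="llama3-8b-instruct",  # placeholder model id
        messages=[{"role": "user", "content": example["instruction"]}],
        temperature=0,               # greedy sampling
    )
    outputs.append({
        "instruction": example["instruction"],
        "output": reply.choices[0].message.content,
        "generator": "llama3-8b-instruct-system-under-test",
    })

with open("model_outputs.json", "w") as f:
    json.dump(outputs, f, indent=2)
```

The resulting file can then be scored with the alpaca-eval CLI (for example, `alpaca_eval --model_outputs model_outputs.json`), which reports the length-controlled win rate.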
Across the 15 general tasks, the LLaMa 3 8B model served by Samba-1 outperforms Groq by 3.16% on average. On 11 of the 15 general tasks, the difference in means is statistically significant (the mean difference is greater than the standard error of the difference), which indicates that Groq’s reduced-precision scheme tends to degrade performance on general tasks. The gap is largest on Conversational Question Answering (CoQA), where the degradation exceeds nine percentage points; we highlight both exact match and F1 score, since these are the metrics commonly reported for CoQA. On coding, Samba-1 outperforms Groq on MBPP, while Groq outperforms Samba-1 on HumanEval, though that margin is within noise. On the chat-based task, Alpaca-Eval 2.0 LC, Samba-1 scores higher than Groq by a statistically significant margin. Overall, our results support the conclusion that the difference between SambaNova’s and Groq’s precision schemes can lead to consistent accuracy degradation across many tasks. We report all results in Table 1, where statistically significant differences are shown in bold.
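For clarity, one standard way to compute the standard error of the difference referenced above, assuming the two systems’ scores are independent, is to combine the per-system standard errors:

$$\mathrm{SE}_{\Delta} = \sqrt{\mathrm{SE}_{\mathrm{SambaNova}}^{2} + \mathrm{SE}_{\mathrm{Groq}}^{2}}, \qquad \text{significant if } \left|\bar{x}_{\mathrm{SambaNova}} - \bar{x}_{\mathrm{Groq}}\right| > \mathrm{SE}_{\Delta}.$$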
Overall, reducing precision can be an effective way to accelerate a model’s inference, but implementing it properly is complex and, as shown above, can degrade model performance. Finding reduced-precision schemes that do not degrade performance remains an open line of research. This analysis highlights how the difference between Groq’s and SambaNova’s approaches to storing model weights and activations leads to differing performance on downstream tasks.