SambaNova and Groq both recently achieved 1000 tokens per second serving Meta’s LLaMa 3 8B Instruct (AI@Meta, 2024), but with very different inference systems. Among the many differences between the two, the most notable here is precision: SambaNova keeps both the model weights and activations in their original mixed 16-bit and 32-bit precision, while Groq uses a form of reduced precision known as TruePoint Technology (Groq, 2024). Recent work has highlighted that quantizing the LLaMa 3 models can lead to non-negligible degradation in model performance (Huang et al., 2024). At SambaNova, we therefore asked: to what extent does the difference between Groq’s and SambaNova’s precision affect model performance?
To answer this question, we compared the response quality of SambaNova’s and Groq’s inference systems across a variety of tasks: 15 general tasks, two coding tasks, and one chat-based task. The results show that LLaMa 3 8B Instruct served by SambaNova outperforms the same model served by Groq by an average of 3.16% on the general tasks, and in most cases (11 out of 15) the difference is statistically significant. To put that into perspective, on MMLU (Hendrycks et al., 2020) the degradation takes LLaMa 3 8B down to the performance level of Gemma 7B (Gemma Team, 2024).
To evaluate the accuracy of the two systems holistically, we chose a diverse set of benchmarks covering general knowledge, coding, and chat capabilities. For general knowledge, we picked benchmarks used in Databricks’ Eval Gauntlet (Mosaic ML, 2023), since they are both widely used and reproducible, and we added Arabic Exams (Huang et al., 2023) to evaluate the effect of Groq’s reduced precision on multilingual tasks.
Our general knowledge tasks are listed, together with the corresponding results, in Table 1.
For our coding tasks, we chose the widely used HumanEval Python (Chen et al., 2021) and Mostly Basic Programming Problems (MBPP) (Austin et al., 2021) benchmarks, which measure how well an instruction-tuned model answers coding questions. Finally, we evaluated both systems on the Length-Controlled version of Alpaca-Eval 2.0 (Dubois et al., 2024) to see how they differ in chatbot-style capabilities.
To keep our experiments generalizable and reproducible, we used EleutherAI’s widely adopted LM-Eval-Harness (Gao et al., 2023) and added both SambaNova’s and Groq’s systems as models in the harness. For many tasks we used the off-the-shelf evaluation code as-is. However, some tasks require log-probabilities, which Groq does not currently provide, so we mimicked MMLU’s generative evaluation setup and converted a number of multiple-choice tasks into generation tasks. Our fork of LM-Eval-Harness is available here: https://github.com/snova-jayr/lm-evaluation-harness-generations/tree/api_calls. It includes all the commands and a README so others can rerun our experiments. For all of these experiments we used 5-shot evaluation, since that is the few-shot setting used throughout the Eval Gauntlet and the standard setting for the generative version of MMLU, and we used greedy sampling to compare the two systems fairly.
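To illustrate the setup (the exact commands for our fork are documented in its README), a recent version of LM-Eval-Harness can be driven programmatically as sketched below. The endpoint, model identifier, and task name are placeholders rather than the arguments we actually used:

```python
# Minimal sketch: running lm-evaluation-harness (v0.4+) against an
# OpenAI-compatible inference API with 5-shot prompting and greedy decoding.
# base_url, the model id, and the task name are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="local-chat-completions",          # generic OpenAI-compatible backend
    model_args="model=llama3-8b-instruct,base_url=https://example.com/v1",
    tasks=["mmlu_generative"],               # placeholder for a generative MMLU-style task
    num_fewshot=5,                           # 5-shot, matching the Eval Gauntlet default
    gen_kwargs="temperature=0",              # greedy sampling
)
print(results["results"])                    # per-task metrics
```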
For the coding tasks, we used the evaluation code provided by DeepSeek (Guo et al., 2024) to maintain reproducibility, applying the same methodology to both SambaNova’s and Groq’s systems on the Python variants of HumanEval and MBPP. Our repository is available here: https://github.com/snova-guhae/SambaGroqEvalCode, along with a README on how to reproduce our results. We again used greedy sampling.
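For reference, both benchmarks are scored with the standard pass@k estimator of Chen et al. (2021); since we decode greedily, there is a single completion per problem and the reported number is pass@1, i.e. the fraction of problems whose completion passes all unit tests. A minimal sketch of the estimator:

```python
# Unbiased pass@k estimator from Chen et al. (2021).
# n: total completions sampled for a problem, c: completions that pass
# the unit tests, k: the k in pass@k. The benchmark score is the mean of
# this quantity over all problems.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# With greedy decoding, n = 1, so pass@1 reduces to pass/fail per problem.
print(pass_at_k(n=1, c=1, k=1))  # 1.0
print(pass_at_k(n=1, c=0, k=1))  # 0.0
```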
For Alpaca-Eval, we produced generations for all prompts using greedy sampling and used the alpaca-eval repository to calculate each system’s win rate against GPT-4 (Achiam et al., 2023). Our code for producing the generations for both systems, along with the generations themselves, is available here: https://github.com/snova-guhae/SambaGroqAlpacaEval.
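As a sketch of how such generations can be produced through an OpenAI-compatible endpoint and saved in the format the alpaca-eval tooling expects, see below; the base_url, API key, and model identifier are placeholders, not our actual configuration:

```python
# Sketch: greedy generations for the AlpacaEval set via an OpenAI-compatible
# API, saved as a JSON list of {"instruction", "output", "generator"} records
# that the alpaca-eval CLI accepts via --model_outputs. All endpoint details
# below are placeholders.
import json
import datasets
from openai import OpenAI

client = OpenAI(base_url="https://example.com/v1", api_key="PLACEHOLDER")
eval_set = datasets.load_dataset(
    "tatsu-lab/alpaca_eval", "alpaca_eval", trust_remote_code=True
)["eval"]

outputs = []
for example in eval_set:
    reply = client.chat.completions.create(
        model="llama3-8b-instruct",  # placeholder model id
        messages=[{"role": "user", "content": example["instruction"]}],
        temperature=0,               # greedy sampling
    )
    outputs.append({
        "instruction": example["instruction"],
        "output": reply.choices[0].message.content,
        "generator": "llama3-8b-instruct-system-under-test",
    })

with open("model_outputs.json", "w") as f:
    json.dump(outputs, f, indent=2)
```

The resulting file can then be scored with the alpaca-eval CLI (for example, `alpaca_eval --model_outputs model_outputs.json`), which reports the length-controlled win rate.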
Across the 15 general tasks, the LLaMa 3 8B model served by Samba-1 outperforms Groq by 3.16% on average. On 11 of the 15 general tasks, the difference in means is statistically significant (the mean difference is greater than the standard error of the difference), which indicates that Groq’s reduced-precision scheme tends to degrade performance on general tasks. The gap is largest on Conversational Question Answering (CoQA), where the degradation exceeds nine percentage points; we highlight both exact match and F1 score, since these are the metrics commonly reported for CoQA. On coding, Samba-1 outperforms Groq on MBPP, while Groq outperforms Samba-1 on HumanEval, though that margin is within noise. On the chat-based task, Alpaca-Eval 2.0 LC, Samba-1 scores higher than Groq by a statistically significant margin. Overall, our results support the conclusion that the difference between SambaNova’s and Groq’s precision schemes can lead to consistent accuracy degradation across many tasks. We report all results in Table 1, where statistically significant differences are shown in bold.
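For clarity, one standard way to compute the standard error of the difference referenced above, assuming the two systems’ scores are independent, is to combine the per-system standard errors:

$$\mathrm{SE}_{\Delta} = \sqrt{\mathrm{SE}_{\mathrm{SambaNova}}^{2} + \mathrm{SE}_{\mathrm{Groq}}^{2}}, \qquad \text{significant if } \left|\bar{x}_{\mathrm{SambaNova}} - \bar{x}_{\mathrm{Groq}}\right| > \mathrm{SE}_{\Delta}.$$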
Overall, reducing precision can be an effective way to accelerate a model’s inference, but implementing it properly is complex and, as shown above, can degrade model performance. Finding reduced-precision schemes that do not degrade performance remains an open line of research. This analysis highlights how the difference between Groq’s and SambaNova’s approaches to storing model weights and activations leads to differing performance on downstream tasks.