
Benchmarking Samba-1

Posted by Anton McGonnell, Urmish Thakker on February 28, 2024

Today, we announced Samba-1, a Composition of Experts model with over 1 trillion parameters, built on top of open source models.

These parameters come from a collection of 50+ state-of-the-art expert models drawn from the broader open source community and from SambaNova. By combining these experts, we harness their collective strengths in a single Composition of Experts model that outperforms GPT-3.5 across all enterprise tasks and outperforms or matches GPT-4 on a large subset of those tasks, with a fraction of the compute footprint.

To demonstrate this, we would like to introduce the Enterprise Grade AI (EGAI) benchmark. The EGAI benchmark is a comprehensive collection of widely adopted benchmarks sourced from the open source community. Each benchmark is carefully selected to measure specific aspects of a model's capability to perform tasks pertinent to enterprise applications and use cases.


Highlighting the experts

  1. Text to SQL expert - SambaCoder-nsql-llama-2-70b is a model trained to predict a SQL query from a text input (a usage sketch follows this list). Built on top of Llama 2, the model was pre-trained on the SQL split of the Stack dataset and further instruction-tuned on the NSText2SQL dataset from Numbers Station. This model outperforms GPT-4 on the Spider benchmark. More details can be found on the Hugging Face model card - https://huggingface.co/sambanovasystems/SambaCoder-nsql-llama-2-70b
  2. Math and Reasoning Expert - Xwin-Math is a series of powerful models for math problems created by Xwin-LM: Xwin-Math-7B-V1.0, Xwin-Math-13B-V1.0, and Xwin-Math-70B-V1.0. These models are fine-tuned on top of Llama 2 models. Xwin-Math-70B ranks 1st in pass@1 on the GSM8k benchmark and is competitive with GPT models. These models can be found on the Xwin-LM Hugging Face page.
  3. Function Calling and API - Built on top of Llama 2 13B, NexusRaven-V2 from Nexusflow is a state-of-the-art open source LLM for function calling. It matches GPT-3.5 in zero-shot function calling for generic domains and beats it by 4% on security software. With retrieval augmentation, it outperforms GPT-4 by up to 30% on software unseen during training. More details can be found here - https://huggingface.co/Nexusflow/NexusRaven-V2-13B
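
To illustrate how such a text-to-SQL expert is typically queried, here is a minimal sketch using the Hugging Face transformers library. The schema-then-question prompt layout follows the NSQL convention and the example table is made up; check the model card for the exact template and recommended generation settings, and note that a 70B checkpoint requires substantial GPU memory.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal usage sketch (prompt template follows the NSQL convention;
# see the model card for the exact format and recommended settings).
model_id = "sambanovasystems/SambaCoder-nsql-llama-2-70b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Hypothetical schema and question, for illustration only.
prompt = """CREATE TABLE orders (
    order_id INTEGER,
    customer_id INTEGER,
    total_amount REAL
)

-- Using valid SQL, answer the following question for the table provided above.

-- What is the total order amount for each customer?

SELECT"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, i.e. the predicted SQL continuation.
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```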

At the bottom of this page, you can find a full list of all benchmarks.

Bringing the composition together

Beyond demonstrating the quality of these individual experts, we combine them behind a single endpoint. Incoming requests are dynamically routed to one or more experts, unlocking their combined power and making them work together as a single model behind a single user interface.
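
To make the routing idea concrete, here is a deliberately simplified sketch of a Composition of Experts endpoint. The expert definitions and keyword-overlap scoring are illustrative assumptions rather than Samba-1's actual router, which is a learned component, but the flow is the same: score each expert for an incoming request, then dispatch to the best match.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Expert:
    name: str
    description: str                # what kinds of requests this expert handles
    generate: Callable[[str], str]  # the expert's inference function

def score(query: str, expert: Expert) -> float:
    # Toy router signal: keyword overlap between the query and the expert
    # description. A production router would use a learned classifier or
    # embedding similarity instead.
    query_terms = set(query.lower().split())
    desc_terms = set(expert.description.lower().split())
    return len(query_terms & desc_terms) / (len(desc_terms) or 1)

def route(query: str, experts: list[Expert]) -> str:
    # Pick the highest-scoring expert and run the request through it.
    best = max(experts, key=lambda e: score(query, e))
    return best.generate(query)

# Hypothetical experts standing in for the real expert models.
experts = [
    Expert("text-to-sql", "write a sql query for a question about tables",
           lambda q: f"[SQL expert] SELECT ...  -- for: {q}"),
    Expert("math", "solve a math word problem step by step",
           lambda q: f"[Math expert] step-by-step solution for: {q}"),
]

print(route("Write a SQL query that counts orders per customer", experts))
```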

We have also open-sourced a scaled-down version of Samba-1, Samba-CoE v0.1, which pairs expert models with a sophisticated router and outperforms state-of-the-art open source models, both traditional dense models and mixture-of-experts models. This demonstrates that Composition of Experts is the best architecture for building state-of-the-art AI, as well as the most cost-effective and scalable.


Table 1. GPT-4 evaluation on 123 diverse curated queries sampled from MT-Bench and UltraChat


Table 2. Benchmark results for Samba-CoE-0.1 vs. Mixtral 8x7B, Qwen 72B, Falcon 180B.

We are just getting started. In the coming months, we will expand the experts available in Samba-1 to increase the breadth of tasks, diversity of experts per task, and improve upon our router to push the state-of-the-art forward and show that a Composition of Experts is the model architecture for Enterprise AI.

You can try Samba-1 now via trysambanova.ai

You can try Samba-CoE v0.1 in our Hugging Face Space


Comprehensive Benchmarks


The average scores are calculated across a combination of 36 different benchmarks measured by accuracy. These metrics utilize publicly accessible data for both GPT-3.5 Turbo and GPT-4. The benchmarks include GSM8K, Arc, Winogrande, TruthfulQA, MMLU, Hellaswag, AlpacaEval, HumanEval, MBPP, Summarization Judge, Code Judge, Climate, Stack, OTX, VT_multi, NVDLibrary Multi, Spider, and more.
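
For readers who want to reproduce the roll-up, the headline averages can be computed with a simple aggregation like the sketch below. It assumes an unweighted macro average over per-benchmark accuracies (the post does not specify the exact weighting), and the values shown are a small placeholder subset taken from the tables that follow rather than the full 36-benchmark set.

```python
# Hypothetical aggregation sketch: unweighted macro average over
# per-benchmark accuracy scores. Replace the placeholder values with the
# full set of 36 benchmark results to reproduce the headline numbers.
samba_1_scores = {
    "GSM8k": 87.0,
    "Spider": 78.1,
    "HumanEval Python": 79.3,
    # ... remaining benchmarks ...
}

average = sum(samba_1_scores.values()) / len(samba_1_scores)
print(f"Average over {len(samba_1_scores)} benchmarks: {average:.1f}")
```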

Enterprise-Grade AI (EGAI) Benchmark - Information Extraction

| Domain | Benchmark | Samba-1-Expert | GPT-3.5 Turbo | GPT-4 | Samba-1 |
|---|---|---|---|---|---|
| General | ACE05 | UniNER-7B-all | 27.6 | - | 87.6 |
| General | ACE04 | UniNER-7B-all | 27.7 | - | 87.5 |
| Biomed | bc5cdr | UniNER-7B-all | 53.4 | - | 91.4 |
| Biomed | bc4chemd | UniNER-7B-all | 36.5 | - | 89.8 |
| Biomed | JNLPBA | UniNER-7B-all | 16.6 | - | 76.6 |
| General | conllpp | UniNER-7B-all | 53.5 | - | 97 |
| General | Ontonotes | UniNER-7B-all | 30.7 | - | 89.1 |
| Biomed | GENIA | UniNER-7B-all | 42.6 | - | 80.6 |
| Biomed | ncbi | UniNER-7B-all | 43.1 | - | 88.1 |
| Clinical | i2b2 2010 concepts | UniNER-7B-all | 43.6 | - | 90.6 |
| Clinical | i2b2 2012 temporal | UniNER-7B-all | 29.5 | - | 82.9 |
| Clinical | i2b2 2014 deid | UniNER-7B-all | 16.5 | - | 92.2 |
| Clinical | i2b2 2006 deid 1B | UniNER-7B-all | 9.7 | - | 96.9 |
| Biomed | BioRED | UniNER-7B-all | 39.1 | - | 89.9 |
| Biomed | AnatEM | UniNER-7B-all | 26.1 | - | 89.9 |
| Social | WikiANN | UniNER-7B-all | 53 | - | 86.3 |
| Social | WikiNeural | UniNER-7B-all | 58.7 | - | 94.6 |
| Program | MultiNERD | UniNER-7B-all | 59.1 | - | 94.5 |
| Program | SOFC | UniNER-7B-all | 40.4 | - | 84.1 |
| STEM | SciERC | UniNER-7B-all | 13.3 | - | 67 |
| STEM | SciREX | UniNER-7B-all | 16.7 | - | 70.5 |
| Law | MAPA-en-fine | UniNER-7B-all | 17.4 | - | 86.4 |
| Law | MAPA-en-coarse | UniNER-7B-all | 29 | - | 76.1 |
| Law | E-NER | UniNER-7B-all | 15.4 | - | 94.4 |
| STEM | CrossNER science | UniNER-7B-all | 68 | - | 70.8 |
| Social | CrossNER politics | UniNER-7B-all | 69.5 | - | 67.3 |
| Program | CrossNER AI | UniNER-7B-all | 53.4 | - | 63.6 |
| STEM | DEAL | UniNER-7B-all | 27.6 | - | 79 |
| Program | Stackoverflow-NER | UniNER-7B-all | 11.6 | - | 65 |
| STEM | SoMeSci | UniNER-7B-all | 2.1 | - | 81.1 |
| Clinical | HarveyNER | UniNER-7B-all | 12.6 | - | 73.7 |
| Social | FabNER | UniNER-7B-all | 16.3 | - | 82.2 |
| Transport | FindVehicle | UniNER-7B-all | 11.5 | - | 99.1 |
 
Enterprise-Grade AI (EGAI) Benchmark - Function Calling

| Benchmark Name | Samba-1-Expert | GPT-3.5 Turbo | GPT-4 | Samba-1 |
|---|---|---|---|---|
| Climate | Nexusflow/NexusRaven-V2-13B | 25.53 | 68.09 | 70.21 |
| Stack | Nexusflow/NexusRaven-V2-13B | 44.95 | 48.14 | 59.90 |
| Places | Nexusflow/NexusRaven-V2-13B | 25.00 | 43.75 | 50.00 |
| OTX | Nexusflow/NexusRaven-V2-13B | 89.13 | 90.22 | 90.22 |
| VirusTotal | Nexusflow/NexusRaven-V2-13B | 81.00 | 88.00 | 80.13 |
| VT_Multi | Nexusflow/NexusRaven-V2-13B | 2.04 | 36.73 | 38.78 |
| NVDLibrary Single | Nexusflow/NexusRaven-V2-13B | 48.00 | 77.00 | 66.67 |
| NVDLibrary Multi | Nexusflow/NexusRaven-V2-13B | 7.14 | 7.14 | 25.00 |
| VT_Multi Parallel | Nexusflow/NexusRaven-V2-13B | 14.29 | 28.57 | 42.86 |

Enterprise-Grade AI (EGAI) Benchmark - Information Seeking Using API

| Benchmark Name | Samba-1-Expert | GPT-3.5 Turbo | GPT-4 | Samba-1 |
|---|---|---|---|---|
| Information Retrieval | Nexusflow/NexusRaven-V2-13B | 37.78 | 46.22 | 68.22 |
| Application Manipulation | Nexusflow/NexusRaven-V2-13B | 46.00 | 42.89 | 47.33 |
| Financial Transaction Processing | Nexusflow/NexusRaven-V2-13B | 34.22 | 44.89 | 53.11 |
| Real-Time Search | Nexusflow/NexusRaven-V2-13B | 40.89 | 51.11 | 52.67 |

Enterprise-Grade AI (EGAI) Benchmark - API Documentation Understanding

| Benchmark Name | Samba-1-Expert | GPT-3.5 Turbo | GPT-4 | Samba-1 |
|---|---|---|---|---|
| Tool Selection | Nexusflow/NexusRaven-V2-13B | 56.19 | 60 | 62.38 |
| Parameter Identification | Nexusflow/NexusRaven-V2-13B | 24.29 | 32.86 | 37.62 |
| Content Filling | Nexusflow/NexusRaven-V2-13B | 16.19 | 25.24 | 30 |

Enterprise-Grade AI (EGAI) Benchmark - Math and Reasoning

| Benchmark Name | Samba-1-Expert | GPT-3.5 Turbo | GPT-4 | Samba-1 |
|---|---|---|---|---|
| GSM8k | Xwin-Math-70B-V1.0 | 81.0 | 94.8 | 87.0 |

Enterprise-Grade AI (EGAI) Benchmark - Content Evaluation

| Benchmark Name | Benchmark Details | Samba-1-Expert | GPT-3.5 Turbo | GPT-4 | Samba-1 |
|---|---|---|---|---|---|
| Summarization Judge | Agreement Rates | Auto-J | 33.3 | 61.1 | 45.8 |
| Examination Judge | Agreement Rates | Auto-J | 40.3 | 51.4 | 38.9 |
| Code Judge | Agreement Rates | Auto-J | 36.7 | 68.3 | 47.5 |
| Rewriting Judge | Agreement Rates | Auto-J | 32.5 | 58.3 | 49.2 |
| Creative Writing Judge | Agreement Rates | Auto-J | 48.2 | 65.3 | 59.7 |
| Functional Writing Judge | Agreement Rates | Auto-J | 40.4 | 67.9 | 61.7 |
| General Communication Judge | Agreement Rates | Auto-J | 47.6 | 52.4 | 55.2 |
| Natural Language Processing Judge | Agreement Rates | Auto-J | 45.8 | 67.8 | 57.6 |

Enterprise-Grade AI (EGAI) Benchmark - Text to SQL

| Benchmark Name | Benchmark Details | Samba-1-Expert | GPT-3.5 Turbo | GPT-4 | Samba-1 |
|---|---|---|---|---|---|
| Spider | Execution Accuracy | SambaCoder-nsql-llama-2-70b | 72.8 | 76.2 | 78.1 |

Enterprise-Grade AI (EGAI) Benchmark - Programming

| Benchmark Name | Samba-1-Expert | GPT-3.5 Turbo | GPT-4 | Samba-1 |
|---|---|---|---|---|
| HumanEval Python | DeepSeekCoder-33B | 76.2 | 84.1 | 79.3 |
| HumanEval Multilingual | DeepSeekCoder-33B | 64.9 | 76.5 | 69.2 |
| MBPP | DeepSeekCoder-33B | 70.8 | 80 | 70 |

Enterprise-Grade AI (EGAI) Benchmark - Text Editing

| Benchmark Name | Samba-1-Expert | GPT-3.5 Turbo | GPT-4 | Samba-1 |
|---|---|---|---|---|
| EXPLORE-INSTRUCT-Brainstorm | Wanfq/Explore-LM-Ext-7B-Brainstorming | 40.29 | - | 59.71 |
| EXPLORE-INSTRUCT-Rewriting | Wanfq/Explore-LM-Ext-7B-Rewriting | 68.42 | - | 31.58 |

Enterprise-Grade AI (EGAI) Benchmark - Open LLM Leaderboard

| Benchmark Name | Benchmark Details | Samba-1-Expert | GPT-3.5 Turbo | GPT-4 | Samba-1 |
|---|---|---|---|---|---|
| GSM8k | Accuracy | Xwin-Math-70B-V1.0 | 81 | 94.8 | 87 |
| Arc | Accuracy | tulu-2 | 85.20 | 96.30 | 72.1 |
| Winogrande | Accuracy | tulu-2 | 81.60 | 87.50 | 83.27 |
| TruthfulQA | Accuracy | Lil-c3po | 57.50 | 59.00 | 68.73 |
| MMLU | Accuracy | tulu-2 | 70.00 | 86.40 | 69.84 |
| Hellaswag | Accuracy | tulu-2 | 85.50 | 95.30 | 88.99 |

Jimmy Lin, Changran Hu, Urmish Thakker