Today, we announced Samba-1, a Composition of Experts model with over 1 trillion parameters, built on top of open source models.
Those parameters come from a collection of 50+ state-of-the-art expert models, drawn from the broader open source community and from SambaNova. Combined into a single Composition of Experts model, these experts outperform GPT-3.5 across all enterprise tasks, and outperform or match GPT-4 on a large subset of those tasks, at a fraction of the compute footprint.
To demonstrate this, we are introducing the Enterprise Grade AI (EGAI) benchmark, a comprehensive collection of widely adopted benchmarks sourced from the open source community. Each benchmark is carefully selected to measure a specific aspect of a model's capability on tasks pertinent to enterprise applications and use cases.
At the bottom of this page, you can find a full list of all benchmarks.
But that's not all. Beyond demonstrating the quality of the individual experts, we combine them behind a single endpoint. This lets us dynamically route each request to one or more experts, so they work together as a single model behind a single user interface.
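To make the routing idea concrete, here is a minimal sketch of query-level expert dispatch. The expert names and the keyword-based router below are hypothetical stand-ins for illustration only; Samba-1's production router is a learned model that selects among 50+ full expert models behind one endpoint.

```python
# Illustrative sketch of Composition of Experts routing (not Samba-1's actual router).
# Each "expert" here is a placeholder callable; in practice each would be a full model.

from typing import Callable, Dict

EXPERTS: Dict[str, Callable[[str], str]] = {
    "sql":     lambda prompt: f"[sql expert] SELECT ... -- for: {prompt}",
    "code":    lambda prompt: f"[code expert] def solution(): ... # for: {prompt}",
    "general": lambda prompt: f"[general expert] answer to: {prompt}",
}

# Toy router: score each specialist by keyword overlap with the prompt.
KEYWORDS = {
    "sql":  {"select", "table", "query", "database"},
    "code": {"function", "python", "bug", "implement"},
}

def route(prompt: str) -> str:
    """Pick the expert whose keywords best match the prompt; fall back to general."""
    words = set(prompt.lower().split())
    best = max(KEYWORDS, key=lambda name: len(KEYWORDS[name] & words))
    return best if KEYWORDS[best] & words else "general"

def generate(prompt: str) -> str:
    """Single endpoint: route the whole request, then delegate to the chosen expert."""
    return EXPERTS[route(prompt)](prompt)

print(generate("Write a SQL query against the orders table"))
print(generate("Implement a Python function to reverse a list"))
```

Note the contrast with a mixture of experts, which routes inside a single network at the token level: a Composition of Experts routes whole requests across independently trained models.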
We have also just open-sourced a scaled-down version of Samba-1 that demonstrates the power of expert models paired with a sophisticated router. It outperforms state-of-the-art open source models, both traditional dense models and mixtures of experts, demonstrating that Composition of Experts is the best architecture for building state-of-the-art AI, as well as the most cost-effective and scalable.
Table 1. GPT4-Eval on 123 diverse curated queries sampled from MT-bench and UltraChat
Table 2. Benchmark results for Samba-CoE-0.1 vs. Mixtral 8x7B, Qwen 72B, Falcon 180B.
We are just getting started. In the coming months, we will expand the experts available in Samba-1 to increase the breadth of tasks and the diversity of experts per task, and we will improve our router to push the state of the art forward, showing that Composition of Experts is the model architecture for enterprise AI.
You can try Samba-1 now at trysambanova.ai
You can try Samba-CoE v0.1 in our Hugging Face Space
The average scores are calculated across 36 different benchmarks, each measured by accuracy, using publicly available results for GPT-3.5 Turbo and GPT-4. The benchmarks include GSM8K, Arc, Winogrande, TruthfulQA, MMLU, Hellaswag, AlpacaEval, HumanEval, MBPP, Summarization Judge, Code Judge, Climate, Stack, OTX, VT_Multi, NVDLibrary Multi, Spider, and more.
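As an illustration of how such an average is computed, the short sketch below macro-averages a few of the Samba-1 accuracy scores from the tables that follow; the published figure averages all 36 benchmarks.

```python
# Illustrative macro-average over benchmark accuracies.
# These three scores are Samba-1 values from the tables below;
# the published average covers all 36 benchmarks.
scores = {"GSM8k": 87.0, "MMLU": 69.84, "Hellaswag": 88.99}  # accuracy, %

average = sum(scores.values()) / len(scores)
print(f"Macro-average accuracy over {len(scores)} benchmarks: {average:.2f}%")
```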
| Domain | Benchmark | Samba-1 Expert | GPT-3.5 Turbo | GPT-4 | Samba-1 |
|---|---|---|---|---|---|
| General | ACE05 | UniNER-7B-all | 27.6 | - | 87.6 |
| General | ACE04 | UniNER-7B-all | 27.7 | - | 87.5 |
| Biomed | bc5cdr | UniNER-7B-all | 53.4 | - | 91.4 |
| Biomed | bc4chemd | UniNER-7B-all | 36.5 | - | 89.8 |
| Biomed | JNLPBA | UniNER-7B-all | 16.6 | - | 76.6 |
| General | conllpp | UniNER-7B-all | 53.5 | - | 97 |
| General | Ontonotes | UniNER-7B-all | 30.7 | - | 89.1 |
| Biomed | GENIA | UniNER-7B-all | 42.6 | - | 80.6 |
| Biomed | ncbi | UniNER-7B-all | 43.1 | - | 88.1 |
| Clinical | i2b2 2010 concepts | UniNER-7B-all | 43.6 | - | 90.6 |
| Clinical | i2b2 2012 temporal | UniNER-7B-all | 29.5 | - | 82.9 |
| Clinical | i2b2 2014 deid | UniNER-7B-all | 16.5 | - | 92.2 |
| Clinical | i2b2 2006 deid 1B | UniNER-7B-all | 9.7 | - | 96.9 |
| Biomed | BioRED | UniNER-7B-all | 39.1 | - | 89.9 |
| Biomed | AnatEM | UniNER-7B-all | 26.1 | - | 89.9 |
| Social | WikiANN | UniNER-7B-all | 53 | - | 86.3 |
| Social | WikiNeural | UniNER-7B-all | 58.7 | - | 94.6 |
| Program | MultiNERD | UniNER-7B-all | 59.1 | - | 94.5 |
| Program | SOFC | UniNER-7B-all | 40.4 | - | 84.1 |
| STEM | SciERC | UniNER-7B-all | 13.3 | - | 67 |
| STEM | SciREX | UniNER-7B-all | 16.7 | - | 70.5 |
| Law | MAPA-en-fine | UniNER-7B-all | 17.4 | - | 86.4 |
| Law | MAPA-en-coarse | UniNER-7B-all | 29 | - | 76.1 |
| Law | E-NER | UniNER-7B-all | 15.4 | - | 94.4 |
| STEM | CrossNER science | UniNER-7B-all | 68 | - | 70.8 |
| Social | CrossNER politics | UniNER-7B-all | 69.5 | - | 67.3 |
| Program | CrossNER AI | UniNER-7B-all | 53.4 | - | 63.6 |
| STEM | DEAL | UniNER-7B-all | 27.6 | - | 79 |
| Program | Stackoverflow-NER | UniNER-7B-all | 11.6 | - | 65 |
| STEM | SoMeSci | UniNER-7B-all | 2.1 | - | 81.1 |
| Clinical | HarveyNER | UniNER-7B-all | 12.6 | - | 73.7 |
| Social | FabNER | UniNER-7B-all | 16.3 | - | 82.2 |
| Transport | FindVehicle | UniNER-7B-all | 11.5 | - | 99.1 |
| Benchmark Name | Samba-1 Expert | GPT-3.5 Turbo | GPT-4 | Samba-1 |
|---|---|---|---|---|
| Climate | Nexusflow/NexusRaven-V2-13B | 25.53 | 68.09 | 70.21 |
| Stack | Nexusflow/NexusRaven-V2-13B | 44.95 | 48.14 | 59.90 |
| Places | Nexusflow/NexusRaven-V2-13B | 25.00 | 43.75 | 50.00 |
| OTX | Nexusflow/NexusRaven-V2-13B | 89.13 | 90.22 | 90.22 |
| VirusTotal | Nexusflow/NexusRaven-V2-13B | 81.00 | 88.00 | 80.13 |
| VT_Multi | Nexusflow/NexusRaven-V2-13B | 2.04 | 36.73 | 38.78 |
| NVDLibrary Single | Nexusflow/NexusRaven-V2-13B | 48.00 | 77.00 | 66.67 |
| NVDLibrary Multi | Nexusflow/NexusRaven-V2-13B | 7.14 | 7.14 | 25.00 |
| VT_Multi Parallel | Nexusflow/NexusRaven-V2-13B | 14.29 | 28.57 | 42.86 |
| Benchmark Name | Samba-1 Expert | GPT-3.5 Turbo | GPT-4 | Samba-1 |
|---|---|---|---|---|
| Information Retrieval | Nexusflow/NexusRaven-V2-13B | 37.78 | 46.22 | 68.22 |
| Application Manipulation | Nexusflow/NexusRaven-V2-13B | 46.00 | 42.89 | 47.33 |
| Financial Transaction Processing | Nexusflow/NexusRaven-V2-13B | 34.22 | 44.89 | 53.11 |
| Real-Time Search | Nexusflow/NexusRaven-V2-13B | 40.89 | 51.11 | 52.67 |
| Benchmark Name | Samba-1 Expert | GPT-3.5 Turbo | GPT-4 | Samba-1 |
|---|---|---|---|---|
| Tool Selection | Nexusflow/NexusRaven-V2-13B | 56.19 | 60 | 62.38 |
| Parameter Identification | Nexusflow/NexusRaven-V2-13B | 24.29 | 32.86 | 37.62 |
| Content Filling | Nexusflow/NexusRaven-V2-13B | 16.19 | 25.24 | 30 |
| Benchmark Name | Metric | Samba-1 Expert | GPT-3.5 Turbo | GPT-4 | Samba-1 |
|---|---|---|---|---|---|
| Summarization Judge | Agreement Rates | Auto-J | 33.3 | 61.1 | 45.8 |
| Examination Judge | Agreement Rates | Auto-J | 40.3 | 51.4 | 38.9 |
| Code Judge | Agreement Rates | Auto-J | 36.7 | 68.3 | 47.5 |
| Rewriting Judge | Agreement Rates | Auto-J | 32.5 | 58.3 | 49.2 |
| Creative Writing Judge | Agreement Rates | Auto-J | 48.2 | 65.3 | 59.7 |
| Functional Writing Judge | Agreement Rates | Auto-J | 40.4 | 67.9 | 61.7 |
| General Communication Judge | Agreement Rates | Auto-J | 47.6 | 52.4 | 55.2 |
| Natural Language Processing Judge | Agreement Rates | Auto-J | 45.8 | 67.8 | 57.6 |
| Benchmark Name | Metric | Samba-1 Expert | GPT-3.5 Turbo | GPT-4 | Samba-1 |
|---|---|---|---|---|---|
| Spider | Execution Accuracy | SambaCoder-nsql-llama-2-70b | 72.8 | 76.2 | 78.1 |
| Benchmark Name | Samba-1 Expert | GPT-3.5 Turbo | GPT-4 | Samba-1 |
|---|---|---|---|---|
| HumanEval Python | DeepSeekCoder-33B | 76.2 | 84.1 | 79.3 |
| HumanEval Multilingual | DeepSeekCoder-33B | 64.9 | 76.5 | 69.2 |
| MBPP | DeepSeekCoder-33B | 70.8 | 80 | 70 |
| Benchmark Name | Samba-1 Expert | GPT-3.5 Turbo | GPT-4 | Samba-1 |
|---|---|---|---|---|
| EXPLORE-INSTRUCT-Brainstorm | Wanfq/Explore-LM-Ext-7B-Brainstorming | 40.29 | - | 59.71 |
| EXPLORE-INSTRUCT-Rewriting | Wanfq/Explore-LM-Ext-7B-Rewriting | 68.42 | - | 31.58 |
| Benchmark Name | Metric | Samba-1 Expert | GPT-3.5 Turbo | GPT-4 | Samba-1 |
|---|---|---|---|---|---|
| GSM8k | Accuracy | Xwin-Math-70B-V1.0 | 81 | 94.8 | 87 |
| Arc | Accuracy | tulu-2 | 85.20 | 96.30 | 72.1 |
| Winogrande | Accuracy | tulu-2 | 81.60 | 87.50 | 83.27 |
| TruthfulQA | Accuracy | Lil-c3po | 57.50 | 59.00 | 68.73 |
| MMLU | Accuracy | tulu-2 | 70.00 | 86.40 | 69.84 |
| Hellaswag | Accuracy | tulu-2 | 85.50 | 95.30 | 88.99 |