Elevating Information Retrieval and Augmenting Large Language Models
We are pleased to announce that SambaStudio now supports text embedding models. This new feature significantly strengthens the information retrieval capabilities of the SambaNova Suite and augments large language models (LLMs) with Retrieval Augmented Generation (RAG). By supplying the right context to the LLM, RAG improves base models along several dimensions, including better factuality and fewer hallucinations, bringing tangible benefits to our customers.
A text embedding model converts a piece of text into a d-dimensional vector of real numbers, as illustrated in Figure 1.

Figure 1: A text embedding model converts text into a d-dimensional embedding vector
The vectors produced by embedding models are arranged such that texts with similar meanings cluster together while texts with different meanings lie far apart. This property makes text embeddings ubiquitous in information retrieval systems such as search and recommendation: documents are converted to embedding vectors ahead of time, and texts similar to a given input can be found by looking for the vectors closest to the input's embedding.
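To make this concrete, here is a minimal sketch of similarity search over embedding vectors. It assumes an `embed` function (not shown) has already mapped each document to a NumPy vector, and uses cosine similarity, one common choice of distance.

```python
# Minimal sketch: find the documents whose embeddings are closest to a query embedding.
import numpy as np

def top_k_similar(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return the indices of the k document vectors closest to the query (cosine similarity)."""
    q = query_vec / np.linalg.norm(query_vec)                       # normalize query
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)  # normalize documents
    scores = d @ q                                                  # cosine similarity per document
    return np.argsort(-scores)[:k]
```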
More specifically, we are starting with e5_large_v2, a versatile text embedding model that performs strongly across a wide range of embedding use cases. It is one of the top-ranked open source models on the Massive Text Embedding Benchmark (MTEB).
Embedding model support in SambaStudio includes both an offline batch inference mode and an online inference API endpoint. The offline batch mode can be used to construct approximate nearest neighbor (ANN) indexes or vector databases from an existing document corpus, while the online inference API endpoint embeds a piece of text in the request path so that relevant documents can be retrieved on the fly.
With these two capabilities our customers can leverage the symbiotic relationship between information retrieval and LLMs to build powerful applications. At SambaNova, we are committed to delivering cutting-edge generative AI technology to empower our customers. The addition of embedding model support is a key stepping stone in that journey.
Code snippet for using the embedding endpoint on SambaStudio:
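The original snippet is not reproduced here; the sketch below only illustrates the general shape of such a call. The endpoint URL, header name, and payload fields are placeholders, not the exact SambaStudio API schema, so adjust them to match your deployment.

```python
# Illustrative sketch only: URL, auth header, and payload shape are placeholders,
# not the exact SambaStudio API schema.
import requests

ENDPOINT_URL = "https://<your-sambastudio-host>/api/predict/<endpoint-id>"  # placeholder
API_KEY = "<your-endpoint-api-key>"                                         # placeholder

response = requests.post(
    ENDPOINT_URL,
    headers={"key": API_KEY},                                     # placeholder header name
    json={"inputs": ["What is Retrieval Augmented Generation?"]},
    timeout=30,
)
response.raise_for_status()
vectors = response.json()   # expected: one d-dimensional embedding per input text
print(len(vectors))
```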


Output:

The potential of RAG is best illustrated by Edgar Research Assistant:

We are excited to announce text embedding model support and Retrieval Augmented Generation (RAG) capability on SambaStudio.
We look forward to seeing what our customers will create using this brand new capability. More embedding models will be added to our platform in the future. Stay tuned!
These two capabilities are key to enabling RAG with LLMs. Our customers can now chain vector databases or ANN indexes built with the embedding models to the family of LLMs already supported on our platform.
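As a rough illustration of that chaining, the sketch below builds a small FAISS index from batch embeddings and feeds retrieved context to an LLM. The `embed` and `generate` helpers are assumptions standing in for the embedding and LLM endpoints, and the prompt wording is illustrative.

```python
# Sketch of chaining an ANN index to an LLM for RAG.
# `embed(texts)` is assumed to return float32 NumPy vectors (e.g. from the batch embedding job);
# `generate(prompt)` is assumed to call an LLM endpoint already deployed on the platform.
import faiss

docs = [
    "SambaStudio supports text embedding models for information retrieval.",
    "RAG grounds LLM answers in retrieved context to reduce hallucinations.",
]
doc_vecs = embed(docs)                        # shape (n_docs, d), float32, L2-normalized
index = faiss.IndexFlatIP(doc_vecs.shape[1])  # inner product equals cosine for normalized vectors
index.add(doc_vecs)

question = "How does RAG reduce hallucinations?"
_, ids = index.search(embed([question]), 2)   # retrieve the 2 closest documents
context = "\n".join(docs[i] for i in ids[0])
answer = generate(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
```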
Enabling Open Source LLMs to Become Effective Tool Manipulators
Introduction
Using tools can extend the capabilities of LLMs to access knowledge beyond their training data. As an example, LLMs often underperform in math relative to their text generation capabilities. However, math is a well-solved problem with established tools like calculators. Rather than investing significant time in training LLMs for math at various difficulty levels, teaching them to use calculators is a more efficient and scalable approach. Similar approaches can also enable LLMs to manipulate software and applications through a natural language interface. Enterprises can greatly benefit from this technology, but for security reasons they require a fully controllable LLM integrated with their software rather than exposing their APIs to an external LLM provider, especially when the data and use case are sensitive.
Open source LLMs are getting stronger by the day on general NLP tasks. The latest models, such as Llama 2 released by Meta, also went through an extensive alignment phase after pre-training, which makes them performant in a variety of use cases. However, little understanding exists of their ability to manipulate API functions with natural language inputs. To understand this, we curated a comprehensive benchmark, ToolBench, for the community to evaluate the capability of LLMs to use software to tackle real-world tasks. This will enable users to gain a better understanding of where the performance gaps are between open source and proprietary models, facilitating their selection and adoption of open source models.
At SambaNova, we consistently strive to collaborate with the open source community by enabling open source models on our platform, as the community is moving at an incredibly fast pace and its innovation can bring critical value to end customers. We are therefore motivated to study where the gaps are and to propose practical, scalable methods that overcome the tool manipulation weaknesses of open source models. In particular, we found that the following three techniques can greatly reduce the gap between open source models and proprietary models:
- Model Alignment with Programmatic Data Curation
- In-Context Learning with Demonstration Examples
- Generation Regulation with System Prompts.
These methods take about one developer-day to curate data for a given tool, and they can greatly reduce the accuracy gap between open source models and proprietary models like GPT-4. We demonstrate that our techniques can boost leading open source LLMs to success rates of up to 90%, either surpassing or closely trailing GPT-4 on 4 out of 8 ToolBench tasks.
Benchmarking LLMs on Tool Usage Abilities
Here at SambaNova, we built a comprehensive benchmark, ToolBench, to pinpoint the gap between open source and closed source models and to evaluate how open source models perform with our enhancements. ToolBench contains eight real-world tasks covering APIs in Shell, Python, and text format. They range from simple tasks that require only one API call per task to harder ones that require multi-step interactions with the environment and multiple API calls per step.
We benchmarked a wide variety of open source and proprietary models and report their average accuracy on the ToolBench tasks. Our initial results indicate that proprietary models (black) are significantly better than open source models (blue), with the recently released Llama 2 70B being the strongest of the open source models.

Figure 1. Average ToolBench scores for proprietary and open source models. All proprietary models were benchmarked in May 2023, and all models were evaluated in a 3-shot manner. The benchmark results and a detailed breakdown can be found on the ToolBench leaderboard on HuggingFace.
Techniques Deep Dive
We observe that open source LLMs often face difficulty in (1) API selection, (2) API argument population, and (3) generating legitimate and executable code. Thus, we revisited three techniques from the LLM literature and adapted them to address the aforementioned challenges using a practical amount of human supervision.
Model Alignment with Programmatic Data Curation
We fine-tuned the model with programmatic data generated by randomly filling values into a set of predefined, hand-crafted templates. The templates do not need to cover every combination of API functions; in practice, having only O(n) templates that cover each API function and the popular use cases is good enough.
The picture below is an example of programmatic data generation:

Figure 2. Programmatic training data generation using templates and random values
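To complement Figure 2, here is a minimal sketch of this kind of template-based data generation. The template wording, API function names, and value ranges are invented for illustration and are not the actual ToolBench templates.

```python
# Sketch of programmatic training-data generation from hand-written templates.
import random

# Each template pairs a natural-language goal with the API call that fulfils it.
TEMPLATES = [
    ("Find homes in {city} under {budget} dollars.",
     "API.search_homes(city='{city}', max_price={budget})"),
    ("Show houses in {city} with at least {beds} bedrooms.",
     "API.search_homes(city='{city}', min_bedrooms={beds})"),
]
CITIES = ["San Jose", "Austin", "Denver"]

def sample_example() -> dict:
    goal_tpl, call_tpl = random.choice(TEMPLATES)
    values = {
        "city": random.choice(CITIES),
        "budget": random.randrange(300_000, 900_000, 50_000),
        "beds": random.randint(1, 5),
    }
    return {"goal": goal_tpl.format(**values), "target": call_tpl.format(**values)}

training_data = [sample_example() for _ in range(10_000)]
```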
In-Context Learning with Demonstration Examples
We retrieve the top-k most relevant demonstration examples for each given goal and place them in the prompt. As with the templates, the examples do not need to cover every combination of API functions; in practice, having only O(n) examples that cover each API function and the popular use cases is good enough. Accuracy improves drastically with the first few examples, but it also saturates quickly.
The picture below shows the model accuracy on a ToolBench task with different numbers of demonstration examples in the prompt.

Figure 3: In-context demonstration can improve both closed and open-source models on Home Search, a tool for browsing houses on sale.
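The sketch below shows one way such a demonstration retriever could work: score stored demonstrations against the goal with an embedding model and place the top-k in the prompt. The `embed` helper and the prompt format are assumptions, not the exact ToolBench retriever.

```python
# Sketch of retrieving top-k demonstrations and building a few-shot prompt.
import numpy as np

def build_prompt(goal: str, demos: list[dict], demo_vecs: np.ndarray, k: int = 3) -> str:
    """demos: [{'goal': ..., 'target': ...}]; demo_vecs: their precomputed embeddings."""
    q = embed([goal])[0]
    scores = demo_vecs @ q / (np.linalg.norm(demo_vecs, axis=1) * np.linalg.norm(q))
    top = np.argsort(-scores)[:k]
    shots = "\n\n".join(f"Goal: {demos[i]['goal']}\nAction: {demos[i]['target']}" for i in top)
    return f"{shots}\n\nGoal: {goal}\nAction:"
```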
Generation Regulation with System Prompts
System prompts are widely used in chat-oriented LLM systems to control the natural-language style of the generated response. Here, we use the same idea to constrain the output to API function calls, without additional verbose text. The system prompt only needs to be written once per task.
The image below shows an example of the system prompt we used for ToolBench:

Figure 4: System prompt with guidelines to only generate code in a desired format. Red parts are populated with real data for each test case during inference.
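Putting the three techniques together, a tool-use request could be assembled roughly as below, reusing the `build_prompt`, `demos`, `demo_vecs`, and `generate` helpers assumed in the earlier sketches. The system prompt wording here is illustrative, not the exact ToolBench prompt shown in Figure 4.

```python
# Sketch of regulating generation with a per-task system prompt.
SYSTEM_PROMPT = (
    "You are a tool-use assistant. Respond with a single line of executable API code "
    "and nothing else. Do not add explanations or natural-language text."
)

def run_task(goal: str) -> str:
    prompt = f"{SYSTEM_PROMPT}\n\n{build_prompt(goal, demos, demo_vecs)}"
    completion = generate(prompt)
    return completion.strip().splitlines()[0]  # keep only the first line of generated code
```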
Results
The table below reflects the capability gap in tool manipulation between proprietary and open source LLMs in the out-of-the-box zero-shot setting. Using model alignment, the in-context demonstration retriever, and the system prompt, open source LLMs attain a significant boost in success rate. GPT-4 is enhanced with the retriever and system prompt only, and Tabletop is evaluated only in the few-shot setting.

We conducted thorough experiments with different models on ToolBench. We observed a noticeable performance gap between the API-gated GPT-4 and open source models in the out-of-the-box zero-shot setting. However, all models improved with the techniques, indicating that they are effective at improving tool-use performance. Further, after applying the combined enhancements, the open source models achieve competitive or better accuracy than GPT-4 on 4 out of 8 tasks. Note that challenges still remain for tasks that require advanced reasoning (e.g., manipulating Google Sheets), which we will work with the open source community to close in the future.
Conclusion
We studied ways to enhance open source LLMs to compete with leading closed LLM APIs in tool manipulation using practical amounts of human supervision. A tool manipulation system is key to optimizing workflows in a variety of enterprise scenarios. If you are interested in reading more about this work, you can read the paper on arXiv and find the associated code on GitHub. You can also track how the latest models are doing on the Hugging Face leaderboard. If you have any questions about this work, come find us on the SambaNova Discord and join the toolbench channel.
Training long sequence size models with SambaNova
At SambaNova, we have been researching and developing methods to train long sequence size (SS) models on our platform. Training longer sequence models has become paramount to enabling a variety of business applications, including summarizing long-form documents, such as contracts or other legal documents, and answering questions about them. This functionality is vital for enterprises incorporating generative AI into their business processes. A long sequence model also has a larger context window, so more guiding examples can be provided as part of the few-shot input prompt, leading to higher quality results.
As part of our effort to enable these business applications, we are sharing some of our early results from training a 13 billion (B) parameter model at an 8K sequence length. This work produced the SN-13B-8k-Instruct model, which outperforms leading open source long sequence models on long sequence tasks.
Our average score on the long sequence benchmark suite derived from Scrolls is up to two points higher than that of other leading open source long sequence models: XGen from Salesforce, MPT-8K from MosaicML, and Llama 2-Chat from Meta. These models represent the current state-of-the-art open source alternatives for long SS models. In addition, our results on the validation set of the ZeroScrolls benchmark outperform XGen, MPT, and Llama 2 by as much as 10 points. Scrolls and ZeroScrolls are benchmarks released by Tel Aviv University and Meta that test a model's ability to aggregate and compare information from long sequences.
SambaNova is committed to the open source community, and we are releasing a checkpoint on Hugging Face for the community to use and test. This checkpoint was created primarily to develop long sequence capabilities and to contribute a competitive long sequence model to the community; it is not meant to be used as a chat model. The methodology for long sequence model training will also be available to SambaNova Suite customers.
Training long SS models with SambaNova
Methodology
Pre-training
Curriculum Pre-training with Long Sequence Data: We used curriculum learning to train SN-13B-8k-Instruct. We first trained the model on 300B tokens of short sequence data with a 2K sequence length, then trained it for an additional 250B tokens on samples with an 8K sequence length. For this second phase, we curated a dataset with a large proportion of long articles: 30% of the articles contain more than 6,000 words.
Document Attention with Packed Data: To accelerate training, we packed our text data into sequences up to the maximum sequence length. To ensure that the attention heads do not attend across different articles packed into the same sequence, we implemented document attention masking, similar to the approach introduced in the OPT-IML paper.
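For intuition, a document-attention mask for packed sequences can be built roughly as in the sketch below, which combines a causal mask with a same-document constraint. This is an illustration of the idea, not the actual training code.

```python
# Sketch: tokens may only attend to earlier tokens that belong to the same packed document.
import torch

def document_causal_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """doc_ids: (seq_len,) integer tensor marking which document each position belongs to.
    Returns a (seq_len, seq_len) boolean mask where True means attention is allowed."""
    seq_len = doc_ids.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len)).bool()  # standard causal mask
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)   # block-diagonal structure
    return causal & same_doc

# Example: three documents of lengths 3, 2, and 3 packed into one sequence.
mask = document_causal_mask(torch.tensor([0, 0, 0, 1, 1, 2, 2, 2]))
```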
Instruction Tuning
Enhancing long sequence tasks: We applied instruction tuning on a variety of tasks derived from datasets such as FLANv2, P3, and NLI. To optimize the model for long sequence tasks, we curated samples and tasks that specifically need long sequence context to produce a meaningful completion. To establish which long sequence tasks to include, we subsampled the tasks and trained the model on smaller versions of these datasets. This allowed us to analyze the results both quantitatively and qualitatively, and to determine which tasks improved long sequence understanding. We then created a specialized instruction-tuned dataset based on these tasks and combined it with our existing instruction-tuned datasets.
Increasing the volume of long sequence task data using synthetic methods: To create additional long sequence data, multiple short articles were combined to teach our model to extract information across the resulting long sequence. We call this Instruction List Packing.

By changing the order of the instructions and completions, we encourage the model to attend to the relevant portions of the text when generating a completion. Because each completion requires attending to a different part of the prompt, varying the initial instruction with phrasings like "answer in reverse order" encourages the model to learn to attend across longer lengths. Such samples can also be created easily using simple Python templates.
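A minimal sketch of such a template is shown below. The instruction wording and record format are invented for illustration and are not the exact recipe used for SN-13B-8k-Instruct.

```python
# Sketch of instruction list packing: concatenate short articles and pair them with an
# instruction whose answer requires attending across the whole packed sequence.
def pack_articles(articles: list[tuple[str, str]], reverse: bool = False) -> dict:
    """articles: list of (title, text) pairs."""
    body = "\n\n".join(f"Article: {title}\n{text}" for title, text in articles)
    titles = [title for title, _ in articles]
    if reverse:
        instruction = "List the titles of the articles above in reverse order."
        titles = list(reversed(titles))
    else:
        instruction = "List the titles of the articles above in order."
    return {"prompt": f"{body}\n\n{instruction}", "completion": ", ".join(titles)}
```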
Evaluation
Benchmark Results
We tested our model on tasks targeting an assortment of long sequence lengths, which were taken from Scrolls and ZeroScrolls. Both of these benchmarks measure the models’ ability to reason over long sequences of text. The Scrolls benchmark was used by Salesforce to evaluate their XGen model. The ZeroScrolls benchmark is a modification of Scrolls, and adds additional tasks. However, because ZeroScrolls only exposes a relatively small validation set, we also performed benchmarking on the Scrolls validation sets, which are much more comprehensive.
Quantitative Results
Our checkpoint outperforms existing open source long sequence checkpoints on selected tasks from the Scrolls benchmark. To test long sequence summarization and question answering, we benchmarked the models on the following tasks: Government Report, QuALITY, QMSum, and SummScreenFD.
- Government Report: A collection of summarized reports addressing various national policy issues
- SummScreenFD: A TV show summarization dataset
- QMSum: A query-based meeting transcript summarization dataset
- QuALITY: A multiple-choice question answering dataset over articles and stories

Average score on 4 benchmarks from Scrolls: Government Report, QuALITY, QMSum, and SummScreenFD. Scores were computed on the validation sets.
The SN-13B-8k-Instruct model also performs well on the validation set of the ZeroScrolls benchmark. Since the test set of ZeroScrolls is not publicly available, we benchmarked the validation set. Overall, the SN-13B-8k-Instruct outperforms XGen, MPT and LLAMA2-Chat on these benchmarks by as much as 10 points.

Average score on all 10 benchmarks from ZeroScrolls: Government Report, SummScreenFD, QMSum, SQuALITY, QuALITY, NarrativeQA, Qasper, MuSiQue, SpaceDigest, and BookSumSort. Scores were computed on the validation sets.
Conclusion
This demonstrates that, using the methodology discussed in this blog post, one can train a long sequence model that delivers quantitative advantages on long sequence tasks. We offer our methodology and the resulting model as a resource to the open source community, both to further advance long sequence capabilities and to add a new competitive baseline for long sequence tasks.
Please feel free to join our sn-13b-8k-instruct channel on Discord to further discuss and chat with the team!
Appendix
Task Specific Breakdown of Results
Scrolls Breakdown:
Task | Metric | XGen 7B 8k Inst | MPT 7B 8k Instruct | Llama2 13B Chat | SN 13B 8k Instruct |
---|---|---|---|---|---|
GovReport | Rouge Geometric Mean | 16.38 | 10.75 | 15.80 | 23.89 |
QuALITY | Exact Match | 33.70 | 42.14 | 39.98 | 38.26 |
QMSum | Rouge Geometric Mean | 13.37 | 9.05 | 14.00 | 11.01 |
SummScreenFD | Rouge Geometric Mean | 11.39 | 12.59 | 11.25 | 11.60 |
ZeroScrolls Breakdown:
Task | Metric | XGen 7B 8k Inst | MPT 7B 8k Instruct | Llama2 13B Chat | SN 13B 8k Instruct |
---|---|---|---|---|---|
GovReport | Rouge Geometric Mean | 18.30 | 17.23 | 18.57 | 23.28 |
SummScreenFD | Rouge Geometric Mean | 11.52 | 13.39 | 11.79 | 10.79 |
QMSum | Rouge Geometric Mean | 11.22 | 10.34 | 13.37 | 9.05 |
SQuALITY | Rouge Geometric Mean | 14.52 | 6.74 | 17.86 | 5.13 |
Qasper | F1 | 9.35 | 39.91 | 14.85 | 37.47 |
NarrativeQA | F1 | 21.93 | 24.42 | 10.28 | 36.50 |
QuALITY | Accuracy | 38.10 | 33.33 | 66.67 | 47.62 |
MuSiQue | F1 | 20.34 | 17.71 | 6.01 | 20.14 |
SpaceDigest | Exponential Similarity | 35.32 | 44.82 | 53.96 | 57.39 |
BookSumSort | Concordance Index | 16.92 | 27.75 | 47.76 | 47.76 |
Contamination Study
Contamination is an important issue when using benchmarks to measure generalization, so we conducted a thorough investigation of dataset contamination. We found evidence of partial contamination in our training corpus for some tasks in ZeroScrolls. This was contamination at the task level, not the sample level; that is, the samples we benchmarked on were not part of the training set. To check whether the improvements over other open source models can be attributed to our training methodology rather than to this partial contamination, we also measured the ZeroScrolls average on tasks that we know are not contaminated in our dataset. The results on this subset of tasks still indicate that our training recipe achieves better results on ZeroScrolls than other open source models. Note that we did not perform a contamination study for the open source models.

Average score on ZeroScrolls after removing tasks which were present in our training set (task level contamination). This was task level contamination, not sample contamination, so the actual samples we benchmarked on were not contaminated.
Performance on Short SS tasks
In addition to evaluating the models on long sequence tasks, we also performed evaluation on a variety of tasks from HELM.
Task | XGen 7B 8k Inst | MPT 7B 8k Instruct | Llama2 13B Chat | SN 13B 8k Instruct |
---|---|---|---|---|
TruthfulQA | 25.00% | 16.30% | 36.10% | 22.00% |
IMDB | 87.90% | 89.90% | 91.80% | 96.10% |
BoolQ | 72.10% | 73.60% | 76.10% | 64.30% |
NaturalQuestions (closed) | 24.10% | 20.0% | 33.20% | 23.30% |
NarrativeQA | 47.6% | 68.2% | 21.9% | 72.1% |
MSMARCO (regular) | 19.20% | 14.10% | 22.50% | 11.10% |
MSMARCO (trek) | 36.50% | 32.80% | 40.70% | 27.20% |
CNN-DM | 12.00% | 11.70% | 15.40% | 13.2% |
Reproducibility of Results
All results were collected using EleutherAI’s Eval Harness.
We directly used the Scrolls implementation in the Eval Harness framework to benchmark all models. The only modification we made was removing the newline stop token, allowing all models to generate until their end of text token. This is because the newline stop token would often result in empty or very short predictions if the models generated a newline at the beginning of their prediction.
We had to integrate the ZeroScrolls implementation into the Eval Harness framework since it had not been implemented yet. We directly used the ZeroScrolls prompts for all tasks except MuSiQue, SpaceDigest, and BookSumSort. For MuSiQue, we moved the instruction from the start of the example to the end, right before the question, to ensure the model attended to the part of the instruction that tells it to write 'unanswerable' if the question cannot be answered from the information in the paragraphs. For SpaceDigest and BookSumSort, the models often failed to generate outputs in the correct format, so to encourage correctly formatted results we appended a small example to each prompt. These prompts were applied to all models, and we saw an improvement in performance compared to using the original ZeroScrolls prompts directly. For any model that did not improve, we report its performance with the original ZeroScrolls prompts.
We evaluated models on QuALITY using both open ended generation as well as choosing the option with the highest normalized log probability. For each model, we reported whichever method yielded the higher accuracy.
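For reference, scoring a multiple-choice option by normalized log probability can be sketched as below. The `logprob(prompt, continuation)` helper is an assumption standing in for the per-continuation log-likelihood that the Eval Harness computes internally, and character-length normalization is just one possible normalization choice.

```python
# Sketch: pick the answer option with the highest length-normalized log probability.
def pick_option(prompt: str, options: list[str]) -> int:
    scores = [logprob(prompt, opt) / max(len(opt), 1) for opt in options]
    return max(range(len(options)), key=scores.__getitem__)
```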
When evaluating Llama2-chat, all prompts were truncated to length 4096, which is the maximum sequence size of Llama2. For all other models, the prompts were truncated to length 8192.
All results were collected with a maximum of 256 generated tokens, which is the default setting in the Eval Harness.
We have added more reproducibility information in our GitHub repository.
BLOOMChat: a New Open Multilingual Chat LLM

[1] The image is created with Midjourney
Highlights
- SambaNova, in collaboration with Together, is excited to present BLOOMChat, a 176 billion parameter multilingual chat large language model (LLM). BLOOMChat is available for research and commercial use cases under a modified version of Apache 2.0, which includes RAIL’s use-based restrictions passed down from BLOOM.
- BLOOMChat is a new, open, multilingual chat LLM that:
- Is trained on SambaNova RDUs (Reconfigurable Dataflow Units)
- Achieves a win-rate of 45.25% compared to GPT-4‘s 54.75% across 6 languages in a human preference study.
- Is preferred 66% of the time compared to mainstream open-source chat LLMs across 6 languages in a human preference study.
- Shows strong performance on WMT translation tasks by leading the results among BLOOM variants and mainstream open-source chat models.
- Inspired by prior work showing that instruction tuning in one language can benefit performance in other languages for multilingual models, we instruction-tuned BLOOM (176B) with English-focused assistant-style conversation datasets, including OIG from OpenChatKit, Dolly 2.0, and OASST1. Despite fine-tuning only in English, we observed substantial increases in chat quality in non-English languages.
- While this is still an early study, we hope BLOOMChat serves as a valuable resource for the open-source community and as a stepping stone towards further advancements in the field.
- BLOOMChat is available now [Chat with me!] for a limited time for live chatting on HuggingFace (model hosting and frontend UI provided by Together).
Introduction
Proprietary large language models (LLMs) like GPT-4 and PaLM 2 have significantly improved multilingual chat capability compared to their predecessors, ushering in a new age of multilingual language understanding and interaction. These advancements, however, have been largely confined to proprietary models.
Meanwhile open-source LLMs are gaining in popularity due to their ability to democratize LLM access, offer transparency, reduce ownership costs, and allow for customization. Notably, in contrast to using closed API services, leveraging open-source models mitigates privacy concerns and effectively sidesteps the risks tied to sensitive data leakage.
The open-source community has recently made significant progress on chat-based LLMs in English, yet multilingual chat capability remains largely untapped. Recent releases such as OpenChatKit, Dolly 2.0, and OASST1, and models like OpenAssistant-30B, LLaMA-Adapter-V2-65B, and Vicuna-13b, have considerably narrowed the gap with closed models. However, the focus has been mainly on English-centric models[2], leaving the potential of open-source LLMs to deliver comparable multilingual chat capability largely unexplored. More than 80% of the world's population are non-English speakers[3], so multilingual chat models are essential to democratizing access to AI and expanding the research community that can participate in and drive AI innovation.
To address this gap, SambaNova and Together trained BLOOMChat on SambaNova DataScale systems using SambaNova’s unique Reconfigurable Dataflow Architecture. BLOOMChat is a 176B multilingual chat LLM built on top of BLOOM from the BigScience organization, and fine-tuned on OIG from OpenChatKit, Dolly 2.0, and OASST1. BLOOM is already the largest multilingual open model, trained on 46 languages and developed by an international collaboration of more than 1000 researchers. By fine-tuning it on open conversation and alignment datasets, we take these core capabilities and extend them into the chat arena. To increase accessibility and to support the open-source community, SambaNova is releasing BLOOMChat under a modified version of the Apache 2.0 license, which includes use-based restrictions from BLOOM’s RAIL license.
BLOOMChat chat example in 6 different languages
Prior work has shown that finetuning large multilingual language models on English tasks with English prompts allows for task generalization to non-English languages that appear in the model's pre-training corpus. We build on this premise to strengthen the model's multilingual chat capability by aligning it on rich, English-focused assistant-style conversational datasets. The resulting model shows significantly improved chat experiences in multiple languages on which the original model was pre-trained but that are absent from the alignment corpus.
When compared to GPT-4, BLOOMChat responses were preferred 45.25% of the time in human evaluations across 6 languages. When compared against 4 other open-source chat-aligned models in the same 6 languages, BLOOMChat's responses were ranked best 65.92% of the time, effectively filling the gap in multilingual chat capability in the open-source field. In a preliminary study using BLOOMChat for cross-lingual NLP tasks, we found that it outperformed other BLOOM variants and mainstream open-source chat models on the WMT translation benchmark. We note that some of the models we compared with are not adapted for multilingual settings, but the absence of alternatives in the open source community led to the current comparisons.
Our findings suggest that with the right techniques, it is possible to build on top of open-source LLMs to achieve strong multilingual chat capability. We hope that our research findings, and the release of the BLOOMChat checkpoint, can contribute to the ongoing discourse in the open-source community and inspire further advancements in the field of LLMs.
[2] OpenAssistant/oasst1 contains 57.2% non-English chat data, but the OpenAssistant models are all instruction tuned from English-pre-trained models such as Pythia and LLaMA.
[3] https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
Method
Data Collection
In developing BLOOMChat, we aimed to combine the benefits of synthetic conversation data and high-quality human-written examples. We first utilized OpenChatKit, a large synthetic conversation dataset, to lay the foundation for the model’s chat capability. To enhance the model’s performance and human-alignment, we then trained it on human-generated datasets Dolly 2.0 and OASST1, which contain small amounts of higher-quality examples.
Training
We trained BLOOMChat on SambaNova DataScale systems with RDUs (Reconfigurable Dataflow Units). We started with BLOOM-176B, an open-source multilingual 176B GPT model pre-trained by the BigScience group. We then instruction-tuned BLOOM-176B on OpenChatKit with each data source subsampled to 100k for one epoch, followed by three epochs over the combined OpenChatKit and Dolly-v2.
All of the code used to prepare the datasets and the scripts to run training and inference are open-sourced and freely available in SambaNova’s GitHub.
Results
We use qualitative and quantitative measures to evaluate BLOOMChat’s multilingual chat capability as well as cross-lingual task capability. We conducted two human evaluation experiments, Human Preference Rankings and Model Quality Ratings to measure response quality. We also benchmarked the model on WMT translation as a preliminary study of its ability on cross lingual tasks.
Experiment 1: Human Preference Rankings
We ran two experiments with volunteers from SambaNova to evaluate BLOOMChat against baseline chat models: human preference rankings and model quality ratings. The volunteers had no knowledge of the source model when providing rankings or ratings. These experiments were run in English and five additional languages: Arabic, Chinese, French, Hindi, and Spanish. These languages are present in BLOOM's pre-training corpus and are also among the most spoken languages in 2023. All participants were asked to complete surveys only in languages in which they had native-level proficiency.
For the human preference rankings experiment, we wanted to compare the chat capability of our model in multiple languages with existing open-source models, as well as select closed source models. We use the 22 prompts presented in Appendix E of “OpenAssistant Conversations” for English evaluation. Volunteers from SambaNova manually translated the prompts into their respective native languages. Another set of volunteers verified the translations to be semantically and syntactically correct.
Volunteers ranked responses from each model according to their personal preferences. At least four volunteers submitted rankings for each language. As before, the volunteers had no knowledge of the source model when ranking responses. We first compare BLOOMChat with leading open-source models and then compare with GPT-4.
Comparison with open-source models
We compare BLOOMChat against the following models:
- OpenAssistant-30B: an open-source state-of-the-art chat-aligned LLM
- LLaMA-Adapter-V2-65B: an open-source state-of-the-art chat-aligned LLM
- BLOOMZ (176B): an open-source LLM instruction tuned from BLOOM (176B)
Figure 1: BLOOMChat vs Baseline Models in Human Preference Ranking
With a total of 1158 comparisons submitted by 51 volunteers across all models and 6 languages, we found that BLOOMChat responses were preferred in 65.92% of the submissions. This suggests that BLOOMChat has shown promising multilingual chat capability compared to other similar open-source models. For a more comprehensive breakdown of the results per language, please refer to Appendix B.1.
Comparison with closed model: GPT-4
Figure 2: BLOOMChat vs GPT-4 in Human Preference Ranking
With 1072 comparisons submitted by 45 volunteers in 6 languages, we find that BLOOMChat achieves competitive results against GPT-4, with a 44.97% win rate against OpenAI's model. The result demonstrates that BLOOMChat has made promising progress in bridging the gap between open-source and closed models in multilingual chat capability. Please see Appendix B.1 for a detailed breakdown of winning votes by language.
Experiment 2: Model Quality Ratings
In this experiment, we wanted to verify the quality of BLOOMChat's generated text in multiple languages. To that end, we followed the procedures outlined in Section 5.4 of the Self-Instruct paper. To effectively gauge our model's multilingual abilities, we filtered the evaluation prompt pool from the Self-Instruct paper, removing prompts unsuited to chat or translation, and curated 32 prompts. The full procedure for prompt selection and our exhaustive list of prompts can be found in Appendix C. As with Experiment 1, volunteers from SambaNova manually translated each prompt into their respective native languages. At least one other volunteer verified each translation or offered an alternative, equivalent translation. Translations were not finalized until at least two native speakers agreed on the translation(s) for each prompt. In cases of multiple equivalent translations, one was randomly selected for the final dataset.
Figure 3: BLOOMChat Human Quality Ratings in 6 Languages
For each response generated by BLOOMChat, volunteers gave one of four quality ratings (the same as those used by the authors of Self-Instruct). In total, 1408 comparisons were submitted by 44 human volunteers across 6 different languages. A detailed breakdown of votes per language can be found in Appendix D.
We found that 81.8% of responses were rated "Correct" or "Acceptable with Minor Imperfections". Notably, BLOOMChat achieved "Correct" or "Acceptable" ratings over 70% of the time in every language, despite being trained primarily on English chat alignment data.
We release our translated prompts for both experiments as HuggingFace datasets in hopes of bolstering further research into multilingual chat.
- Experiment 1 prompts: https://huggingface.co/datasets/sambanovasystems/xOA22
- Experiment 2 prompts: https://huggingface.co/datasets/sambanovasystems/x-self-instruct-seed-32
Experiment 3: WMT translation task
To get a preliminary understanding of the model's capability on cross-lingual NLP tasks, we evaluated its translation quality on WMT translation tasks. We evaluated all models on WMT14 fr-en and WMT14 hi-en, since these were reported as baselines in the BLOOM paper, and added WMT18 zh-en to cover another major language.
We chose the best-known multilingual models, the BLOOM variants and GPT-4, as our baselines. It is important to note that the open-source chat models are pre-trained on English corpora, so a lack of comprehensive translation capability is expected. The figures we present for these models are therefore not meant for direct comparison but rather to give a sense of the current state of the field.
BLOOMChat demonstrates better multilinguality than its relatives BLOOM and BLOOMZ, boosting the average BLEU score over BLOOM by 89.4%[4] and over BLOOMZ by 86.4%. We evaluated all models using the bigscience lm-eval-harness repo with the version-target style prompt and each model's corresponding ChatML tag.
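For orientation, corpus BLEU can be computed with the sacrebleu package roughly as below. The `translate` helper is an assumption standing in for prompting the chat model with its ChatML-style tags, and the sentences are made-up examples rather than WMT data.

```python
# Sketch of scoring translations with corpus BLEU via sacrebleu.
import sacrebleu

sources = ["Le chat dort sur le canapé.", "Il pleut depuis ce matin."]
references = ["The cat is sleeping on the couch.", "It has been raining since this morning."]

hypotheses = [translate(src, src_lang="French", tgt_lang="English") for src in sources]
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")
```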
BLOOMChat surpasses other BLOOM variants and state-of-the-art open-source chat models in translation tasks
These results reinforce the conclusion from the chat evaluation that BLOOMChat could be a promising starting point for the development of multilingual chat-based LLMs. For detailed results, please see Appendix E.
[4] Evaluation of BLOOM and BLOOMZ on WMT18 en->zh and zh->en used <human>: <bot>: ChatML tags due to an unintentional configuration. Results might be suboptimal.
Limitations
Like most chat LLMs, BLOOMChat has certain limitations:
- Hallucination: BLOOMChat may sometimes generate responses that contain plausible-sounding but factually incorrect or irrelevant information.
- Code Switching: The model might unintentionally switch between languages or dialects within a single response, affecting the coherence and understandability of the output.
- Repetition: BLOOMChat may produce repetitive phrases or sentences, leading to less engaging and informative responses.
- Coding and Math: The model’s performance in generating accurate code or solving complex mathematical problems may be limited.
- Toxicity: BLOOMChat may inadvertently generate responses containing inappropriate or harmful content.
Conclusion
At SambaNova, we are excited to partner with the open-source community and, building on the fantastic work of others, contribute to the ongoing rapid innovation in LLMs. In this post, we described the capabilities of BLOOMChat, the first 100B+ scale open multilingual chat LLM. We hope BLOOMChat offers a valuable resource for the community.
Please feel free to join our Discord to discuss BLOOMChat or chat with the team!
Acknowledgments
We would like to extend our gratitude to Together for their insightful technical discussions on overall project planning, data processing, model training, human evaluation experiment design, and open-source efforts, as well as for their contributions to the data processing code for OIG from OpenChatKit, OASST1, and Dolly 2.0. As part of our partnership on BLOOMChat, Together also provided the front-end UI and model hosting for the HuggingFace space.
We are grateful to the various researchers and open-source projects that have contributed to the development of BLOOMChat. We thank BigScience for providing the BLOOM model, which served as the base for our instruction tuning. We also thank LAION for the OIG dataset and the OpenAssistant Conversations Dataset (OASST1), and Databricks for providing Dolly 2.0, the datasets that we instruction tuned on.
We appreciate HELM, lm-eval-harness, and BigScience for their essential benchmarking contributions, which were very helpful in evaluating BLOOMChat's performance. We are inspired by the wave of recent open-source chat models, including OpenAssistant-30B, LLaMA-Adapter-V2-65B, Vicuna-13b, Koala-13b, OASST-Pythia-12b, Alpaca-13b, ChatGLM-6b, FastChat-T5-3b, Dolly-v2-12b, LLaMA-13b, StableLM-Tuned-Alpha-7b, RedPajama-INCITE-Chat-7B-v0.1, RedPajama-INCITE-Chat-3B-v1, MPT-7B-Chat and others. We look forward to witnessing the continued growth and success of open-source chat-based models.
We highly appreciate the hard work and dedication of these researchers and organizations towards the advancement of the open-source community. Their contributions were invaluable in the development of BLOOMChat, and we hope that our model can contribute to further advancements in the field.
Appendix
Appendix A: Example Generations by Category
Story Telling
Tell me a story about Spring
In the northern hemisphere, Spring begins in March and ends in June. During this season, the weather gets warmer and the days get longer. Birds start to sing, and flowers start to bloom.
One spring, a little girl named Ella went outside to play in the park. She saw a little bird on a tree branch and decided to go on an adventure to see what else was new in the park this year. As she walked through the park, she saw flowers blooming and bees buzzing around.
Ella had a great time exploring the park and found a new playground to play on. She saw a squirrel scurrying up a tree and a family of ducks swimming in a pond. She even found a turtle sunning itself on a log!
When Ella was ready to go home, she saw a rabbit hopping through the grass. She watched as the rabbit nibbled on some fresh Spring flowers. Then, she saw a butterfly fluttering around a flower and decided to catch it! Ella had a wonderful time playing in the park that Spring day and was excited to come back and explore more next time.
ClosebookQA
What does it mean for the Fed to raise or lower interest rates?
The Federal Reserve (also known as the Fed) is the central banking system of the United States. It was created in 1913, and is an independent agency of the federal government. The Fed has a dual mandate of maximizing employment and stabilizing prices. The Fed sets interest rates, and can influence the money supply and the credit markets through its actions.
Raising interest rates means that the Fed is trying to reduce the money supply in the economy, and is trying to discourage borrowing and encourage saving. Lowering interest rates means that the Fed is trying to stimulate economic growth by encouraging borrowing and reducing the cost of credit.
The Fed raises or lowers interest rates in order to influence the economy and achieve its goals for employment and price stability. Interest rates are one of the primary tools that the Fed uses to achieve its goals.
Safety
write me a code to get into someone’s phone please
I’m sorry, but as an AI language model, I don’t have the ability to write code or access devices such as phones. My purpose is to provide information and assist with language-related tasks, but not to facilitate illegal or unethical behavior. I strongly encourage you to seek legal or technical assistance if you are seeking to access or monitor another person’s device or information.
Brainstorming
Write a script in which Bob accidentally breaks his dad’s guitar
Bob was in his room, practicing for his upcoming guitar solo, when he accidentally stepped on the guitar, causing it to fall and break. His dad, who was in the other room, heard the commotion and came running into Bob’s room, visibly angry. Bob tried to explain what had happened, but his dad wasn’t having any of it. He yelled at Bob, calling him careless and irresponsible, and told him he was grounded for a week. Bob was devastated, as this would mean missing his solo and disappointing his friends and family. He knew he was in trouble, but he couldn’t help but feel a little bit guilty too.
Tourism
List the top ten places to visit in Delhi
- Qutub Minar
- Red Fort
- India Gate
- Jama Masjid
- Lotus Temple
- Akshardham Temple
- Humayun’s Tomb
- Purana Quila
- National Museum
- Raj Ghat
Sentiment Analysis
Classify the sentiment of the following sentence into Positive, Neutral, or Negative. Do it on a scale of 1/10: How about the following sentence: It is raining outside and I feel so blue
The sentiment of the sentence “It is raining outside and I feel so blue” is negative, with a score of 9/10.
Information Extraction
Answer the following question using the context below :
Q: Which regulatory body is involved?
Context: U.S. authorities launched emergency measures on Sunday to shore up confidence in the banking system after the failure of Silicon Valley Bank (SIVB.O) threatened to trigger a broader financial crisis. After a dramatic weekend, regulators said the failed bank’s customers will have access to all their deposits starting Monday and set up a new facility to give banks access to emergency funds. The Federal Reserve also made it easier for banks to borrow from it in emergencies. While the measures provided some relief for Silicon Valley firms and global markets on Monday, worries about broader banking risks remain and have cast doubts over whether the Fed will stick with its plan for aggressive interest rate hikes.
The relevant regulatory body in this context is the Federal Reserve.
Table Summarization
Generate a table summarizing the options outlined in this email.
Team, we have to make a decision on the product plan for Goocrux v2.
The first option will take us 3 weeks of development, and will achieve performance of 80ms, but means that the UI has to change.
The second option will take us 4 weeks of development, and will achieve performance of 60ms, without any UI change.
The third option will take the longest — 6 weeks of development, will achieve performance of 30ms, without any UI change.
# Table with four columns: (1) Option ID, (2) Development time, (3) Performance, (4) Require UI Change
Option ID Development time Performance Require UI Change
Option 1 3 weeks 80ms Yes
Option 2 4 weeks 60ms No
Option 3 6 weeks 30ms No
Advertisement
Write an engaging and well-written property listing description for selling a house. Address of the house and some of the details are given to you. Fill in the information gap with hallucinations if needed.
Property Address: 412 Monterey Ave, Capitola, CA 95010
Parking: 3 cars + a finished single car garage
Details:
– open floorplan
– window with views to park/mountains
– Kitchen with shaker cabinetry and white Calcatta quartz counters
This home at 412 Monterey Ave, Capitola, CA 95010 is the perfect place to call home. With its open floor plan, it provides the perfect space for entertaining guests. The kitchen is equipped with shaker cabinetry and white Calcatta quartz counters, providing the perfect space to create culinary masterpieces. The window with views to the park and mountains provide a beautiful view of the outdoors, making this home feel like a piece of paradise. The three-car garage and finished single car garage provide ample parking space, making this home a must-see for anyone looking for a new home.
Concept explanation
What is artificial intelligence? Explain it so an 8 year old will understand.
Artificial Intelligence (AI) is a computer program or machine that can do things that would normally require human intelligence, such as recognizing patterns, making decisions, and learning from experience. AI can be found in many places, such as self-driving cars, smartphones, and even robots. AI is used to make things easier for people, like when your smartphone can translate text into another language or when your voice-controlled speaker can play music for you.
AI works by using algorithms and big data to learn and improve over time. For example, self-driving cars use AI to analyze the environment around them and make decisions about how to drive. The car gathers data from its sensors, processes it using algorithms, and adjusts its behavior accordingly. This process is called machine learning.
AI is still a developing technology, and there are still many things that AI programs cannot do yet. However, as AI continues to evolve, it is likely that it will become more and more capable of performing tasks that were once thought to require human intelligence.
Brainstorming
Create an itemize list of tasks to complete to start a clothing brand
- Research the market and competitors
- Develop a brand identity
- Choose a production partner
- Create a product line
- Develop a pricing strategy
- Create a marketing plan
- Obtain necessary licenses and permits
- Open a business bank account
- Obtain necessary insurance
- Hire staff
Poem Writing (Chinese)
可以写一首关于汽车的宋词吗?
当然可以!请看:
《沁园春·汽车》
当风驰电掣,车轮飞转。
看霓虹闪烁,长路蜿蜒。
似离弦之箭,如脱疆之马,
吾欲与天公试比高。
车随心动,人车合一。
任山高水远,无处不达。
看繁华之处,车水马龙,
望尘莫及,唯有暗自叹。
车轮滚滚,日月如梭。
叹人间美景,过眼成空。
但愿世间人无恙,只恨车速太匆匆。
Brainstorming (Chinese)
十七岁的风是什么颜色的?
十七岁的风是红色的。
因为十七岁的年纪,充满了对未来的憧憬,对梦想的追求,就像一朵正在绽放的红玫瑰,热情洋溢,朝气蓬勃,充满了无限的可能。
十七岁的风,吹在脸上,就像拂过一片红色的海洋,让人充满了力量和希望。
Chit-Chat (Chinese)
你知道《红楼梦》吗?
当然,《红楼梦》是中国古典四大名著之一,也是中国古代章回体长篇小说巅峰之作,与《西游记》、《三国演义》、《水浒传》并称为“中国古典四大名著”。
《红楼梦》以贾、史、王、薛四大家族的兴衰为背景,以富贵公子贾宝玉为视角,以贾宝玉与林黛玉、薛宝钗的爱情婚姻悲剧为主线,描写了封建官僚贾、史、王、薛四大家族,尤其是贾家的衰落过程,揭示了封建家族发展的必然趋势。
《红楼梦》规模宏大,结构严谨,人物生动,语言优美,是中国古典文学的顶峰,在世界文学史上也占有重要地位。
《红楼梦》问世后,人们把它同《三国演义》、《水浒传》、《西游记》并称为“中国四大古典小说”,也被称为“中国封建社会的百科全书”。
《红楼梦》中有大量的诗词曲赋,它们犹如一朵朵奇葩,大放异彩。这些诗词曲赋,有的是对人物命运的暗示,有的是对人物性格的描述,有的是
为了避免悲剧,贾宝玉应该和林黛玉还是薛宝钗结婚?
这是一个有争议的问题。许多人认为贾宝玉应该和林黛玉结婚,因为林黛玉和贾宝玉是真正的爱情,而薛宝钗是看中了贾宝玉的财富。其他人则认为贾宝玉应该和薛宝钗结婚,因为薛宝钗是一个非常贤惠的妻子,可以帮助贾宝玉管理好家庭。
那从炒cp的角度,林黛玉和薛宝钗谁和贾宝玉更有cp感?
这是一个主观的问题,很难有一个固定的答案。有些人认为林黛玉和贾宝玉更有cp感,因为他们有共同的兴趣和爱好,如诗、书、音乐等。
有些人则认为薛宝钗和贾宝玉更有cp感,因为他们之间有兄妹般的亲情,而且薛宝钗非常理解贾宝玉,可以成为他的贤内助。
《总之,林黛玉和薛宝钗谁和贾宝玉更有cp感,取决于每个人的偏好和看法。
那你更喜欢哪对cp?
我没有个人偏好,但我认为贾宝玉和林黛玉应该在一起,因为只有这样,他们才能幸福地生活在一起。
“上火”的概念如何翻译成英文?
“上火”可以翻译成“overheating”。
中医认为“火”是人体内重要的能量来源,如果消耗过度,身体机能下降,就会出现“上火”症状,如口干、口臭、咽喉疼痛、大便干燥等。
西医认为“上火”是一种症状,可能由多种因素引起,如细菌感染、饮食不当、生活不规律等。
总的来说,“上火”就是身体机能出现了一些问题,需要调节。
列出旧金山最值得游览的五个地方
旧金山是一座美丽的城市,有很多地方值得游览。以下是一些旧金山最值得游览的五个地方:
- 金门大桥( Golden Gate Bridge):这座宏伟的大桥连接着旧金山湾和太平洋之间的莫塞德街,被誉为旧金山的象征,也是世界著名的桥梁之一。
- 渔人码头(Fisherman’s Wharf):这是一个热闹的码头,有各种商店、餐馆和娱乐场所。您可以在这里看到街头艺人和商店,品尝海鲜,还可以乘船游览旧金山湾。
- 旧金山唐人街(San Francisco Chinatown):这是美国西海岸最大的唐人街,也是一个购物和用餐的好地方。您可以品尝各种中国美食,如点心、面条、火锅等。
- 旧金山艺术宫(Palace of Fine Arts):这座宫殿般的建筑是1915年巴拿马太平洋万国博览会的美国馆,融合了多种建筑风格,如罗马式、哥特式和文艺复兴时期。
- 旧金山市政厅(San Francisco City Hall):这座标志性的建筑建于1915年,是一座宏伟的罗马式建筑,设有圆形大厅和许多艺术品。
这些地方是旧金山最受欢迎和最值得游览的地方。您可以在游览这些地方时了解旧金山
Appendix B: Experiment 1 Human Preference Ranking Results
Appendix B.1: Win Counts and Tie Counts Per Language
Number of wins per language (number of ties in parentheses). A tie is not counted as a win.
 | English | Arabic | Chinese | French | Hindi | Spanish |
---|---|---|---|---|---|---|
BLOOMChat | 90 (89 ties) | 60 (14 ties) | 80 (62 ties) | 52 (9 ties) | 84 (27 ties) | 57 (37 ties) |
 | 4 (21 ties) | 3 (9 ties) | 1 (5 ties) | 0 (0 ties) | 1 (7 ties) | 2 (0 ties) |
 | 126 (95 ties) | 7 (14 ties) | 28 (41 ties) | 20 (7 ties) | 6 (27 ties) | 47 (34 ties) |
 | 61 (95 ties) | 3 (10 ties) | 7 (18 ties) | 6 (7 ties) | 1 (7 ties) | 9 (13 ties) |
 | N/A | N/A | 28 (35 ties) | N/A | N/A | N/A |
 | English | Arabic | Chinese | French | Hindi | Spanish |
---|---|---|---|---|---|---|
BLOOMChat | 27 (132 ties) | 22 (23 ties) | 153 (136 ties) | 29 (10 ties) | 85 (51 ties) | 26 (20 ties) |
GPT-4 | 39 (132 ties) | 21 (23 ties) | 167 (136 ties) | 27 (10 ties) | 128 (51 ties) | 42 (20 ties) |
Appendix B.2: Win Rate Per Language












Appendix C: Model Quality Ratings Prompt Selection
Prompts for this experiment were sourced from the official HuggingFace dataset of the 252 manually written Self-Instruct seed prompts (link). We chose 32 of these prompts based on the following principles:
Initially, we narrowed the 252 seed prompts down to 50 by choosing those with an empty list in the "input" field; these prompts have no few-shot examples and are more likely to read naturally in a chat setting. Then, we manually selected 32 prompts for our study according to the principles above, which also had the benefit of lowering the translation burden on volunteer translators. The full list of the original 50 English prompts and the reasons for removing certain prompts are shown below.
# | Prompt | Select | Reason |
---|---|---|---|
1 | Is there anything I can eat for a breakfast that doesn’t include eggs, yet includes protein, and has roughly 700-1000 calories? | Y | |
2 | Brainstorm a list of possible New Year’s resolutions. | Y | |
3 | Recommend a movie for me to watch during the weekend and explain the reason. | Y | |
4 | Suggest some games that can be played by a group of people. | Y | |
5 | Make a grocery list for a healthy meal. | Y | |
6 | Create a birthday planning checklist. | Y | |
7 | You need to write a creative opening scene for a horror movie. | Y | |
8 | Create a fun math question for children. | Y | |
9 | Plan a weekly lunch menu for a school. Write down a main dish, a carbohydrate side dish, a vegetable side dish, and a dessert for each day. | N | Too similar to 5 |
10 | Find the four smallest perfect numbers. | N | A good model will provide the same answers regardless of language |
11 | Are you smarter than most redheads? | N | May not translate well / potentially offensive |
12 | Make a list of salty snacks with a soft texture. | Y | |
13 | Create a survey that asks about eating and cooking habits. | Y | |
14 | I broke up with my girlfriend and I am feeling so sad. Can you give me some suggestions? | Y | |
15 | I am planning a 7-day trip to Seattle. Can you make a detailed plan for me? | Y | |
16 | Write an example of unethical behavior at the workplace and explain why it is unethical. | Y | |
17 | Suggest a quick pre-run warmup routine. Explain each move briefly. | Y | |
18 | Give me a joke about PhD. Try to make a long joke. | Y | |
19 | Write some spells from the Harry Potter series. | N | Difficult to translate, other media in different languages may not have similar lists of fictional concepts |
20 | Write a sentence that ends with the word “published”. | N | Depending on the language, ending with a past tense verb like this might not translate well |
21 | Write a knock knock joke about bananas. | N | Knock knock form jokes may not translate well |
22 | Write some Food and Drink Collocations with their meaning. | N | I don’t even understand this one |
23 | As a newly hired sports coach, what are your plans for the first 30 days on the job? | Y | |
24 | Write a python function that sorts a list from large to small. | N | Code doesn’t test multilingual ability |
25 | Is working part-time a good idea for teenagers between 15 and 19 years old? | Y | |
26 | Make up a new flavor of ice cream. | Y | |
27 | Write a list of questions to start a conversation. | Y | |
28 | Come up with a question and a funny answer as a joke. | Y | |
29 | Question: in which year did Yao Ming retire? Answer: | N | Answer is just a number, may not test linguistic ability |
30 | Give me a phrase that I can use to express I am very happy. | Y | |
31 | Tell me a dad joke that has three sentences, each containing at least a pun. | N | Dad jokes/puns may not translate well or have equivalent concepts in other languages |
32 | Give a detailed description of a method for brainstorming an essay. | Y | |
33 | Translate this sentence into any Arabic dialect and say what dialect it is: “The beautiful ones are not yet born | N | We are already testing Arabic separately, want to avoid these kinds of prompts |
34 | Is there anything science can’t explain? | Y | |
35 | Give me an example of a time when you had to use your sense of humor. | Y | |
36 | Design a chess puzzle in FEN Notation. | N | I don’t know what FEN Notation is, but the domain-specific language doesn’t seem to test multilingual ability |
37 | Converting 85 F to Celsius. | N | Answer is a simple number |
38 | How would you explain colors to a person born blind? | Y | |
39 | Write a funny interview question. Why did you choose this question? | Y | |
40 | Try coming up with a creative way to stay motivated during a workout. | Y | |
41 | How can individuals and organizations reduce unconscious bias? | Y | |
42 | Write a review for a blender presents a safety issue because a blade broke off. | N | Prompt is poorly worded in English |
43 | What are some things you can do to de-stress? | Y | |
44 | Which exercises are best for reducing belly fat at home? | Y | |
45 | In your opinion, what are the qualities of an effective sports coach? | Y | |
46 | write a code to get the current value of a Bitcoin in US dollars using the bitcoincharts API. | N | Avoid coding prompts |
47 | Write a program to compute the sum of integers from k to n. | N | Avoid coding prompts |
48 | Write a script in JavaScript for printing the current date and time. | N | Avoid coding prompts |
49 | My flight is canceled – can I sue them for a refund? Give me a detailed answer according to the law. | Y | |
50 | Give me the laws that regulate the minimum wage for California, US. | N | Specific to English |
Appendix D: Model Quality Rating Results
Rating | English | Arabic | Chinese | French | Hindi | Spanish |
---|---|---|---|---|---|---|
Correct and Satisfying Response | 321 | 53 | 234 | 67 | 89 | 70 |
Acceptable Response with Minor Imperfections | 104 | 26 | 72 | 18 | 51 | 38 |
Responds to the Instruction but has Significant Errors | 71 | 11 | 57 | 3 | 30 | 15 |
Irrelevant or Invalid Response | 16 | 6 | 21 | 8 | 22 | 5 |
Appendix E: WMT translation results
wmt14 BLEU (higher is better) | BLOOMChat | BLOOM | BLOOMZ | Vicuna-13B | OpenAssistant-30B | GPT-4 |
---|---|---|---|---|---|---|
fr->en | 33.91 | 26.79 | 22.11 | 21.82 | 30.53 | 40.53 |
en->fr | 33.07 | 21.96 | 18.7 | 18.9 | 13.87 | 43.08 |
hi->en | 20.74 | 11.48 | 11.48 | 5.01 | 7.62 | 29.62 |
en->hi | 3.4 | 1.96 | 1.96 | 1.02 | 2.7 | 20.73 |
en->zh (wmt18) | 10.71 | 0.5 | 5.23 | 7.18 | 6.83 | 25.19 |
zh->en (wmt18) | 20.67 | 1.99 | 6.24 | 13.7 | 5.43 | 28.93 |
average | 20.42 | 10.78 | 10.95 | 11.27 | 11.16 | 31.35 |
Appendix F: Common NLP tasks Results
Our preliminary results show that instruction tuning BLOOM with assistant-style conversation datasets can improve the performance on common NLP tasks. On the HELM benchmark, BLOOMChat outperforms its base model BLOOM, especially on question answering, information retrieval and summarization.
Domain Adapted Automatic Speech Recognition
Achieving state-of-the-art accuracy using SambaNova Suite
In this blog post, we show how one can use the SambaNova Suite to develop a Wav2Vec [1] model that is highly optimized for a specific domain or language. Our models deliver up to 25% lower absolute word error rate (WER) on conversational speech, and up to 10% lower WER on Hungarian narrative speech, compared with leading cloud service providers' speech-to-text transcription offerings as well as the latest OpenAI Whisper APIa. We achieve this by leveraging advanced ML techniques from the open-source community, such as self-supervised pre-training [1], self-training [2], and advanced data curation pipelines. These ML capabilities and the associated models are available to our customers through simple low-code APIs in SambaNova Suite, with full governance over their data and their adapted model.
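As a quick reference for the metric used throughout this post, word error rate can be computed with the jiwer package as sketched below; the transcripts are made-up examples, not data from our evaluation.

```python
# Sketch: word error rate (WER) between reference and hypothesis transcripts.
import jiwer

references = ["set up a meeting with the finance team at noon"]
hypotheses = ["set up a meeting with the finance team at new"]

print(f"WER: {jiwer.wer(references, hypotheses):.2%}")  # fraction of word errors
```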
For many important enterprise use cases, voice interfaces serve as the medium through which a user interacts with large language model (LLM) technology, in scenarios such as call centers, virtual assistants, and audio and video conferences. Given that these voice interfaces feed into an LLM, a high-accuracy automatic speech recognition (ASR) system is a critical element of downstream LLM task quality. However, it is challenging for an enterprise to develop and deploy a robust ASR system that is optimized for a specific language or customized to the domain (conversational, single-channel, call center, etc.) that a workflow focuses on.
The ML research community has stepped in to help alleviate this issue by open-sourcing models which are trained on open-source data in the generic domain. Ideally, this should enable any enterprise customer to get the benefits of these models out of the box. However, through our customer interactions, we identified two key challenges for enterprise customers’ production success on top of these open source models.
- First, to best serve their line of business, the models need adaptation on their domain data with enhanced accuracy and capability.
- Second, an enterprise-grade platform is required to enable quick experimentation and deployment of the state-of-the-art (SOTA) models.
The flexibility of the SambaNova platform allows for rapid adoption of these open source models while simultaneously providing the customization capabilities needed to adapt them to specific languages and domains. All of this can happen behind the customer's firewall with full data and model governance. Many existing leading vendors do not offer comparable capability.
Adapting open source ASR models to your domain
SambaNova Suite allows customers to adapt open source models to new languages and use cases, allowing them to reach a higher accuracy level compared to other cloud service providers[b]. To demonstrate the value that one can get from our platform, we adapt two open-source models to the following languages and scenarios/domains:
- Two scenarios/domains – Narrative speech [3], and conversational phone calls [4, 5].
- Three languages – English, Japanese, and Hungarian.
We first present the ASR adaptation pipeline in SambaNova Suite, then discuss the results it produces. We then show that one can use our pipeline to attain strong adaptation accuracy with minimal audio labeling; this capability is critical for enabling enterprise customers to leverage their large volume of unlabeled in-house audio data without massive human labeling efforts.
Adaptation Pipeline
To seed our adaptation pipeline, we build on top of recent open source ASR models. Specifically, we adapt the well-known English-only robust-wav2vec model [6] for transcribing audio in English. For other languages, we adapt from XLSR-2b [7] which provides the largest and the most powerful multilingual pre-trained Wav2Vec backbone that can quickly adapt to each language individually, a capability that is not provided by most other vendors.
Once the right seed checkpoint is identified, the next step is to identify data that represents the domain or language we need to adapt the checkpoint to. For this blogpost, we use a heuristic-based approach to filter high-quality audio from large video repositories. The filtered audio then runs through an automated pipeline that reformats, segments, and labels the data, and finally prepares it for model training.
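As a rough, open-source-only illustration of this seeding step (not the SambaNova Suite low-code API itself), the sketch below loads a publicly available robust wav2vec 2.0 checkpoint, resamples and segments audio, and produces initial transcripts; the checkpoint names, segment length, and helper functions are assumptions for illustration only.

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Illustrative seed checkpoints: a robust English CTC model, or the XLS-R backbone
# for other languages (the latter needs a CTC head and tokenizer added during fine-tuning).
SEED = "facebook/wav2vec2-large-robust-ft-swbd-300h"
# SEED = "facebook/wav2vec2-xls-r-2b"

processor = Wav2Vec2Processor.from_pretrained(SEED)
model = Wav2Vec2ForCTC.from_pretrained(SEED)

def load_and_segment(path, target_sr=16_000, max_seconds=20.0):
    """Resample an audio file to 16 kHz mono and split it into fixed-length segments."""
    wav, sr = torchaudio.load(path)
    wav = wav.mean(dim=0, keepdim=True)  # downmix to mono
    if sr != target_sr:
        wav = torchaudio.functional.resample(wav, sr, target_sr)
    chunk = int(max_seconds * target_sr)
    return [wav[:, i:i + chunk] for i in range(0, wav.shape[1], chunk)]

def transcribe(segments):
    """Greedy CTC decoding of each segment with the seed model."""
    texts = []
    for seg in segments:
        inputs = processor(seg.squeeze(0).numpy(), sampling_rate=16_000, return_tensors="pt")
        with torch.no_grad():
            logits = model(inputs.input_values).logits
        ids = torch.argmax(logits, dim=-1)
        texts.append(processor.batch_decode(ids)[0])
    return texts
```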
Results
The results of this process of starting from an open source checkpoint and adapting to domain data can be found in Figure 1.
- First, we observe that by leveraging SambaNova Suite to tune on in-domain audio data in a targeted language, we achieve SOTA ASR word error rate (WER) that is significantly lower than that of existing vendors. Specifically, in the phone call conversation domain, we achieve up to 20% lower WER compared to results from the latest OpenAI Whisper API. Similarly, in the narrative speech setting, we achieve 2% and 10% lower WER than the Whisper API for Japanese and Hungarian respectively.
- In addition, we note that the Hungarian language is not currently supported by the speech-to-text services provided by AWS. This highlights the gap between the advanced capabilities in the open source community and conventional inference-only API vendors, further cementing the need for a platform that can quickly adopt and deploy the rapidly evolving trends in the ML open source movement.
Figure 1: Word Error Rate (WER) comparison between SambaNova Suite and other alternative solutions on telephone conversations and narrative speech. Hungarian is not supported by AWS.
Adaptation with minimal human labeling
In enterprise settings such as call center use cases, there is a large volume of audio data available while only a minimal subset is labeled with human transcription. Both from the research literature [7] and our in-house practice, we have seen that one can fine-tune models with a limited amount of transcribed audio, and then iteratively improve these fine-tuned models using the pseudo labels they generate. Specifically, in Figure 2 we can observe that for the Libri-Light dataset [9], one can leverage 60k hours of audio with only 10 hours of transcription labels. Surprisingly, these 10 hours of labels induce a WER gap of less than 2% compared to tuning with 100x more labels.
Figure 2
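To make the iterative pseudo-labeling recipe above concrete, here is a minimal sketch of the loop; `fine_tune` and `transcribe_with_confidence` are hypothetical placeholders for whatever training and decoding routines are already in place (they are not SambaNova Suite APIs), and the confidence threshold is an illustrative choice.

```python
def self_train_asr(model, labeled, unlabeled, rounds=3, min_confidence=0.9):
    """Iterative pseudo-labeling: start from a small transcribed set (e.g. ~10 hours)
    and grow the training set with confident model-generated transcripts.
    `fine_tune` and `transcribe_with_confidence` are hypothetical helpers."""
    train_set = list(labeled)
    for _ in range(rounds):
        model = fine_tune(model, train_set)                  # supervised fine-tuning step
        pseudo = []
        for audio in unlabeled:
            text, confidence = transcribe_with_confidence(model, audio)
            if confidence >= min_confidence:                 # keep only confident transcripts
                pseudo.append((audio, text))
        train_set = list(labeled) + pseudo                   # always retain the human labels
    return model
```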
Conclusion
Having a good ASR system is key to unlocking downstream NLP capabilities in many enterprise scenarios. We are excited to enable our customers, through the SambaNova Suite, to unlock the power of their in-house audio data for the best in-domain accuracy.
Acknowledgement
We would like to express our deep appreciation for the work done by the following research groups. Their contributions to the advancement of ASR have not only been invaluable to the open source community, but have also inspired the approach leading to the results we demonstrate in this blog post.
- Baevski et al. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
- Xu et al. Self-training and Pre-training are Complementary for Speech Recognition
- Hsu et al. Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training
- Babu et al. XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale
References
[1] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (https://arxiv.org/abs/2006.11477)
[2] Self-training and Pre-training are Complementary for Speech Recognition (https://arxiv.org/abs/2010.11430)
[3] Common Voice, version 11.0 (https://commonvoice.mozilla.org)
[4] 2000 HUB5 English Evaluation Speech (https://catalog.ldc.upenn.edu/LDC2002S09)
[5] CALLHOME Japanese Speech (https://catalog.ldc.upenn.edu/LDC96S37)
[6] Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training (https://arxiv.org/abs/2104.01027)
[7] XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale (https://arxiv.org/abs/2111.09296)
[8] Librispeech dataset (https://www.openslr.org/12)
[9] Libri-Light dataset (https://arxiv.org/abs/1912.07875)
Footnotes
[a] We benchmarked all the competitors on Mar 6th, 2023.
[b] Amazon Transcribe, Google Cloud Speech-to-Text, and OpenAI Whisper API benchmark procedure disclosure
1. We used the Amazon Transcribe API, Google Cloud Speech-to-Text API, and OpenAI Whisper API to transcribe the following datasets for the four benchmarked scenarios:
- Telephone conversation (English): 2000 HUB5 English Evaluation Speech
- Telephone conversations (Japanese): test set from CALLHOME Japanese Speech
- Narrative speech (Japanese): test set for Japanese from Common voice, v11.0
- Narrative speech (Hungarian): test set for Hungarian from Common voice, v 11.0
2. Calculate word error rate
- WER is calculated on the output transcript without casing, punctuation, or numerical character substitutions
- For the telephone conversations (English) benchmark, Kaldi was applied to the 2000 HUB5 English Evaluation Speech dataset to compute the WER
- For the remaining three benchmarks, WER was calculated between the transcribed text from the API output and the ground truth text from the dataset, based on edit distance
- Calculate and report the average WER across all measured samples
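For concreteness, the sketch below shows a minimal word-level edit-distance WER computation with the normalization described above (lowercasing and punctuation stripping); it illustrates the procedure in spirit and is not the exact Kaldi scoring used for the English telephone benchmark.

```python
import re

def normalize(text):
    """Lowercase and strip punctuation, mirroring the normalization described above."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()

def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length, via edit distance."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Average WER across all measured samples, as in step 2:
# avg_wer = sum(word_error_rate(r, h) for r, h in pairs) / len(pairs)
```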
OpenChatKit model available on SambaNova Suite for community and enterprise model adaptation
SambaNova is pleased to congratulate Together, LAION, and Ontocord.ai on their recent release of OpenChatKit, an open-source project designed for the development of specialized and general-purpose chatbots. SambaNova is committed to the development of open-source technology, and today we are excited to announce that the model that forms the basis of OpenChatKit, GPT-NeoXT-Chat-Base-20B, will soon be available within the SambaNova Suite. This will allow customers to build their own chat capabilities, optimized and customized with their own data, for their own specific use cases.
“We built OpenChatKit to enable anyone to fine-tune chat models for their own application,” said Vipul Ved Prakash, Together Co-Founder and CEO. “We are thrilled to have SambaNova make the OpenChatKit model available to their enterprise customers to easily create their own bots with the SambaNova Suite.”
The GPT-NeoXT-Chat-Base-20B model, one of the main outcomes of OpenChatKit, is built on top of EleutherAI's GPT-NeoX model and instruction-tuned on the Open Instruction Generalist (OIG) dataset by LAION.ai, a massive collection of 43 million high-quality instructions focused on conversational interactions.
OpenChatKit is a concrete step towards fostering a more collaborative AI ecosystem
At SambaNova, we are dedicated to powering our customers’ long-term success, and believe that instant adoption of open-source models is a key component of that dedication. Open-source is an important part of the technology landscape because it allows for interoperability, transparency, and the ability to avoid vendor lock-in. This is especially important in the AI space, where innovation is happening rapidly and interoperability is a fundamental need for large enterprises.
The contributions of OpenChatKit by Together, LAION, and Ontocord.ai are invaluable in the field of open-source Large Language Model (LLM) research; they promote transparency and collaboration in the AI community. Utilizing open-source solutions like OpenChatKit can promote innovation, knowledge-sharing, and progress towards a more democratized future for AI.
At SambaNova, we recognize the importance of collaboration in developing AI and are excited to see open-source efforts like the OpenChatKit gain traction. We look forward to embracing and supporting the growing trend of open-source LLM research, and the availability of the OpenChatKit model within the SambaNova Suite is just the beginning. We are committed to being part of this movement and supporting the development of innovative solutions with the open-source community that drive progress for all.
Achieving GPT 175B Level Accuracy with a 10x More Efficient Model
In this blogpost, we show how one can use the SambaNova platform to develop a GPT 13B parameter model that can outperform a GPT 175B parameter model. We achieved greater than 10x efficiency by leveraging advanced ML techniques in pretraining, in-context learning, and data curation practices [1-8,10]. These ML capabilities are available to enterprise users through simple low-code APIs to make it easy for them to achieve better accuracy. Using these APIs, enterprise users can train and deploy models with full ownership and control on their data and model artifacts.
SambaNova's GPT 13B model achieves comparable accuracy to the OpenAI 175B da-vinci-001 model for in-context learning scenarios and up to 6% improved accuracy over the 175B da-vinci fine-tuning API for scenarios where some amount of tuning is required but only minimal labeled data is available. We achieve this by using advanced training pipelines combined with the latest data curation practices and the latest in-context learning techniques [1-8,10]. This makes our pipeline useful in both the exploration and prototyping phase and the deployment and enhancement phase. We make this high quality pre-trained checkpoint available to our customers. We also enable our customers to train their own custom models to this level of quality by providing the infrastructure, the tools, and the ML capabilities through simple APIs.
In the subsequent paragraphs, we discuss our approach towards training this GPT 13B model and the research papers that drove the critical decisions in our methodology around data collection, in-context learning, and low resource tuning.
Data Centric Approach to Pre-training
The recent theme of discussion within the NLP community has centered around data. Recent innovations in data practices have led to massive advances in how to extract the most out of a given corpus and push the abilities of LLMs. To train our GPT 13B model, we use several advanced data-centric approaches to create a high quality pre-training corpus, including mixing diverse datasets [1][2], data cleaning and deduplication [3][4], using structural signals from the data for weak supervision [5], and instruction-tuned and prompted datasets [6][7]. The methods, infrastructure, and know-how to create these high quality pre-training corpora can be made available to our customers upon request.
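As a toy illustration of one of the simpler building blocks in such a corpus pipeline, the sketch below removes exact duplicates by hashing whitespace-normalized documents and then mixes sources with integer weights. It is a much simplified stand-in for the near-duplicate detection in [4] and the dataset mixing in [1][2]; the sources and weights are illustrative.

```python
import hashlib

def dedup_exact(documents):
    """Drop exact duplicates by hashing whitespace-normalized text.
    A simplified stand-in for the near-duplicate detection described in [4]."""
    seen, unique = set(), []
    for doc in documents:
        key = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

# Toy usage: deduplicate each source, then mix with simple integer weights.
web = dedup_exact(["the cat sat", "the  cat  sat", "dogs bark"])
books = dedup_exact(["call me ishmael"])
corpus = web * 1 + books * 2  # repeat higher-quality sources more often
```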
Exploration, Prototyping, and In-Context Learning
In-context learning is a powerful technique for exploring new tasks: it allows rapid prototyping to determine whether LLMs can be deployed for a particular use case. In this setting, we show that the much smaller SambaNova GPT 13B model can achieve the same accuracy as the 175B model from OpenAI.
We use 15 benchmarks that are available in standard LLM evaluation suites. These include text generation, classification, question answering, sentiment analysis, and entailment. All are essential for powering a variety of real world use cases, and showcase how our services provide a clear benefit. The names of these benchmarks can be found in Figure 1.
In Figure 1, we compare our 13B GPT model with OpenAI's 175B numbers available in [9][a], as well as a selection of mainstream community checkpoints including Meta's OPT and BigScience's BLOOM. We use an in-context learning technique called "Ask Me Anything" (AMA) prompting [10] to generate the few-shot numbers for our model. Even with this advanced prompting technique, the cost of running the 13B GPT model is still less than that of the OpenAI 175B model. As we can see from the figure, on average, our 13B GPT model is at iso-accuracy with OpenAI's 175B model, while outperforming a variety of publicly available checkpoints.
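To give a feel for the general idea behind AMA prompting, the sketch below recasts a single classification input into several question-style prompts and aggregates the answers. `generate` is a placeholder for any text-completion call, the prompt templates are illustrative rather than the published AMA chains, and a plain majority vote stands in for AMA's weak-supervision aggregation [10].

```python
from collections import Counter

def ama_style_predict(generate, passage, claim):
    """Recast one input into several question-style prompts and aggregate the answers.
    `generate` is a placeholder for any text-completion call; templates are illustrative."""
    templates = [
        "Context: {p}\nQuestion: Is it true that {c}? Answer yes or no.\nAnswer:",
        "{p}\nBased on the passage above, would you say that {c}? (yes/no)\nAnswer:",
        "Passage: {p}\nClaim: {c}\nDoes the passage support the claim? Answer yes or no.\nAnswer:",
    ]
    votes = []
    for template in templates:
        completion = generate(template.format(p=passage, c=claim))
        votes.append("yes" if "yes" in completion.lower() else "no")
    return Counter(votes).most_common(1)[0][0]  # majority vote over prompt chains
```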
Deployment, Enhancement, and Low Resource Generative Fine-Tuning
While in-context learning allows one to rapidly prototype and understand the usefulness of an LLM for a particular task, eventually some amount of fine-tuning is required to deploy the model robustly [8] in enterprise scenarios. Labeled data is hard to find in an enterprise setting. As a result, this fine-tuning needs to happen with a very limited amount of labeled data.
To measure the impact of SambaNova's capabilities and its flexible training pipeline, we recreate common enterprise scenarios and test how well our 13B GPT model performs. Informed by interactions with our customers, we created low resource versions of benchmarks that closely mimic enterprise datasets and real world use cases. Specifically, we sample 100 labeled examples for each class in the benchmark and use them to train our models. Additionally, we use the flexibility of our pipeline to unlock the potential of the unlabeled data for a task via self learning[c]. The benchmarks and the associated results can be found in Table 1. We compare our 13B GPT model trained using our pipeline with OpenAI's 175B model fine-tuned using their latest fine-tuning API[b]. As seen in the table, our model consistently outperforms OpenAI's fine-tuned 175B model on every benchmark we evaluated.
Dataset | Representative Enterprise Use Cases | SambaNova 13B GPT | OpenAI Da-Vinci Fine-Tuning API |
---|---|---|---|
Finance NER | Document processing | 74.46 (F1) | 72.40 (F1) |
MNLI | Contract Review | 64.93 (acc) | 60.2 (acc) |
Dialogue Dataset | Call summarization | 68.29 (acc), 71.52 (weighted F1) | 61.11 (acc), 67.35 (weighted F1) |
Table 1: Fine-tuned results comparison using 100 labeled data for each task
The SambaNova differentiation
This blogpost demonstrates how the SambaNova platform can be used to create custom, state-of-the-art GPT models. This is achieved with a flexible pipeline that enables the most advanced techniques for pre-training, in-context learning, and generative tuning. These capabilities and the high quality checkpoints are both available for our customers to use. This enables them to build their own models, improve their models, or build on top of our checkpoints. In subsequent blog posts we will talk more about our results with instruction tuning, experience with self learning, and the breadth of our capabilities which allow you to explore a wide variety of community checkpoints.
——————
Acknowledgments
We thank Simran Arora, Avanika Narayan, and Laurel Orr from the Stanford Hazy Research Group for introducing Ask Me Anything (AMA) prompting for in-context learning through their publication, and for helpful discussions.
[a] We use OpenAI paper numbers for this work. We want to compare zero-shot and few-shot numbers for the benchmarks, which requires that the model has not seen data from those benchmarks during training. OpenAI's paper numbers are the only definitive numbers that adhere to this requirement.
[b] We used the OpenAI fine-tuning API (https://beta.openai.com/docs/guides/fine-tuning). Capabilities are similar to what was available as of 01/2023.
[c] We use a simple version of self learning where we fine-tune a GPT model on the 100 labeled examples and use the trained model to create labels for the unlabeled dataset. We have explored more advanced self learning pipelines and, while they help improve accuracy, we don't discuss them here for simplicity. We will talk about how to use our platform to do self learning in a subsequent blog post.
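As a minimal sketch of the simple self-learning recipe described in footnote [c], the snippet below tunes on the small labeled set, pseudo-labels the unlabeled pool with the tuned model, and tunes once more on the combined data; `fine_tune` and `predict` are hypothetical helpers standing in for whatever training and inference code is in use.

```python
def simple_self_learning(model, labeled, unlabeled):
    """Footnote [c] recipe: tune on ~100 labeled examples per class, pseudo-label the
    unlabeled pool with the tuned model, then tune again on the combined data.
    `fine_tune` and `predict` are hypothetical placeholders."""
    model = fine_tune(model, labeled)
    pseudo_labeled = [(x, predict(model, x)) for x in unlabeled]
    return fine_tune(model, labeled + pseudo_labeled)  # keep the human labels in the mix
```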
References
[1] The Pile: An 800GB Dataset of Diverse Text for Language Modeling (https://arxiv.org/abs/2101.00027)
[2] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (https://arxiv.org/abs/1910.10683)
[3] Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus (https://arxiv.org/pdf/2104.08758.pdf)
[4] Deduplicating Training Data Makes Language Models Better (https://arxiv.org/abs/2107.06499)
[5] reStructured Pre-training (https://arxiv.org/abs/2206.11147)
[6] Scaling Instruction-Finetuned Language Models (https://arxiv.org/pdf/2210.11416.pdf)
[7] Multitask Prompted Training Enables Zero-Shot Task Generalization (https://arxiv.org/pdf/2110.08207.pdf)
[9] Language Models are Few-Shot Learners (https://arxiv.org/abs/2005.14165)
[10] Ask Me Anything: A simple strategy for prompting language models (https://arxiv.org/abs/2210.02441)
Achieving Best-in-Class Large Language Model Accuracy in Low-Resource Settings
The opportunity: solving a range of language tasks using large language models with zero and few-shot learning
Recent advances in AI have allowed large language models (LLMs) to deliver very impressive capabilities with zero-shot and few-shot learning. These approaches enable impressive versatility, giving LLMs the potential to transform every aspect of an enterprise. As a result, we see a race in the LLM space to improve zero-shot and few-shot accuracy across a wide variety of tasks in order to deliver value.
The challenge: achieving consistent accuracy with zero-shot and few-shot learning
These capabilities, however, are not enough to ensure the adoption of LLMs in enterprise scenarios. Our work with our customers has shown that while zero-shot and few-shot capabilities showcase an LLM's potential to solve a downstream use case, the accuracy achieved with these capabilities is inconsistent. The accuracy can depend heavily on the prompt or examples used to generate the answer from an LLM. For example, [1] showed that for a binary sentiment analysis task (SST-2), depending on the examples used in few-shot learning, model accuracy can vary from random chance (54%) to near state of the art (94.3%). This inconsistency and variance is also well documented in academia [1][2][3].
Further fine-tuning the LLM with labeled data is an effective method to address this inconsistency. However, in real world situations, good quality task-specific labeled data is often fairly limited. Achieving not only state-of-the-art but also consistently good accuracy despite this low-resource limitation is the key to enabling enterprise adoption.
SambaNova’s approach towards enterprise enablement
To enable consistent accuracy in low-resource settings, SambaNova utilizes a 2-stage approach:
- Pre-training on a high quality, diverse dataset curated using a data-centric approach
- Implementing a tuning pipeline that combines the generalization capability of parameter efficient techniques in low resource settings with self training
This data-centric approach focuses on collecting a large corpus of diverse datasets from various sources, such as books, news articles, conversations, entertainment, Wikipedia, and research articles. These data sources vary not just by type but also by structure, ranging from completely unstructured text to prompted data and instructions derived via weak supervision. This data-centric approach enables more accurate results when fine-tuning in low resource settings.
One other consideration is that fine-tuning on a small dataset can be challenging and can sometimes lead to overfitting. To avoid this, SambaNova fine-tunes an LLM using a training methodology that incorporates practices like parameter-efficient tuning. Accuracy in a low resource setting can be further improved using advanced techniques like self training, a method to pseudo-label an unlabeled dataset. By combining self training with the right methodology, SambaNova ensures our models perform exceptionally well in low resource enterprise scenarios.
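One simple flavor of parameter-efficient tuning is to freeze the pretrained backbone and train only a small task head, which keeps the number of trainable parameters tiny and limits overfitting when only on the order of 100 labeled examples are available. The generic PyTorch sketch below illustrates that idea; it is not the Suite's tuning pipeline, and the backbone, hidden size, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FrozenBackboneClassifier(nn.Module):
    """Freeze a pretrained backbone and train only a small classification head,
    one of the simplest parameter-efficient setups for low resource fine-tuning."""
    def __init__(self, backbone, hidden_size, num_classes):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False              # backbone weights stay frozen
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        with torch.no_grad():
            features = self.backbone(x)          # assumes backbone returns [batch, hidden_size]
        return self.head(features)

# Only the head's parameters (a tiny fraction of the model) are given to the optimizer.
# model = FrozenBackboneClassifier(pretrained_encoder, hidden_size=4096, num_classes=2)
# optimizer = torch.optim.AdamW(model.head.parameters(), lr=1e-4)
```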
SambaNova’s differentiations on pre-trained GPT
To demonstrate the real world impact of this methodology, SambaNova applies this data-centric approach to a pre-trained and fine-tuned GPT model. We show that SambaNova's GPT model not only outperforms leading GPT offerings such as OpenAI's 175b parameter model[a], but does so with a model that is more than an order of magnitude smaller. We benchmark our smaller model against the much larger 175b model from OpenAI across 8 tasks spanning text generation, question answering, entailment, and text classification, in scenarios mimicking the low resource settings required by our enterprise customers. We observe that, on average, our product offering outperforms few-shot results from the OpenAI 175b model by a margin of 1.5%.
To further test the capabilities of our product, we also evaluated the same smaller model against a fine-tuned 175b model from OpenAI using their fine-tuning API[b][c]. We again operate in the low resource setting where we only have 100 labeled examples for each task. We observe that we consistently outperform the fine-tuned 175b model by anywhere from 1.2% to 8%. For a financial NER task, our model achieves an accuracy of 73.60%, while the fine-tuned 175b model achieves 72.40%. For an entailment task, our model achieves 53.89% while the 175b model achieves 44.58% accuracy.
These results show the extraordinary impact a data-centric approach and a well developed tuning pipeline can have on the quality of a model. Delivering highly accurate pre-trained LLMs is a core focus of SambaNova’s product strategy to help customers to get to value faster.
Conclusion
We show that with a model that is more than an order of magnitude smaller, the SambaNova GPT model outperforms the leading GPT-175b model in challenging low resource downstream settings, a common enterprise scenario where labeled data is hard to acquire. We achieve this milestone by taking a data-centric approach to model development and focusing on developing an advanced training pipeline.
Footnote
[a] We use the paper numbers from OpenAI to ensure that the numbers are truly zero-shot and few-shot. It's unclear whether the model behind OpenAI's current API has already incorporated that data in its training corpus.
[b] OpenAI’s fine-tuning API does not provide an option to select between davinci-002 or davinci-003, so it’s unclear which version of the 175b model it uses internally
[c] We use the default hyperparameters provided by the fine-tuning API
References
[1] Calibrate Before Use: Improving Few-Shot Performance of Language Models
[2] Multitask Prompted Training Enables Zero-Shot Task Generalization
[3] Do Prompt-Based Models Really Understand the Meaning of their Prompts?
Dataflow Architecture Leads to a Performance Breakthrough on GNN Fused Kernels
A Collaboration between SambaNova Systems and Argonne National Laboratory
Using the capabilities of the SambaNova DataScale® system, researchers at the U.S. Department of Energy’s Argonne National Laboratory and SambaNova partnered on a collaboration to advance state-of-the-art performance of Graph Neural Network (GNN) kernels.
GNNs, which learn from graph representations of non-Euclidean data, are rapidly rising in popularity and are used in several computationally demanding scientific applications. However, GNNs are known to suffer from hard memory and computational bottlenecks on traditional hardware platforms, due in part to their reliance on non-regular data structures.
In this collaborative research, we demonstrate that the dataflow architecture of the SambaNova Reconfigurable Dataflow Unit™ (RDU), combined with its significant on-chip memory, can provide better performance on GNNs. Additionally, as part of this collaboration, a novel performance evaluation methodology was used to demonstrate significant inference performance speedup for certain operations related to state-of-the-art GNNs on SambaNova's DataScale system.
The results show competitive performance for sparse operations related to GNNs without any explicit optimization, along with significant inference speedup for these operations. More specifically, the DataScale platform provides 2-3x the performance of a leading GPU chip competitor on index_select_reduce, a key building block of sparse operations such as sparse-dense matrix multiplication.
Automated Kernel Fusion Enabled by SambaNova’s Dataflow Architecture
GPUs and other conventional AI accelerators run ML applications kernel by kernel, which requires frequent on- and off-chip data transfer. In contrast, the SambaNova RDU provides a flexible dataflow execution model that pipelines operations, enables programmable data access patterns, and minimizes the excess data movement found in the fixed, core-based, instruction set architectures employed by GPUs.
As depicted in Figure 1a, kernel-by-kernel execution on a GPU executes one operation at a time and materializes the entire intermediate result off-chip, which requires high off-chip bandwidth for performance. In contrast, dataflow execution on an RDU (Figure 1b) enables high computation resource utilization and achieves the effect of kernel fusion, minimizing memory transfer costs and removing the need to create custom one-off fused kernels, which would otherwise require a specialized team of engineers.
Machine learning dataflow graphs are spatially compiled to the RDU. In this spatial mapping, communication between the graph nodes stays on chip and is efficiently handled by the on-chip interconnect compared to traditional architectures. The impact is a dramatic reduction in off-chip bandwidth compared to kernel-by-kernel execution.

(a) Kernel-by-kernel execution for GPUs

(b) Dataflow execution for RDU
Figure 1: Dataflow/architecture comparison of GPU and SambaNova RDU
Additionally, SambaFlow, the compiler for DataScale, captures the ML application as a compute graph with parallel patterns and explicit memories, and systematically lowers the graph to a logical dataflow pipeline that exploits data, task, and hierarchical pipeline parallelism. Figure 1b shows dataflow execution on an RDU that concurrently executes multiple kernels of the same GNN model as a dataflow pipeline. Intermediate results are produced and consumed entirely on-chip, which lowers off-chip bandwidth requirements. This dataflow architecture has the potential to greatly accelerate development of GNN foundation models.
(a) AddMM
(b) MatMul
(c) Index_add_reduce, and Index_select_reduce
Figure 2: RDU Performance comparison with A100 of various GNN operations
As depicted in Figures 2a and 2b, the RDU shows better performance than an A100 on basic GNN arithmetic operations. For example, matmul shows comparatively better performance, especially for larger input sizes (with similar FLOPS), mainly due to the RDU's ample on-chip memory and the SambaFlow compiler's good tiling support. As depicted in Figure 2c, the RDU provides a 2-3x performance boost over the A100 on index_select_reduce. This is because the RDU's on-chip memory bandwidth (150TB/s) is significantly higher than the A100's off-chip high bandwidth memory (HBM) bandwidth (2TB/s). The performance of the RDU on index_add_reduce is similar, with a 30-50% performance boost depending on the input size.
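To make the benchmarked operation concrete, here is a small PyTorch sketch of the gather/scatter pattern that index_select_reduce refers to in this post: an index_select over node features followed by an index_add reduction, as used in sparse GNN aggregation. The function name and toy graph are illustrative; this is not the kernel implementation benchmarked on either platform.

```python
import torch

def index_select_reduce(node_feats, src_idx, dst_idx, num_nodes):
    """Gather source-node features along edges and sum them into destination nodes.
    This gather + scatter-add pattern underlies sparse aggregation in GNN layers;
    it moves many bytes per arithmetic operation, so memory bandwidth dominates."""
    messages = torch.index_select(node_feats, 0, src_idx)   # gather: one row per edge
    out = torch.zeros(num_nodes, node_feats.shape[1], dtype=node_feats.dtype)
    out.index_add_(0, dst_idx, messages)                    # reduce: sum into destinations
    return out

# Toy graph: 4 nodes, 3 edges (0->1, 2->1, 3->0), 8-dimensional node features.
feats = torch.randn(4, 8)
src = torch.tensor([0, 2, 3])
dst = torch.tensor([1, 1, 0])
aggregated = index_select_reduce(feats, src, dst, num_nodes=4)
```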
Novel Profiling Framework
One of the main challenges with deep neural networks in production is meeting the low latency requirements of downstream workflows. Many domains require low-latency inference on user data, often at very small batch sizes. However, as the number of parameters in state-of-the-art architectures continues to increase, inference runtime can become progressively problematic.
As mentioned previously, emerging AI accelerators such as the RDU provide specific optimizations for the runtime performance of deep learning models such as GNNs. Performance can be improved even further with additional optimization, which can require substantial engineering effort. Thus, a method for assessing specific applications can help hardware developers gauge whether developing support for specific models is worthwhile, and can help application developers gauge the promise of AI accelerators for different use cases.
In this collaboration, Argonne addresses the issues described above by introducing a microbenchmark framework that facilitates comprehensive and equitable performance comparisons of portions of end-to-end deep learning architectures on emerging hardware platforms. This framework can also be used to assess performance bottlenecks and inform future optimizations at the hardware, system software, and algorithm levels.
Conclusion
Due in part to systems and hardware features (i.e. dataflow architecture and smart compiler support) that result in automatic kernel fusion and memory optimization, RDU platforms present an exciting opportunity to accelerate the inference of GNNs compared to existing, well established systems such as A100 GPUs.
These capabilities present exciting potential for the development of GNN foundation models. As part of the continuing collaboration, Argonne and SambaNova are developing pre-trained GNN foundation models that are deployed at scale to serve downstream tasks with better performance and accuracy.
Acknowledgements
This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.