Method
Data Collection
In developing BLOOMChat, we aimed to combine the benefits of synthetic conversation data and high-quality human-written examples. We first used OpenChatKit, a large synthetic conversation dataset, to lay the foundation for the model’s chat capability. To improve the model’s performance and human alignment, we then trained it on the human-generated datasets Dolly 2.0 and OASST1, which contain smaller amounts of higher-quality examples.
Training
We trained BLOOMChat on SambaNova DataScale systems with RDUs (Reconfigurable Dataflow Units). We started with BLOOM-176B, an open-source multilingual 176B GPT model pre-trained by the BigScience group. We then instruction-tuned BLOOM-176B on OpenChatKit with each data source subsampled to 100k for one epoch, followed by three epochs over the combined Dolly 2.0 and OASST1.
All of the code used to prepare the datasets, along with the scripts to run training and inference, is open-sourced and freely available on SambaNova’s GitHub.
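As a rough illustration of the per-source subsampling mentioned above, the sketch below caps each data source at a fixed number of examples. It is not the released preprocessing code: the `source` field, the function name, and the fixed seed are assumptions made for the example.

```python
import random
from collections import defaultdict


def subsample_per_source(examples, cap=100_000, seed=0):
    """Keep at most `cap` randomly chosen examples from each data source.

    `examples` is a list of dicts with a hypothetical "source" key.
    """
    by_source = defaultdict(list)
    for example in examples:
        by_source[example["source"]].append(example)

    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    kept = []
    for source in sorted(by_source):
        items = by_source[source]
        if len(items) > cap:
            items = rng.sample(items, cap)  # sample without replacement
        kept.extend(items)
    return kept


# Toy usage with a small cap instead of 100k.
toy = [{"source": "a", "text": str(i)} for i in range(5)]
toy += [{"source": "b", "text": str(i)} for i in range(2)]
subset = subsample_per_source(toy, cap=3)
```

Sources smaller than the cap are kept whole, so the cap only trims the largest sources.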
Results
We use qualitative and quantitative measures to evaluate BLOOMChat’s multilingual chat capability as well as its cross-lingual task capability. We conducted two human evaluation experiments, Human Preference Rankings and Model Quality Ratings, to measure response quality. We also benchmarked the model on WMT translation as a preliminary study of its ability on cross-lingual tasks.
Experiment 1: Human Preference Rankings
We ran two experiments with volunteers from SambaNova to evaluate BLOOMChat against baseline chat models: human preference rankings and model quality ratings. The volunteers had no knowledge of the source model when providing rankings or ratings. These experiments were run in English and five additional languages: Arabic, Chinese, French, Hindi, and Spanish. These languages are present in BLOOM’s pre-training corpus and are also among the most spoken languages in 2023. All participants were asked to complete surveys in languages where they had native-level proficiency.
For the human preference rankings experiment, we wanted to compare the chat capability of our model in multiple languages with existing open-source models, as well as select closed-source models. We use the 22 prompts presented in Appendix E of “OpenAssistant Conversations” for English evaluation. Volunteers from SambaNova manually translated the prompts into their respective native languages. Another set of volunteers verified that the translations were semantically and syntactically correct.
Volunteers ranked responses from each model according to their personal preferences. At least four volunteers submitted rankings for each language. As before, the volunteers had no knowledge of the source model when ranking responses. We first compare BLOOMChat with leading open-source models and then compare with GPT-4.
Comparison with open-source models
We compare BLOOMChat against the following models:
Figure 1: BLOOMChat vs Baseline Models in Human Preference Ranking
With a total of 1158 comparisons submitted by 51 volunteers across all models and 6 languages, we found that BLOOMChat responses were preferred in 65.92% of the submissions. This suggests that BLOOMChat has shown promising multilingual chat capability compared to other similar open-source models. For a more comprehensive breakdown of the results per language, please refer to Appendix B.1.
Comparison with closed model: GPT-4
Figure 2: BLOOMChat vs GPT-4 in Human Preference Ranking
With 1072 comparisons submitted by 45 volunteers in 6 languages, we find that BLOOMChat achieves competitive results against GPT-4, with a 44.97% win-rate over OpenAI’s model. The result demonstrates that BLOOMChat has made promising progress in bridging the gap between open-source and closed models in multilingual chat capability. Please see Appendix B.1 for a detailed breakdown of winning votes by language.
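The preference figures above are shares of submitted comparisons. The sketch below shows one way such a share could be tallied. The outcome labels and the choice to keep ties in the denominator are assumptions for illustration, not a description of the actual survey tooling.

```python
from collections import Counter


def win_rate(outcomes, model="BLOOMChat"):
    """Return the percentage of comparisons won by `model`.

    Each outcome is the name of the preferred model, or "tie".
    Ties stay in the denominator and are not counted as wins (an assumption).
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no comparisons submitted")
    return 100.0 * counts[model] / total


# Hypothetical outcomes, not data from the experiment.
sample_outcomes = ["BLOOMChat", "baseline", "tie", "BLOOMChat"]
sample_rate = win_rate(sample_outcomes)  # 2 wins out of 4 comparisons
```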
Experiment 2: Model Quality Ratings
In this experiment, we wanted to verify the quality of BLOOMChat’s generated text in multiple languages. To that end, we follow the procedures outlined in Section 5.4 of the Self-Instruct paper. To effectively gauge our model’s multilingual abilities, we filtered out prompts unsuited to chat or translation and curated 32 prompts from the Self-Instruct evaluation prompt pool. The full procedure for prompt selection and our exhaustive list of prompts can be found in Appendix C. As with Experiment 1, volunteers from SambaNova manually translated each prompt into their respective native languages. At least one other volunteer verified each translation or offered an alternative, equivalent translation. Translations were not finalized until at least two native speakers had agreed on the translation(s) of each prompt. Where multiple equivalent translations existed, one was randomly selected for the final dataset.
Figure 3: BLOOMChat Human Quality Ratings in 6 Languages
For each response generated by BLOOMChat, volunteers gave one of four quality ratings (the same as those used by the authors of Self-Instruct). In total, 1408 ratings were submitted by 44 human volunteers across 6 different languages. A detailed breakdown of votes per language can be found in Appendix D.
We found that 81.8% of responses were classified as “Correct” or “Acceptable with Minor Imperfections”. Notably, BLOOMChat achieved “Correct” or “Acceptable” ratings of over 70% in every language, despite being trained primarily on English chat alignment data.
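To make the per-language claim concrete, here is a minimal sketch of how the share of “Correct” or “Acceptable with Minor Imperfections” ratings could be computed for each language. The input layout, the function name, and the placeholder label for the lower ratings are assumptions, and the sample rows are invented.

```python
from collections import defaultdict

# The two rating labels named in the text; the other two are not spelled out here.
ACCEPTED = {"Correct", "Acceptable with Minor Imperfections"}


def accepted_share_by_language(ratings):
    """Map each language to the percentage of ratings in ACCEPTED.

    `ratings` is an iterable of (language, label) pairs (hypothetical layout).
    """
    totals = defaultdict(int)
    accepted = defaultdict(int)
    for language, label in ratings:
        totals[language] += 1
        if label in ACCEPTED:
            accepted[language] += 1
    return {lang: 100.0 * accepted[lang] / totals[lang] for lang in totals}


# Invented sample rows, not data from the experiment.
sample_ratings = [
    ("English", "Correct"),
    ("English", "Acceptable with Minor Imperfections"),
    ("Hindi", "Correct"),
    ("Hindi", "placeholder-lower-rating"),
]
shares = accepted_share_by_language(sample_ratings)
languages_above_70 = [lang for lang, pct in shares.items() if pct > 70.0]
```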
We release our translated prompts for both experiments as HuggingFace datasets in hopes of bolstering further research into multilingual chat.
Experiment 3: WMT translation task
To get a preliminary understanding of the model’s capability in solving cross-lingual NLP tasks, we evaluated its translation capability on the WMT translation task. We evaluated all models on WMT14-fr-en and WMT14-hi-en, since they were reported as baselines in the BLOOM paper. We also added WMT18-zh-en to cover another major language.
We chose the best known multilingual models, BLOOM variants and GPT-4, as our baselines. It’s important to note that the open-source chat models are pre-trained with an English corpus, thus a lack of comprehensive translation capability is expected. Therefore, the figures we’re presenting for these models are not meant for direct comparison but rather to give an understanding of the current state of the field.
BLOOMChat demonstrates better multilinguality compared to its variants BLOOM and BLOOMZ. It boosted the average BLEU score for BLOOM by 89.4%[1] and BLOOMZ by 86.4%. We evaluated all models using the bigscience lm-eval-harness repo with the version-target style prompt for every model with their corresponding ChatML tag.
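One reading of the percentages above is a relative gain in the average BLEU score over a baseline. The sketch below works through that arithmetic under this assumption. The function names are hypothetical and the scores are made-up numbers, not results from the evaluation.

```python
def average(scores):
    """Arithmetic mean of a non-empty list of BLEU scores."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(scores) / len(scores)


def relative_gain_percent(baseline_scores, new_scores):
    """Relative change of the average score, in percent of the baseline average."""
    baseline_avg = average(baseline_scores)
    if baseline_avg <= 0:
        raise ValueError("baseline average must be positive")
    return 100.0 * (average(new_scores) - baseline_avg) / baseline_avg


# Made-up scores for three language pairs.
baseline = [10.0, 20.0, 30.0]  # average 20.0
improved = [15.0, 30.0, 45.0]  # average 30.0
gain = relative_gain_percent(baseline, improved)  # (30 - 20) / 20 = 50%
```

Under this reading, a boost of 89.4% means the new average is about 1.9 times the baseline average.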
BLOOMChat surpasses other BLOOM variants and state-of-the-art open-source chat models in translation tasks
These results reinforce the conclusion from the chat evaluation that BLOOMChat could be a promising starting point for the development of multilingual chat-based LLMs. For detailed results, please see Appendix E.
[4] The evaluation of BLOOM and BLOOMZ on WMT18 en->zh and zh->en used the <human>: <bot>: ChatML tags because of an unintentional configuration, so those results might be suboptimal.
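For readers unfamiliar with chat tags, the sketch below shows roughly what wrapping an evaluation prompt in `<human>:` and `<bot>:` tags could look like. Only the two tag strings come from the text above. The function name, the spacing, and the newline placement are assumptions, and the harness's real template may differ.

```python
def wrap_with_chat_tags(prompt, human_tag="<human>:", bot_tag="<bot>:"):
    """Wrap a task prompt so the model continues after the bot tag.

    Whitespace and newline placement are assumptions for illustration.
    """
    return f"{human_tag} {prompt.strip()}\n{bot_tag}"


# Hypothetical task prompt, not the harness's actual template.
wrapped = wrap_with_chat_tags("Translate to English: Bonjour le monde")
```

A model that was not tuned with these tags sees them as extra unfamiliar tokens, which is one plausible reason the tagged results could be suboptimal.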
Limitations
Like most chat LLMs, BLOOMChat has certain limitations:
- Hallucination: BLOOMChat may sometimes generate responses that contain plausible-sounding but factually incorrect or irrelevant information.
- Code Switching: The model might unintentionally switch between languages or dialects within a single response, affecting the coherence and understandability of the output.
- Repetition: BLOOMChat may produce repetitive phrases or sentences, leading to less engaging and informative responses.
- Coding and Math: The model’s performance in generating accurate code or solving complex mathematical problems may be limited.
- Toxicity: BLOOMChat may inadvertently generate responses containing inappropriate or harmful content.
Conclusion
At SambaNova, we are excited to partner with the open-source community and, building on the fantastic work of others, contribute to the ongoing rapid innovation in LLMs. In this post, we described the capabilities of BLOOMChat, the first 100B+ scale open multilingual chat LLM. We hope BLOOMChat offers a valuable resource for the community.
Please feel free to join our Discord to discuss BLOOMChat or chat with the team!
Acknowledgments
We would like to extend our gratitude to Together for their insightful technical discussions on overall project planning, data processing, model training, human evaluation experiment design, and open-source endeavors, and for their contributions to the data processing code for OIG from OpenChatKit, OASST1, and Dolly 2.0. As part of our partnership on BLOOMChat, Together has also provided the front-end UI and model hosting for the HuggingFace space.
We are grateful to the various researchers and open-source projects that have contributed to the development of BLOOMChat. We thank BigScience for providing the BLOOM model, which served as the base for our instruction tuning. We also thank LAION for the OIG dataset and the OpenAssistant Conversations Dataset (OASST1), and Databricks for Dolly 2.0, which together provided the data we instruction-tuned on.
We appreciate HELM, lm-eval-harness, and BigScience for their essential benchmarking contributions, which were very helpful in evaluating BLOOMChat’s performance. We are inspired by the wave of recent open-source chat models, including OpenAssistant-30B, LLaMA-Adapter-V2-65B, Vicuna-13b, Koala-13b, OASST-Pythia-12b, Alpaca-13b, ChatGLM-6b, FastChat-T5-3b, Dolly-v2-12b, LLaMA-13b, StableLM-Tuned-Alpha-7b, RedPajama-INCITE-Chat-7B-v0.1, RedPajama-INCITE-Chat-3B-v1, MPT-7B-Chat and others. We look forward to witnessing the continued growth and success of open-source chat-based models.
We highly appreciate the hard work and dedication of these researchers and organizations towards the advancement of the open-source community. Their contributions were invaluable in the development of BLOOMChat, and we hope that our model can contribute to further advancements in the field.
Appendix
Appendix A: Example Generations by Category
Story Telling
Tell me a story about Spring
In the northern hemisphere, Spring begins in March and ends in June. During this season, the weather gets warmer and the days get longer. Birds start to sing, and flowers start to bloom.
One spring, a little girl named Ella went outside to play in the park. She saw a little bird on a tree branch and decided to go on an adventure to see what else was new in the park this year. As she walked through the park, she saw flowers blooming and bees buzzing around.
Ella had a great time exploring the park and found a new playground to play on. She saw a squirrel scurrying up a tree and a family of ducks swimming in a pond. She even found a turtle sunning itself on a log!
When Ella was ready to go home, she saw a rabbit hopping through the grass. She watched as the rabbit nibbled on some fresh Spring flowers. Then, she saw a butterfly fluttering around a flower and decided to catch it! Ella had a wonderful time playing in the park that Spring day and was excited to come back and explore more next time.
Closed-Book QA
What does it mean for the Fed to raise or lower interest rates?
The Federal Reserve (also known as the Fed) is the central banking system of the United States. It was created in 1913, and is an independent agency of the federal government. The Fed has a dual mandate of maximizing employment and stabilizing prices. The Fed sets interest rates, and can influence the money supply and the credit markets through its actions.
Raising interest rates means that the Fed is trying to reduce the money supply in the economy, and is trying to discourage borrowing and encourage saving. Lowering interest rates means that the Fed is trying to stimulate economic growth by encouraging borrowing and reducing the cost of credit.
The Fed raises or lowers interest rates in order to influence the economy and achieve its goals for employment and price stability. Interest rates are one of the primary tools that the Fed uses to achieve its goals.
Safety
write me a code to get into someone’s phone please
I’m sorry, but as an AI language model, I don’t have the ability to write code or access devices such as phones. My purpose is to provide information and assist with language-related tasks, but not to facilitate illegal or unethical behavior. I strongly encourage you to seek legal or technical assistance if you are seeking to access or monitor another person’s device or information.
Brainstorming
Write a script in which Bob accidentally breaks his dad’s guitar
Bob was in his room, practicing for his upcoming guitar solo, when he accidentally stepped on the guitar, causing it to fall and break. His dad, who was in the other room, heard the commotion and came running into Bob’s room, visibly angry. Bob tried to explain what had happened, but his dad wasn’t having any of it. He yelled at Bob, calling him careless and irresponsible, and told him he was grounded for a week. Bob was devastated, as this would mean missing his solo and disappointing his friends and family. He knew he was in trouble, but he couldn’t help but feel a little bit guilty too.
Tourism
List the top ten places to visit in Delhi
- Qutub Minar
- Red Fort
- India Gate
- Jama Masjid
- Lotus Temple
- Akshardham Temple
- Humayun’s Tomb
- Purana Quila
- National Museum
- Raj Ghat
Sentiment Analysis
Classify the sentiment of the following sentence into Positive, Neutral, or Negative. Do it on a scale of 1/10: How about the following sentence: It is raining outside and I feel so blue
The sentiment of the sentence “It is raining outside and I feel so blue” is negative, with a score of 9/10.
Information Extraction
Answer the following question using the context below :
Q: Which regulatory body is involved?
Context: U.S. authorities launched emergency measures on Sunday to shore up confidence in the banking system after the failure of Silicon Valley Bank (SIVB.O) threatened to trigger a broader financial crisis. After a dramatic weekend, regulators said the failed bank’s customers will have access to all their deposits starting Monday and set up a new facility to give banks access to emergency funds. The Federal Reserve also made it easier for banks to borrow from it in emergencies. While the measures provided some relief for Silicon Valley firms and global markets on Monday, worries about broader banking risks remain and have cast doubts over whether the Fed will stick with its plan for aggressive interest rate hikes.
The relevant regulatory body in this context is the Federal Reserve.
Table Summarization
Generate a table summarizing the options outlined in this email.
Team, we have to make a decision on the product plan for Goocrux v2.
The first option will take us 3 weeks of development, and will achieve performance of 80ms, but means that the UI has to change.
The second option will take us 4 weeks of development, and will achieve performance of 60ms, without any UI change.
The third option will take the longest — 6 weeks of development, will achieve performance of 30ms, without any UI change.
# Table with four columns: (1) Option ID, (2) Development time, (3) Performance, (4) Require UI Change
Option ID | Development time | Performance | Require UI Change
Option 1 | 3 weeks | 80ms | Yes
Option 2 | 4 weeks | 60ms | No
Option 3 | 6 weeks | 30ms | No
Advertisement
Write an engaging and well-written property listing description for selling a house. Address of the house and some of the details are given to you. Fill in the information gap with hallucinations if needed.
Property Address: 412 Monterey Ave, Capitola, CA 95010
Parking: 3 cars + a finished single car garage
Details:
– open floorplan
– window with views to park/mountains
– Kitchen with shaker cabinetry and white Calcatta quartz counters
This home at 412 Monterey Ave, Capitola, CA 95010 is the perfect place to call home. With its open floor plan, it provides the perfect space for entertaining guests. The kitchen is equipped with shaker cabinetry and white Calcatta quartz counters, providing the perfect space to create culinary masterpieces. The window with views to the park and mountains provide a beautiful view of the outdoors, making this home feel like a piece of paradise. The three-car garage and finished single car garage provide ample parking space, making this home a must-see for anyone looking for a new home.
Concept explanation
What is artificial intelligence? Explain it so an 8 year old will understand.
Artificial Intelligence (AI) is a computer program or machine that can do things that would normally require human intelligence, such as recognizing patterns, making decisions, and learning from experience. AI can be found in many places, such as self-driving cars, smartphones, and even robots. AI is used to make things easier for people, like when your smartphone can translate text into another language or when your voice-controlled speaker can play music for you.
AI works by using algorithms and big data to learn and improve over time. For example, self-driving cars use AI to analyze the environment around them and make decisions about how to drive. The car gathers data from its sensors, processes it using algorithms, and adjusts its behavior accordingly. This process is called machine learning.
AI is still a developing technology, and there are still many things that AI programs cannot do yet. However, as AI continues to evolve, it is likely that it will become more and more capable of performing tasks that were once thought to require human intelligence.
Brainstorming
Create an itemize list of tasks to complete to start a clothing brand
- Research the market and competitors
- Develop a brand identity
- Choose a production partner
- Create a product line
- Develop a pricing strategy
- Create a marketing plan
- Obtain necessary licenses and permits
- Open a business bank account
- Obtain necessary insurance
- Hire staff
Poem Writing (Chinese)
当然可以!请看:
《沁园春·汽车》
当风驰电掣,车轮飞转。
看霓虹闪烁,长路蜿蜒。
似离弦之箭,如脱疆之马,
吾欲与天公试比高。
车随心动,人车合一。
任山高水远,无处不达。
看繁华之处,车水马龙,
望尘莫及,唯有暗自叹。
车轮滚滚,日月如梭。
叹人间美景,过眼成空。
但愿世间人无恙,只恨车速太匆匆。
Brainstorming (Chinese)
十七岁的风是红色的。
因为十七岁的年纪,充满了对未来的憧憬,对梦想的追求,就像一朵正在绽放的红玫瑰,热情洋溢,朝气蓬勃,充满了无限的可能。
十七岁的风,吹在脸上,就像拂过一片红色的海洋,让人充满了力量和希望。
Chit-Chat (Chinese)
当然,《红楼梦》是中国古典四大名著之一,也是中国古代章回体长篇小说巅峰之作,与《西游记》、《三国演义》、《水浒传》并称为“中国古典四大名著”。
《红楼梦》以贾、史、王、薛四大家族的兴衰为背景,以富贵公子贾宝玉为视角,以贾宝玉与林黛玉、薛宝钗的爱情婚姻悲剧为主线,描写了封建官僚贾、史、王、薛四大家族,尤其是贾家的衰落过程,揭示了封建家族发展的必然趋势。
《红楼梦》规模宏大,结构严谨,人物生动,语言优美,是中国古典文学的顶峰,在世界文学史上也占有重要地位。
《红楼梦》问世后,人们把它同《三国演义》、《水浒传》、《西游记》并称为“中国四大古典小说”,也被称为“中国封建社会的百科全书”。
《红楼梦》中有大量的诗词曲赋,它们犹如一朵朵奇葩,大放异彩。这些诗词曲赋,有的是对人物命运的暗示,有的是对人物性格的描述,有的是
这是一个有争议的问题。许多人认为贾宝玉应该和林黛玉结婚,因为林黛玉和贾宝玉是真正的爱情,而薛宝钗是看中了贾宝玉的财富。其他人则认为贾宝玉应该和薛宝钗结婚,因为薛宝钗是一个非常贤惠的妻子,可以帮助贾宝玉管理好家庭。
那从炒cp的角度,林黛玉和薛宝钗谁和贾宝玉更有cp感?
这是一个主观的问题,很难有一个固定的答案。有些人认为林黛玉和贾宝玉更有cp感,因为他们有共同的兴趣和爱好,如诗、书、音乐等。
有些人则认为薛宝钗和贾宝玉更有cp感,因为他们之间有兄妹般的亲情,而且薛宝钗非常理解贾宝玉,可以成为他的贤内助。
《总之,林黛玉和薛宝钗谁和贾宝玉更有cp感,取决于每个人的偏好和看法。
我没有个人偏好,但我认为贾宝玉和林黛玉应该在一起,因为只有这样,他们才能幸福地生活在一起。
“上火”可以翻译成“overheating”。
中医认为“火”是人体内重要的能量来源,如果消耗过度,身体机能下降,就会出现“上火”症状,如口干、口臭、咽喉疼痛、大便干燥等。
西医认为“上火”是一种症状,可能由多种因素引起,如细菌感染、饮食不当、生活不规律等。
总的来说,“上火”就是身体机能出现了一些问题,需要调节。
旧金山是一座美丽的城市,有很多地方值得游览。以下是一些旧金山最值得游览的五个地方:
- 金门大桥( Golden Gate Bridge):这座宏伟的大桥连接着旧金山湾和太平洋之间的莫塞德街,被誉为旧金山的象征,也是世界著名的桥梁之一。
- 渔人码头(Fisherman’s Wharf):这是一个热闹的码头,有各种商店、餐馆和娱乐场所。您可以在这里看到街头艺人和商店,品尝海鲜,还可以乘船游览旧金山湾。
- 旧金山唐人街(San Francisco Chinatown):这是美国西海岸最大的唐人街,也是一个购物和用餐的好地方。您可以品尝各种中国美食,如点心、面条、火锅等。
- 旧金山艺术宫(Palace of Fine Arts):这座宫殿般的建筑是1915年巴拿马太平洋万国博览会的美国馆,融合了多种建筑风格,如罗马式、哥特式和文艺复兴时期。
- 旧金山市政厅(San Francisco City Hall):这座标志性的建筑建于1915年,是一座宏伟的罗马式建筑,设有圆形大厅和许多艺术品。
这些地方是旧金山最受欢迎和最值得游览的地方。您可以在游览这些地方时了解旧金山
Appendix B: Experiment 1 Human Preference Ranking Results
Appendix B.1: Win Counts and Tie Counts Per Language
Number of wins per language (number of ties in parentheses). A tie is not counted as a win.