BLOOMChat-v2 Long Sequences at 176B

by SambaNova ML Engineering

February 7, 2024

We are proud to release BLOOMChat-v2, a 32K sequence length, 176B multilingual language model trained on top of BigScience’s BLOOM [1] model. It is the largest (by parameter count) open-source model that can be run with 32,768-length sequences. In this blogpost, we present in-depth training details, as well as extensive evaluations and comparisons to the base checkpoints. We show that BLOOMChat-v2 achieves up to 170% improvement over BLOOM on SCROLLS [9], a long-sequence benchmark. BLOOMChat-v2 was trained on RDU [11] with mixed-precision bfloat16.

BLOOMChat-v2 model card: https://huggingface.co/sambanovasystems/BLOOMChat-176B-v2

For example training and inference scripts, check out our BLOOMChat GitHub repository: https://github.com/sambanova/bloomchat

If you have any further questions, feel free to ask via our public Discord server!

Finetuning Details

BLOOMChat-v2 followed a 3-stage finetuning procedure:

1. Long sequence pretraining
2. Instruction tuning
3. Alignment

Stage 1 serves to extend the model’s native maximum sequence length beyond 2048. Stages 2 and 3 follow a similar protocol to BLOOMChat-v1 (see blogpost here), and serve as chat-alignment steps. For all stages, we pack data into sequences of 8192.
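As a rough illustration of what packing into fixed-length sequences can look like, here is a minimal sketch. It assumes a greedy scheme in which tokenized documents are concatenated with an end-of-sequence separator and cut into 8192-token chunks; the post does not specify the exact packing algorithm, and the function name, the `eos_id` value, and the choice to drop the trailing remainder are all hypothetical.

```python
from typing import Iterable, Iterator, List


def pack_sequences(
    tokenized_docs: Iterable[List[int]],
    seq_len: int = 8192,
    eos_id: int = 2,  # hypothetical separator id; use your tokenizer's EOS id
) -> Iterator[List[int]]:
    """Greedily concatenate documents and emit fixed-length chunks."""
    buffer: List[int] = []
    for doc in tokenized_docs:
        buffer.extend(doc)
        buffer.append(eos_id)  # mark the document boundary
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    # Assumption: a trailing partial chunk is dropped rather than padded.


packed = list(pack_sequences([[1, 2, 3], [4, 5], [6]], seq_len=4, eos_id=0))
```

With this scheme a long document can straddle two packed sequences, which keeps every training sequence full at the cost of splitting some documents.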

STAGE 1: LONG SEQUENCE PRETRAINING

In stage 1, we continuously pretrained BLOOM on approximately 30B tokens of multilingual data packed into sequences of length 8192. We refer to the resulting model after stage 1 as BLOOM-LSS, which we evaluate on SCROLLS below.

The stage 1 data was a mixture of crawled web data and curated high-quality text that covered a subset of the languages in ROOTS [3], BLOOM’s original pretraining dataset. For further details on the makeup of the long sequence dataset mixture and on dataset preprocessing, see Appendix A; for training hyperparameters, see Appendix B.

STAGE 2: INSTRUCTION TUNING

In stage 2, we train BLOOM-LSS on the OIG dataset for 1 epoch. Inputs are packed into sequences of length 8192, and we apply completion-only loss masking. We also apply the <human>: and <bot>: chat templates to the input and completion texts respectively.
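To make completion-only loss masking concrete, the sketch below builds one training example. It is an illustration under stated assumptions, not the original data preparation code: the newline between the input and the <bot>: tag, the `build_example` helper, and the toy character-level tokenizer are hypothetical. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so positions labeled with it contribute nothing to the loss.

```python
from typing import Callable, List, Tuple

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_example(
    input_text: str,
    completion_text: str,
    tokenize: Callable[[str], List[int]],
) -> Tuple[List[int], List[int]]:
    """Return (input_ids, labels) with the prompt tokens masked out."""
    # Assumed template: tags as described in the post, newline is a guess.
    prompt = f"<human>: {input_text}\n<bot>: "
    prompt_ids = tokenize(prompt)
    completion_ids = tokenize(completion_text)
    input_ids = prompt_ids + completion_ids
    # Only completion tokens carry a real label; prompt tokens are ignored.
    labels = [IGNORE_INDEX] * len(prompt_ids) + completion_ids
    return input_ids, labels


def toy_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer (one id per character) to keep the sketch runnable."""
    return [ord(ch) for ch in text]


input_ids, labels = build_example("Hello", "Hi!", toy_tokenize)
```

The model still attends to the full prompt; masking only changes which positions are scored.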

See Appendix C for further training hyperparameters.

STAGE 3: ALIGNMENT

In stage 3, we train on the combined databricks-dolly-15k and oasst1 datasets for 16 epochs. As in the previous stage, inputs are packed into sequences of length 8192, we apply completion-only loss masking, and we apply the <human>: and <bot>: chat templates. See Appendix C for further training hyperparameters.

BLOOMChat-v2 is the resulting model after stage 3.

PRECISION

All 3 stages were run in mixed-precision bfloat16 with float32 master weights on RDU. We ensured that ALiBi [8] and multi-head attention operations were performed in FP32. These choices are motivated by the observations in our previous blogpost on the relationship between ALiBi and lower precision data formats. Intuitively, we want to reap the throughput benefits of training in lower precision, whilst maintaining the integrity of training on long sequences.

To verify our intuitions, we perform a small-scale ablation where we continuously pretrain BLOOM-560M (the smallest variant) on the same long-sequence dataset as in stage 1. We run two experiments: one entirely in BF16, and the other with multi-head attention and ALiBi in FP32 (with the rest of the model’s operations in BF16). In the plots below, we call these models bf16-MHA and fp32-MHA respectively.

Figure 1: BLOOM-560M long-sequence perplexity by sequence position.

In Figure 1, we plot perplexity on a held-out validation dataset, with sequences filtered to be of length 4000 or greater. The x-axis denotes sequence position, and the y-axis denotes the average perplexity of that position across the held-out dataset.
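A per-position perplexity curve of this kind can be computed from per-token negative log-likelihoods. The sketch below is one plausible way to do it and rests on assumptions: it takes the exponential of the mean NLL at each position (the post does not say whether perplexities or log-likelihoods were averaged), and the function name and input layout are hypothetical.

```python
from typing import List, Sequence

import numpy as np


def positionwise_perplexity(
    nll_per_doc: Sequence[Sequence[float]],
    min_len: int = 4000,
) -> np.ndarray:
    """exp(mean NLL) at each position, over documents of at least min_len tokens."""
    kept: List[np.ndarray] = [
        np.asarray(nll, dtype=np.float64)
        for nll in nll_per_doc
        if len(nll) >= min_len
    ]
    if not kept:
        raise ValueError("no document is long enough")
    max_len = max(len(nll) for nll in kept)
    total = np.zeros(max_len)
    count = np.zeros(max_len)
    for nll in kept:
        total[: len(nll)] += nll
        count[: len(nll)] += 1  # documents differ in length
    return np.exp(total / count)


toy_nll = [
    [0.0, np.log(4.0)],
    [0.0, np.log(4.0), np.log(2.0)],
    [5.0],  # too short for min_len=2, so it is filtered out
]
curve = positionwise_perplexity(toy_nll, min_len=2)
```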

Our fine-tuned models (orange and green) both clearly outperform the base model in blue. However, at later sequence positions, the bf16-MHA model has higher perplexity than the fp32-MHA model. This suggests that positional information may be less effective at later sequence positions when computed in lower precision. Thus, for all large-scale long-sequence training, we fix the precision of multi-head attention and ALiBi to be in FP32.
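One way to build intuition for this effect: bfloat16 keeps only 8 significant bits, so large integer position indices cannot all be represented exactly, and neighboring positions can round to the same value. The sketch below emulates bfloat16 rounding in NumPy to show the collapse. It is a simplified illustration, not the training code; the `to_bf16` helper is hypothetical, and a real ALiBi implementation may lose precision at other steps as well.

```python
import numpy as np


def to_bf16(x: np.ndarray) -> np.ndarray:
    """Round float32 values to the nearest bfloat16, returned as float32."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    # Round to nearest, ties to even, on the 16 low bits that bfloat16 drops.
    rounding = ((bits >> 16) & 1) + np.uint32(0x7FFF)
    return ((bits + rounding) & np.uint32(0xFFFF0000)).view(np.float32)


positions = np.arange(8192, dtype=np.float32)
rounded = to_bf16(positions)

# Far fewer distinct values survive than there are positions.
distinct = len(np.unique(rounded))
print(f"{distinct} distinct bf16 values for {len(positions)} positions")

# Neighboring late positions become indistinguishable.
print(to_bf16(np.array([4000.0, 4001.0], dtype=np.float32)))
```

Since ALiBi biases attention scores in proportion to token distance, positions that round to the same value receive the same bias, which is consistent with the degradation seen at later positions for bf16-MHA.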

Evaluations

LONG SEQUENCE TASKS AT 8K

We evaluate BLOOM and BLOOM-LSS on SCROLLS [9], a collection of long-sequence natural language benchmarks including summarization and question-answering tasks. We use the EleutherAI lm-evaluation-harness to run our evaluations in a zero-shot manner, with a maximum sequence length of 8192. We do not report on ContractNLI, which we observed to have inconsistencies with publicly reported numbers. BLOOM-LSS outperforms BLOOM on every benchmark, achieving 170% improvement on average.

Figure 2: BLOOM and BLOOM-LSS evaluated on SCROLLS.

LONG SEQUENCE TASKS AT 32K (W/ INTERPOLATION)

We additionally evaluate BLOOM and BLOOM-LSS on SCROLLS with a minimum sequence length of 8,192 and a maximum sequence length of 32,768. In these evaluations, all input sequences are longer than those seen during pretraining for both models.

We show results both with and without positional interpolation (PI), as described in [10]. For results with PI, we interpolate from each model’s maximum pretraining sequence length (2048 for BLOOM, 8192 for BLOOM-LSS) to 32,768.
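The post does not spell out how interpolation is applied to ALiBi ([10] presents it for rotary position embeddings). A plausible reading, assumed in the sketch below, is that position indices are scaled by train_len / seq_len so that token distances never exceed those seen in pretraining. The function names are hypothetical, and the slope formula is the standard ALiBi one for power-of-two head counts only.

```python
from typing import Optional

import numpy as np


def alibi_slopes(n_heads: int) -> np.ndarray:
    """Standard ALiBi slopes; this sketch handles power-of-two head counts only."""
    assert n_heads > 0 and (n_heads & (n_heads - 1)) == 0
    start = 2.0 ** (-8.0 / n_heads)
    return start ** np.arange(1, n_heads + 1)


def alibi_bias(
    n_heads: int, seq_len: int, train_len: Optional[int] = None
) -> np.ndarray:
    """Bias of shape (heads, query, key), optionally with interpolated positions."""
    scale = 1.0
    if train_len is not None and seq_len > train_len:
        scale = train_len / seq_len  # assumed form of interpolation for ALiBi
    pos = np.arange(seq_len, dtype=np.float32) * scale
    # distance[i, j] = pos[j] - pos[i]; it is <= 0 for keys at or before the query.
    # Entries for future keys (j > i) would be removed by the causal mask.
    distance = pos[None, :] - pos[:, None]
    return alibi_slopes(n_heads)[:, None, None] * distance[None, :, :]


plain = alibi_bias(n_heads=8, seq_len=16)
interpolated = alibi_bias(n_heads=8, seq_len=16, train_len=4)
```

In this toy setting the largest distance shrinks from 15 to 3.75, which stays inside the range covered by a training length of 4.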

Figure 3: BLOOM, BLOOM-LSS, BLOOM (PI), and BLOOM-LSS (PI) evaluated on SCROLLS.

We observe in Figure 3 that BLOOM and BLOOM-LSS both underperform their PI’d counterparts. This is particularly the case for summarization benchmarks like GovRep, QMSum, and SumScreenFD. Overall, BLOOM-LSS with PI attains the highest scores on all tasks.

Sample Multilingual Generations

We re-sample multilingual generations from BLOOMChat-v2 to verify that our long sequence finetuning procedure has maintained the multilingual chat abilities of BLOOMChat-v1. Note that these generations are sampled with the same <human>: and <bot>: tags that the model was finetuned with.

Create a survey that asks about eating and cooking habits.

Sure! Here’s a sample survey that can be used to gather information about eating and cooking habits:

Do you cook at home regularly?
If so, what are your favorite cooking methods
What type of cuisine do you prefer to eat?
Do you have any food allergies or restrictions?
How often do you eat out?
What are your favorite restaurants or dishes to order from?
How do you feel about meal-prepping?
Do you have any tips for keeping healthy while eating out?
What’s your favorite way to incorporate vegetables into your meals?
What’s your favorite way to cook protein?

Feel free to copy and paste this list, or use it as a reference to create your own survey.

请向我推荐一部周末看的电影并说明原因。

(Recommend a movie for me to watch during the weekend and explain the reason.)

我推荐的电影是疯狂动物城(Zootopia)。这部电影讲述了一个发生在动物城市里的故事。主人公是一个名叫Judy的兔子警官，她也是一个初来乍到的新人。因为初来乍到，所以她受到了其他动物同事的排挤。不过她并没有气馁，而是用自己的行动证明了自己。在解决一个一个案子，打击罪犯的同时，她也在这个过程中收获了友情和爱情。这部电影不仅有深刻的教育意义，而且还有好看的动作和搞笑的场面。所以我觉得这是一部很适合周末看的电影。

दिल्ली से जयपुर कैसे जाये? बस से या कार से? दोनो के बारे मे विस्तार मे बताये.

(How do I go from Delhi to Jaipur? Bus or car? Details please.)

आप दिल्ली से जयपुर कार से या बस से जा सकते है, दोनों ही तरीके आसान है।

अगर आप कार से यात्रा करते है तो यह यात्रा 4 घंटे में पूरी हो जाएगी। आपको दिल्ली से जयपुर के लिए कुछ पॉपुलर मार्ग इस तरह से है –

नेशनल हाई-वे 2 (NH 2) – यह दिल्ली से जयपुर जाने का सबसे पॉपुलर मार्ग है और यह पूरी तरह से एक फोर-लेन मार्ग है। इस मार्ग पर आप बिना किसी रोक-टोक के दिल्ली से जयपुर तक पहुंच सकते है।
दिल्ली-जयपुर मार्ग (Delhi-Jaipur Highway) – यह दिल्ली से जयपुर जाने का एक और आसान मार्ग है। यह जयपुर से दिल्ली की और जाने वाले NH 2 को जयपुर में दिल्ली-जयपुर राष्ट्रीय राजमार्ग (Delhi-Jaipur National Highway) कहते है।
दिल्ली-जयपुर मार्ग (Delhi-Jaipur Highway) – यह NH 2 दिल्ली से जयपुर तक जाने का एक और मार्ग है। यह दिल्ली से जयपुर की और जाने वाले NH 2 से जुड़ा हुआ है और दिल्ली से जयपुर तक सीधा जाता है।
दिल्ली-जयपुर मार्ग (Delhi-Jaipur Highway) – यह NH 2 दिल्ली से जयपुर तक जाने का एक और मार्ग है। यह दिल्ली से जयपुर की और जाने वाले NH 2 से जुड़ा हुआ है और दिल्ली से जयपुर तक सीधा जाता है।
जयपुर जाने के लिए आप इनमें से किसी भी मार्ग का चुनाव कर सकते है और आसानी से जयपुर पहुंच सकते है।

जयपुर जाने के लिए आप सरकारी बसों का भी चुनाव कर सकते है। दिल्ली से जयपुर के लिए कई बसों की सुविधा उपलब्ध है। दिल्ली से जयपुर के लिए आप दिल्ली के सराय रोहिल्ला बस अड्डे से आसानी से बस ले सकते है। सराय रोहिल्ला से जयपुर के लिए कई बसें उपलब्ध है जो आपको जयपुर पहुंचा देगी।

जयपुर जाने के लिए आप प्राइवेट कैब का भी उपयोग कर सकते है। यह सुविधा आप ऑनलाइन भी ले सकते है। कैब से दिल्ली से जयपुर जाने में आपको लगभग 4 घंटे लगेंगे।

¿Me puedes dar algunas recomendaciones de juegos? Estoy buscando juegos de rol con un sistema de magia único, un sistema de construcción único y una buena historia. Por favor limítate a juegos que se puedan jugar en PC directamente, o por emulación.

(Can you provide some game recommendations? I’m looking for a Role Playing Game with a unique magic system, an in-depth crafting system and a good story. Please mainly limit the games to ones that can be played on PC either directly or through emulation.)

1.- “Spellstorm”: Este juego de rol tiene un sistema de magia único que se basa en la manipulación de los elementos y la creación de tormentas de magia. El sistema de construcción es basado en habilidades y la historia es interesante, con una ambientación de fantasía oscura.

2.- “Infinity Engine”: Este es un sistema de juego de rol creado por “Rol Team B”, que cuenta con varios títulos de calidad. El sistema de magia es único y se basa en la manipulación de portales interdimensionales. El sistema de construcción es basado en habilidades y la historia es interesante, con una ambientación de fantasía.

3.- “Arcana Evolved”: Este juego de rol tiene un sistema de magia único que se basa en la manipulación de las leyes fundamentales del universo. El sistema de construcción es basado en habilidades y atributos, y la historia es interesante, con una ambientación de ciencia ficción.

4.- “Fate”: Este es un sistema de juego de rol que se enfoca en la narrativa y la toma de decisiones. El sistema de magia es inexistente, pero la historia es muy interesante, con una ambientación de fantasía y ciencia ficción.

Espero que estas opciones te sean de ayuda.

Je me suis séparée de ma copine et je suis très triste. Est ce que tu peux me donner des conseils?

(I broke up with my girlfriend and I am feeling so sad. Can you give me some suggestions?)

Oui bien sûr, je suis là pour vous aider! Alors, premièrement, prenez le temps de vivre votre deuil. Pas de précipitation, il vous faut vivre chaque étape, et chacune à leur rythme. Deuxièmement, n’hésitez pas à vous entourer des personnes qui vous sont chères, famille, amis, ils seront d’un grand réconfort. Troisièmement, faites quelque chose qui vous fait plaisir, pour vous changer les idées, même si cela peut sembler dérisoire, cela vous fera du bien. Quatrièmement, n’oubliez pas que ce n’est que passager, et que cela fera bientôt partie du passé. Enfin, cinquièmement, si vous en ressentez le besoin, n’hésitez pas à consulter un professionnel de la santé, psychologue ou psychiatre, ils sont là pour vous aider.

هل العمل نصف دوام فكرة جيدة لمراهقين بين عمر ١٥ و ١٩ عام

(Is working part-time a good idea for teenagers between 15 and 19 years old?)

نعم ، يمكن أن يكون العمل نصف دوام فكرة جيدة للمراهقين بين عمري 15 و 19 عامًا ، حيث يمكنهم من خلالها اكتساب خبرة العمل الأولى وتنمية مهارات العمل المهمة. بالإضافة إلى ذلك، يمكن أن يكون العمل نصف دوام فرصة جيدة للمراهقين لتعلم كيفية إدارة الوقت بشكل جيد.

ومع ذلك، من المهم أن تراعي الوظائف التي يتم شغلها خلال نصف الدوام احتياجات المراهقين وتجنب أي ضرر ممكن. بالإضافة إلى ذلك، من المهم أن يكون لدى المراهقين وقت فراغ كافٍ للاسترخاء والتعرف على أقرانهم وتنمية مهاراتهم الأخرى.

بالإضافة إلى ذلك، من المهم أن يدعم الآباء عمل المراهقين، ويشجعونهم على الاستمرار في الدراسة بنجاح، ويقدمون الدعم العاطفي لهم.

Appendix

A | PRETRAINING DATA DETAILS

We ran continuous pretraining on top of BLOOM with 2 datasets, DS1 and DS2. We constructed DS1 first to quickly materialize a high-quality dataset in order to start training as soon as possible. While we trained on DS1, we worked on gathering data for DS2. After training on DS1 for 20B tokens, we switched to DS2 and trained for an additional 7B tokens. The pre-processing procedure described in Appendix A.4 was applied to both datasets.

A.1 | DS1

DS1 is an 85B token corpus consisting of the following datasets:

mC4 non-English web data
RefinedWeb English web data
StarCoder data

A.2 | DS2

DS2 is a 2.3T token corpus consisting of the following datasets:

mC4 non-English web data
RefinedWeb English web data
StarCoder
The PILE
Finance-related text primarily from 10-Ks/10-Qs (see Appendix A.5 for further details)
Pile of Law

A.3 | MC4 LANGUAGE SELECTION

DS1 and DS2 differed in languages selected from mC4 due to higher-resource languages taking longer to preprocess. The languages included in DS2 are a strict superset of those included in DS1. We outline the language selection of each dataset in Table 2 below.

Language	DS1	DS2
Arabic	not included	included
Bengali	not included	included
Catalan	included	included
Spanish	not included	included
Basque	included	included
French	not included	included
Gujarati	included	included
Hindi	not included	included
Indonesian	included	included
Igbo	included	included
Kannada	included	included
Malayalam	included	included
Marathi	included	included
Nepali	included	included
Chewa	included	included
Punjabi	included	included
Portuguese	not included	included
Shona	included	included
Sotho	included	included
Swahili	included	included
Tamil	included	included
Telugu	included	included
Urdu	included	included
Vietnamese	not included	included
Xhosa	included	included
Yoruba	included	included
Chinese	not included	included
Zulu	included	included

A.4 | DATASET PREPROCESSING

We construct a filtered version of mC4-3.1.0, introduced by Chung et al. [2]. mC4-3.1.0 has already been cleaned and deduplicated to some extent, following procedures described in UniMax [2], mT5 [6] and T5 [7]; we perform the following additional preprocessing steps, largely inspired by the steps used in the RefinedWeb [4] and SlimPajama [5] corpora.

Language filtering: selecting languages present in ROOTS [3], BLOOM’s original pretraining corpus
URL filtering: removing any sources contained in the list of websites here
NFC normalization
Quality filtering:
1. Removing documents shorter than 200 characters
2. Removing documents in which digits make up more than 20% of characters
3. Removing documents in which punctuation makes up more than 20% of characters
Fuzzy deduplication
Exact deduplication
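The normalization, quality filtering, and exact deduplication steps above can be sketched with the Python standard library. This is a minimal illustration rather than the pipeline that was actually run: the function names are hypothetical, punctuation is assumed to mean Unicode category P, exact duplicates are detected by hashing the normalized text, and language filtering, URL filtering, and fuzzy deduplication are left out.

```python
import hashlib
import unicodedata
from typing import Iterable, List


def passes_quality_filters(
    text: str,
    min_chars: int = 200,
    max_digit_frac: float = 0.20,
    max_punct_frac: float = 0.20,
) -> bool:
    """Apply the three quality rules: length, digit share, punctuation share."""
    if len(text) < min_chars:
        return False
    n = len(text)
    digits = sum(ch.isdigit() for ch in text)
    # Assumption: "punctuation" means any Unicode category starting with "P".
    punct = sum(unicodedata.category(ch).startswith("P") for ch in text)
    return digits / n <= max_digit_frac and punct / n <= max_punct_frac


def clean_corpus(docs: Iterable[str]) -> List[str]:
    """NFC-normalize, quality-filter, and exactly deduplicate documents."""
    seen = set()
    kept = []
    for doc in docs:
        text = unicodedata.normalize("NFC", doc)
        if not passes_quality_filters(text):
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:  # exact duplicate of an earlier document
            continue
        seen.add(digest)
        kept.append(text)
    return kept


good = "word " * 50  # 250 characters of plain text
result = clean_corpus([good, good, "1" * 300, "!" * 300, "too short"])
```

Normalizing before hashing means documents that differ only in their Unicode composition are treated as duplicates.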

A.5 | FINANCE DATA

Our financial text data comes from 3 sources:

10-K/10-Q reports downloaded using the sec-edgar Python library
YouTube transcripts from videos containing financial keywords
arXiv papers from the Economics and Quantitative Finance fields (refer to the arXiv taxonomy here)

B | Long Sequence Pretraining Hyperparameters

Long sequence pretraining took place across 256 sockets of SambaNova SN20 chips in an 8-way tensor-parallel, 32-way data-parallel setting.

Learning Rate	6e-6
Learning Rate Schedule	Flat
Weight Decay	0.1
Max gradient norm	1.0
Sequence Length	8192
Batch size	2048
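The figures above imply some useful back-of-the-envelope numbers. The sketch below works them out, assuming that the batch size counts sequences (not tokens) and that the stage 1 token budget is the 20B + 7B reported in Appendix A; the derived values are estimates, not reported results.

```python
SEQ_LEN = 8192
BATCH_SEQS = 2048        # assumed to be a count of sequences per step
TENSOR_PARALLEL = 8
DATA_PARALLEL = 32
TOTAL_TOKENS = 20e9 + 7e9  # DS1 + DS2 token counts from Appendix A

# 8-way tensor parallelism x 32-way data parallelism accounts for all sockets.
sockets = TENSOR_PARALLEL * DATA_PARALLEL

# Tokens consumed per optimizer step across the whole batch.
tokens_per_step = SEQ_LEN * BATCH_SEQS

# Sequences handled by each data-parallel replica per step.
seqs_per_replica = BATCH_SEQS // DATA_PARALLEL

# Rough number of optimizer steps for the stage 1 token budget.
approx_steps = TOTAL_TOKENS / tokens_per_step

print(sockets, tokens_per_step, seqs_per_replica, round(approx_steps))
```

Under these assumptions, each step consumes about 16.8M tokens, so the whole of stage 1 corresponds to only on the order of 1,600 optimizer steps.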

C | Instruction Tuning + Alignment Hyperparameters and Details

Learning Rate	6e-6
Learning Rate Schedule	Cosine Decay
Learning Rate Warmup	0
Weight Decay	6e-7
Max gradient norm	1.0
Sequence Length	8192
Batch size	128

For the instruction-tuning and alignment stages, we apply completion-only loss masking. In both stages, we also prepend <human>: to the input text and append <bot>: to it. We lower the learning rate compared to BLOOMChat-v1 due to a much larger effective batch size from extending the sequence length. For the original scripts, see the original BLOOMChat data preparation repository here.

References

[1] Workshop, BigScience, et al. “Bloom: A 176b-parameter open-access multilingual language model.” arXiv preprint arXiv:2211.05100 (2022).

[2] Chung, Hyung Won, et al. “Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining.” arXiv preprint arXiv:2304.09151 (2023).

[3] Laurençon, Hugo, et al. “The bigscience roots corpus: A 1.6 tb composite multilingual dataset.” Advances in Neural Information Processing Systems 35 (2022): 31809-31826.

[4] Penedo, Guilherme, et al. “The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only.” arXiv preprint arXiv:2306.01116 (2023).

[5] Soboleva, Daria, et al. “SlimPajama: A 627B token, cleaned and deduplicated version of RedPajama.” URL https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama.

[6] Xue, Linting, et al. “mT5: A massively multilingual pre-trained text-to-text transformer.” arXiv preprint arXiv:2010.11934 (2020).

[7] Raffel, Colin, et al. “Exploring the limits of transfer learning with a unified text-to-text transformer.” The Journal of Machine Learning Research 21.1 (2020): 5485-5551.

[8] Press, Ofir, Noah A. Smith, and Mike Lewis. “Train short, test long: Attention with linear biases enables input length extrapolation.” arXiv preprint arXiv:2108.12409 (2021).

[9] Shaham, Uri, et al. “Scrolls: Standardized comparison over long language sequences.” arXiv preprint arXiv:2201.03533 (2022).

[10] Chen, Shouyuan, et al. “Extending context window of large language models via positional interpolation.” arXiv preprint arXiv:2306.15595 (2023).

[11] https://sambanova.ai/wp-content/uploads/2021/04/SambaNova_Accelerated-Computing-with-a-Reconfigurable-Dataflow-Architecture_Whitepaper_English.pdf
