We are proud to release BLOOMChat-v2, a 176B multilingual language model with a 32K sequence length, trained on top of BigScience's BLOOM [1] model. It is the largest (by parameter count) open-source model that can be run with 32,768-token sequences. In this blog post, we present in-depth training details, as well as extensive evaluations and comparisons to the base checkpoints. We show that BLOOMChat-v2 achieves up to a 170% improvement over BLOOM on SCROLLS [9], a long-sequence benchmark. BLOOMChat-v2 was trained on RDU [11] in mixed-precision bfloat16.
BLOOMChat-v2 model card: https://huggingface.co/sambanovasystems/BLOOMChat-176B-v2
For example training and inference scripts, check out our BLOOMChat GitHub repository: https://github.com/sambanova/bloomchat
If you have any further questions, feel free to ask via our public Discord server!
BLOOMChat-v2 followed a 3-stage finetuning procedure:
Stage 1 serves to extend the model's native maximum sequence length beyond 2048. Stages 2 and 3 follow a similar protocol to BLOOMChat-v1 (see blog post here) and serve as chat-alignment steps. For all stages, we pack data into sequences of length 8192.
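As a concrete illustration of the packing step, the sketch below concatenates tokenized documents and slices them into fixed 8192-token sequences. Whether documents are separated by an EOS token and how the trailing partial sequence is handled are assumptions of the sketch, not details taken from our pipeline.

```python
# A minimal sketch of packing tokenized documents into fixed 8192-token sequences.
# Assumptions: documents are separated by a single EOS token, and any trailing
# partial sequence is dropped.
from typing import Iterable, List

SEQ_LEN = 8192

def pack(token_streams: Iterable[List[int]], eos_id: int) -> List[List[int]]:
    buffer: List[int] = []
    packed: List[List[int]] = []
    for tokens in token_streams:
        buffer.extend(tokens + [eos_id])
        while len(buffer) >= SEQ_LEN:
            packed.append(buffer[:SEQ_LEN])
            buffer = buffer[SEQ_LEN:]
    return packed
```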
In stage 1, we continuously pretrained BLOOM on approximately 30B tokens of multilingual data packed into sequences of length 8192. We refer to the resulting model after stage 1 as BLOOM-LSS, and we evaluate it on SCROLLS below.
The stage 1 data was a mixture of crawled web data and curated high-quality text that covered a subset of the languages in ROOTS [3], BLOOM’s original pretraining dataset. For further details on dataset preprocessing and the makeup of the long sequence dataset mixture, see Appendix A and Appendix B respectively.
In stage 2, we train BLOOM-LSS on the OIG dataset for 1 epoch. Inputs are packed into sequences of length 8192, and we apply completion-only loss masking. We also apply the <human>: and <bot>: chat templates to the input and completion texts respectively.
See Appendix C for further training hyperparameters.
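For concreteness, here is a minimal sketch of how an instruction/response pair can be formatted with the <human>: and <bot>: tags and masked for completion-only loss. The exact whitespace around the tags and the use of the base BLOOM tokenizer are assumptions of the sketch.

```python
# A minimal sketch of chat formatting plus completion-only loss masking.
# The exact whitespace/newlines around the tags are assumptions; only the
# <human>: / <bot>: tags themselves come from this post.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss

def build_example(instruction: str, response: str) -> dict:
    prompt = f"<human>: {instruction}\n<bot>:"
    prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
    completion_ids = tokenizer(" " + response, add_special_tokens=False).input_ids
    input_ids = prompt_ids + completion_ids
    # Completion-only loss: mask every prompt token so the loss is computed
    # only on the <bot> response.
    labels = [IGNORE_INDEX] * len(prompt_ids) + completion_ids
    return {"input_ids": input_ids, "labels": labels}
```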
In stage 3, we train on the combined databricks-dolly-15k and oasst1 datasets for 16 epochs. As in the previous stage, inputs are packed into sequences of length 8192, we apply completion-only loss masking, and we apply the <human>: and <bot>: chat templates. See Appendix C for further training hyperparameters.
BLOOMChat-v2 is the resulting model after stage 3.
All 3 stages were run in mixed-precision bfloat16 with float32 master weights on RDU. We ensured that ALiBi [8] and multi-head attention operations were performed in FP32. These choices are motivated by the observations in our previous blog post on the relationship between ALiBi and lower-precision data formats. Intuitively, we want to reap the throughput benefits of training in lower precision whilst maintaining the integrity of training on long sequences.
To verify our intuitions, we perform a small-scale ablation where we continuously pretrain BLOOM-560M (the smallest variant) on the same long-sequence dataset as in stage 1. We run two experiments: one entirely in BF16, and the other with multi-head attention and ALiBi in FP32 (with the rest of the model’s operations in BF16). In the plots below, we call these models bf16-MHA and fp32-MHA respectively.
In Figure 1, we plot perplexity on a held-out validation dataset, with sequences filtered to be of length 4000 or greater. The x-axis denotes sequence position, and the y-axis denotes the average perplexity of that position across the held-out dataset.
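The sketch below shows one way such a curve can be computed: accumulate the token-level negative log-likelihood at each position over the held-out sequences, then exponentiate the per-position mean. For simplicity, the sketch assumes all sequences are truncated to a common length; the exact aggregation behind Figure 1 may differ slightly.

```python
# A sketch of per-position perplexity: exponentiate the mean token-level NLL at
# each sequence position over a held-out set. Assumes every batch is truncated
# to the same length (seq_len).
import torch
import torch.nn.functional as F

@torch.no_grad()
def per_position_perplexity(model, batches, seq_len: int = 4000) -> torch.Tensor:
    nll_sum = torch.zeros(seq_len - 1)
    num_sequences = 0
    for input_ids in batches:                     # (batch, seq_len) LongTensor
        logits = model(input_ids).logits          # (batch, seq_len, vocab)
        shift_logits = logits[:, :-1, :]          # predict token t+1 from token t
        shift_labels = input_ids[:, 1:]
        nll = F.cross_entropy(
            shift_logits.transpose(1, 2), shift_labels, reduction="none"
        )                                         # (batch, seq_len - 1)
        nll_sum += nll.float().sum(dim=0).cpu()
        num_sequences += input_ids.shape[0]
    return torch.exp(nll_sum / num_sequences)     # perplexity at each position
```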
Our fine-tuned models (orange and green) both clearly outperform the base model in blue. However, at the later sequence positions the bf16-MHA model has higher perplexity, suggesting that positional information becomes less effective at long range when computed in lower precision. Thus, for all large-scale long-sequence training, we fix the precision of multi-head attention and ALiBi to FP32.
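For concreteness, the PyTorch sketch below illustrates this precision policy (it is an illustration, not the RDU kernels): queries, keys, values, and the ALiBi bias are upcast so that attention scores and the softmax run in float32, and the output is cast back to bfloat16 for the rest of the network.

```python
# A minimal sketch of the precision policy: attention scores, ALiBi bias, and
# softmax in float32; inputs and outputs in bfloat16. Illustrative only.
import torch

def fp32_alibi_attention(q, k, v, alibi_bias, causal_mask):
    # q, k, v: (batch, heads, seq, head_dim) in bfloat16
    # alibi_bias: (heads, seq, seq) kept in float32
    # causal_mask: boolean (seq, seq), True where attention is disallowed
    q32, k32, v32 = q.float(), k.float(), v.float()
    scores = q32 @ k32.transpose(-1, -2) / (q32.shape[-1] ** 0.5)
    scores = scores + alibi_bias                  # ALiBi added in float32
    scores = scores.masked_fill(causal_mask, float("-inf"))
    probs = torch.softmax(scores, dim=-1)         # softmax in float32
    out = probs @ v32
    return out.to(torch.bfloat16)                 # back to bf16 for the rest of the network
```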
We evaluate BLOOM and BLOOM-LSS on SCROLLS [9], a collection of long-sequence natural language benchmarks including summarization and question-answering tasks. We use the EleutherAI lm-evaluation-harness to run our evaluations in a zero-shot manner, with a maximum sequence length of 8192. We do not report on ContractNLI, where we observed inconsistencies with publicly reported numbers. BLOOM-LSS outperforms BLOOM on every benchmark, achieving a 170% improvement on average.
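For reference, the snippet below sketches how such a zero-shot SCROLLS run can be launched through the harness's Python API. Task names and the exact API surface vary across harness versions, so treat the identifiers as illustrative rather than the exact commands we ran.

```python
# A hedged sketch of a zero-shot SCROLLS evaluation with lm-evaluation-harness.
# Task names below follow recent harness versions and may differ in older ones.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=bigscience/bloom,dtype=bfloat16,max_length=8192",
    tasks=["scrolls_govreport", "scrolls_qmsum", "scrolls_summscreenfd",
           "scrolls_narrativeqa", "scrolls_qasper", "scrolls_quality"],
    num_fewshot=0,
)
print(results["results"])
```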
We additionally evaluate BLOOM-LSS on SCROLLS with a minimum sequence length of 8,192 and a maximum sequence length of 32,768. In these evaluations, all input sequences are longer than those seen during pretraining for both models.
We show results both with and without positional interpolation (PI), as described in [10]. For results with PI, we interpolate from each model's maximum pretraining sequence length (2048 for BLOOM, 8192 for BLOOM-LSS) to 32,768.
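[10] formulates PI for rotary position embeddings; for an ALiBi model, the analogous operation rescales the relative distances that enter the bias so that a 32,768-token window spans the bias range seen during training. The sketch below shows one way to build such an interpolated bias; it is our illustration of the idea, not necessarily the exact implementation we used.

```python
# A hedged sketch of PI applied to ALiBi: shrink relative distances by
# (pretraining length / target length) before multiplying by the per-head slopes.
import torch

def interpolated_alibi_bias(slopes: torch.Tensor, seq_len: int,
                            train_len: int, target_len: int) -> torch.Tensor:
    # slopes: (num_heads,) per-head ALiBi slopes
    positions = torch.arange(seq_len)
    distances = (positions[None, :] - positions[:, None]).float()  # j - i
    scale = train_len / target_len            # e.g. 8192 / 32768 for BLOOM-LSS
    bias = slopes[:, None, None] * distances[None, :, :] * scale   # (H, S, S)
    return bias  # added to attention scores before the causal mask and softmax
```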
We observe in Figure 3 that BLOOM and BLOOM-LSS both underperform their counterparts with PI. This is particularly the case for summarization benchmarks like GovRep, QMSum, and SumScreenFD. Overall, BLOOM-LSS with PI attains the highest scores on all tasks.
We re-sample multilingual generations from BLOOMChat-v2 to verify that our long sequence finetuning procedure has maintained the multilingual chat abilities of BLOOMChat-v1. Note that these generations are sampled with the same <human>: and <bot>: tags that the model was finetuned with.
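For reference, the sketch below shows how a response can be sampled from BLOOMChat-v2 with these tags via Hugging Face transformers. The generation hyperparameters are illustrative rather than the ones used for the samples in this post, and loading the full 176B checkpoint requires a multi-device setup.

```python
# A minimal sketch of sampling with the <human>: / <bot>: chat tags.
# Sampling parameters here are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "sambanovasystems/BLOOMChat-176B-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype="auto", device_map="auto")

prompt = "<human>: Quelle est la capitale de la France ?\n<bot>:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs, max_new_tokens=256, do_sample=True,
    top_p=0.9, temperature=0.8, repetition_penalty=1.1,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```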
We ran continuous pretraining on top of BLOOM with 2 datasets, DS1 and DS2. We constructed DS1 first to quickly materialize a high-quality dataset in order to start training as soon as possible. While we trained on DS1, we worked on gathering data for DS2. After training on DS1 for 20B tokens, we switched to DS2 and trained for an additional 7B tokens. The pre-processing procedure described in Appendix A.4 was applied to both datasets.
DS1 is an 85B token corpus consisting of the following datasets:
DS2 is a 2.3T token corpus consisting of the following datasets:
DS1 and DS2 differed in languages selected from mC4 due to higher-resource languages taking longer to preprocess. The languages included in DS2 are a strict superset of those included in DS1. We outline the language selection of each dataset in Table 2 below.
Language | DS1 | DS2 |
--- | --- | --- |
Arabic | not included | included |
Bengali | not included | included |
Catalan | included | included |
Spanish | not included | included |
Basque | included | included |
French | not included | included |
Gujarati | included | included |
Hindi | not included | included |
Indonesian | included | included |
Igbo | included | included |
Kannada | included | included |
Malayalam | included | included |
Marathi | included | included |
Nepali | included | included |
Chewa | included | included |
Punjabi | included | included |
Portuguese | not included | included |
Shona | included | included |
Sotho | included | included |
Swahili | included | included |
Tamil | included | included |
Telugu | included | included |
Urdu | included | included |
Vietnamese | not included | included |
Xhosa | included | included |
Yoruba | included | included |
Chinese | not included | included |
Zulu | included | included |
We construct a filtered version of mC4-3.1.0, introduced by Chung et al. [2]. mC4-3.1.0 has already been cleaned and deduplicated to some extent, following procedures described in UniMax [2], mT5 [6], and T5 [7]; we perform the following additional preprocessing steps, largely inspired by those used for the RefinedWeb [4] and SlimPajama [5] corpora.
Our financial text data comes from 3 sources:
Long sequence pretraining took place across 256 sockets of SambaNova SN20 chips in an 8-way tensor-parallel, 32-way data-parallel configuration.
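As a quick sanity check on this layout and the per-replica workload it implies (using the batch size and sequence length from the first table below), purely illustrative arithmetic:

```python
# Illustrative arithmetic only: check the socket count and derive the
# per-replica batch and tokens per optimizer step from the values in this post.
TENSOR_PARALLEL = 8            # ways each model replica is sharded
DATA_PARALLEL = 32             # number of model replicas
assert TENSOR_PARALLEL * DATA_PARALLEL == 256   # sockets used

GLOBAL_BATCH = 2048            # sequences per optimizer step
SEQ_LEN = 8192                 # tokens per sequence
print(GLOBAL_BATCH // DATA_PARALLEL)  # 64 sequences per data-parallel replica
print(GLOBAL_BATCH * SEQ_LEN)         # 16,777,216 tokens per optimizer step
```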
Hyperparameter | Value |
--- | --- |
Learning Rate | 6e-6 |
Learning Rate Schedule | Flat |
Weight Decay | 0.1 |
Max gradient norm | 1.0 |
Sequence Length | 8192 |
Batch size | 2048 |
Hyperparameter | Value |
--- | --- |
Learning Rate | 6e-6 |
Learning Rate Schedule | Cosine Decay |
Learning Rate Warmup | 0 |
Weight Decay | 6e-7 |
Max gradient norm | 1.0 |
Sequence Length | 8192 |
Batch size | 128 |
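A minimal PyTorch sketch of an optimizer and schedule configured with the values in the table above is shown below. The optimizer family (AdamW) and the total number of steps are assumptions; only the learning rate, weight decay, cosine schedule, zero warmup, and gradient-norm clipping come from the table.

```python
# A minimal sketch of the listed settings in PyTorch. AdamW and total_steps
# are assumptions; lr, weight decay, cosine decay, no warmup, and max grad
# norm 1.0 come from the table above.
import torch

def build_optimizer(model: torch.nn.Module, total_steps: int):
    optimizer = torch.optim.AdamW(model.parameters(), lr=6e-6, weight_decay=6e-7)
    # Cosine decay over all steps, with no warmup.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
    return optimizer, scheduler

def clip_gradients(model: torch.nn.Module):
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```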
For the instruction-tuning and alignment stages, we apply completion-only loss masking. In both stages, we also prepend <human>: to the input text and append <bot>: after it, so that the model's completion follows the <bot>: tag. We lower the learning rate compared to BLOOMChat-v1 because extending the sequence length results in a much larger effective batch size. For the original scripts, see the original BLOOMChat data preparation repository here.
[1] Workshop, BigScience, et al. “Bloom: A 176b-parameter open-access multilingual language model.” arXiv preprint arXiv:2211.05100 (2022).
[2] Chung, Hyung Won, et al. “Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining.” arXiv preprint arXiv:2304.09151 (2023).
[3] Laurençon, Hugo, et al. “The bigscience roots corpus: A 1.6 tb composite multilingual dataset.” Advances in Neural Information Processing Systems 35 (2022): 31809-31826.
[4] Penedo, Guilherme, et al. “The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only.” arXiv preprint arXiv:2306.01116 (2023).
[5] Soboleva, Daria, et al. “SlimPajama: A 627B token, cleaned and deduplicated version of RedPajama.” URL https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama.
[6] Xue, Linting, et al. “mT5: A massively multilingual pre-trained text-to-text transformer.” arXiv preprint arXiv:2010.11934 (2020).
[7] Raffel, Colin, et al. “Exploring the limits of transfer learning with a unified text-to-text transformer.” The Journal of Machine Learning Research 21.1 (2020): 5485-5551.
[8] Press, Ofir, Noah A. Smith, and Mike Lewis. “Train short, test long: Attention with linear biases enables input length extrapolation.” arXiv preprint arXiv:2108.12409 (2021).
[9] Shaham, Uri, et al. “Scrolls: Standardized comparison over long language sequences.” arXiv preprint arXiv:2201.03533 (2022).
[10] Chen, Shouyuan, et al. “Extending context window of large language models via positional interpolation.” arXiv preprint arXiv:2306.15595 (2023).