We are proud to release BLOOMChat-v2, a 176B multilingual language model with a 32K sequence length, trained on top of BigScience's BLOOM [1] model. It is the largest (by parameter count) open-source model that can be run with 32,768-token sequences. In this blog post, we present in-depth training details, as well as extensive evaluations and comparisons to the base checkpoints. We show that BLOOMChat-v2 achieves up to a 170% improvement over BLOOM on SCROLLS [9], a long-sequence benchmark. BLOOMChat-v2 was trained on RDU [11] in mixed-precision bfloat16.
BLOOMChat-v2 model card: https://huggingface.co/sambanovasystems/BLOOMChat-176B-v2
For example training and inference scripts, check out our BLOOMChat GitHub repository: https://github.com/sambanova/bloomchat
If you have any further questions, feel free to ask via our public Discord server!
BLOOMChat-v2 followed a 3-stage finetuning procedure:
Stage 1 serves to extend the model's native maximum sequence length beyond 2048. Stages 2 and 3 follow a similar protocol to BLOOMChat-v1 (see blog post here) and serve as chat-alignment steps. For all stages, we pack data into sequences of length 8192.
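As a concrete illustration of the packing step, the sketch below concatenates tokenized documents and slices them into fixed 8192-token sequences. Whether documents are separated by an EOS token and how the trailing partial sequence is handled are assumptions of the sketch, not details taken from our pipeline.

```python
# A minimal sketch of packing tokenized documents into fixed 8192-token sequences.
# Assumptions: documents are separated by a single EOS token, and any trailing
# partial sequence is dropped.
from typing import Iterable, List

SEQ_LEN = 8192

def pack(token_streams: Iterable[List[int]], eos_id: int) -> List[List[int]]:
    buffer: List[int] = []
    packed: List[List[int]] = []
    for tokens in token_streams:
        buffer.extend(tokens + [eos_id])
        while len(buffer) >= SEQ_LEN:
            packed.append(buffer[:SEQ_LEN])
            buffer = buffer[SEQ_LEN:]
    return packed
```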
In stage 1, we continuously pretrained BLOOM on approximately 30B tokens of multilingual data packed into sequences of length 8192. We refer to the resulting model after stage 1 as BLOOM-LSS, and we evaluate it on SCROLLS below.
The stage 1 data was a mixture of crawled web data and curated high-quality text that covered a subset of the languages in ROOTS [3], BLOOM’s original pretraining dataset. For further details on dataset preprocessing and the makeup of the long sequence dataset mixture, see Appendix A and Appendix B respectively.
In stage 2, we train BLOOM-LSS on the OIG dataset for 1 epoch. Inputs are packed into sequences of length 8192, and we apply completion-only loss masking. We also apply the <human>: and <bot>: chat templates to the input and completion texts respectively.
See Appendix C for further training hyperparameters.
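For concreteness, here is a minimal sketch of how an instruction/response pair can be formatted with the <human>: and <bot>: tags and masked for completion-only loss. The exact whitespace around the tags and the use of the base BLOOM tokenizer are assumptions of the sketch.

```python
# A minimal sketch of chat formatting plus completion-only loss masking.
# The exact whitespace/newlines around the tags are assumptions; only the
# <human>: / <bot>: tags themselves come from this post.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss

def build_example(instruction: str, response: str) -> dict:
    prompt = f"<human>: {instruction}\n<bot>:"
    prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
    completion_ids = tokenizer(" " + response, add_special_tokens=False).input_ids
    input_ids = prompt_ids + completion_ids
    # Completion-only loss: mask every prompt token so the loss is computed
    # only on the <bot> response.
    labels = [IGNORE_INDEX] * len(prompt_ids) + completion_ids
    return {"input_ids": input_ids, "labels": labels}
```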
In stage 3, we train on the combined databricks-dolly-15k and oasst1 datasets for 16 epochs. As in the previous stage, inputs are packed into sequences of length 8192, we apply completion-only loss masking, and we apply the <human>: and <bot>: chat templates. See Appendix C for further training hyperparameters.
BLOOMChat-v2 is the resulting model after stage 3.
All 3 stages were run in mixed-precision bfloat16 with float32 master weights on RDU. We ensured that ALiBi [8] and multi-head attention operations were performed in FP32. These choices are motivated by the observations in our previous blog post on the relationship between ALiBi and lower-precision data formats. Intuitively, we want to reap the throughput benefits of training in lower precision whilst maintaining the integrity of training on long sequences.
To verify our intuitions, we perform a small-scale ablation where we continuously pretrain BLOOM-560M (the smallest variant) on the same long-sequence dataset as in stage 1. We run two experiments: one entirely in BF16, and the other with multi-head attention and ALiBi in FP32 (with the rest of the model’s operations in BF16). In the plots below, we call these models bf16-MHA and fp32-MHA respectively.
In Figure 1, we plot perplexity on a held-out validation dataset, with sequences filtered to be of length 4000 or greater. The x-axis denotes sequence position, and the y-axis denotes the average perplexity of that position across the held-out dataset.
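The sketch below shows one way such a curve can be computed: accumulate the token-level negative log-likelihood at each position over the held-out sequences, then exponentiate the per-position mean. For simplicity, the sketch assumes all sequences are truncated to a common length; the exact aggregation behind Figure 1 may differ slightly.

```python
# A sketch of per-position perplexity: exponentiate the mean token-level NLL at
# each sequence position over a held-out set. Assumes every batch is truncated
# to the same length (seq_len).
import torch
import torch.nn.functional as F

@torch.no_grad()
def per_position_perplexity(model, batches, seq_len: int = 4000) -> torch.Tensor:
    nll_sum = torch.zeros(seq_len - 1)
    num_sequences = 0
    for input_ids in batches:                     # (batch, seq_len) LongTensor
        logits = model(input_ids).logits          # (batch, seq_len, vocab)
        shift_logits = logits[:, :-1, :]          # predict token t+1 from token t
        shift_labels = input_ids[:, 1:]
        nll = F.cross_entropy(
            shift_logits.transpose(1, 2), shift_labels, reduction="none"
        )                                         # (batch, seq_len - 1)
        nll_sum += nll.float().sum(dim=0).cpu()
        num_sequences += input_ids.shape[0]
    return torch.exp(nll_sum / num_sequences)     # perplexity at each position
```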
Our fine-tuned models (orange and green) both clearly outperform the base model in blue. However, at the later sequence positions the bf16-MHA model has higher perplexity, suggesting that positional information becomes less effective at long range when computed in lower precision. Thus, for all large-scale long-sequence training, we fix the precision of multi-head attention and ALiBi to FP32.
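For concreteness, the PyTorch sketch below illustrates this precision policy (it is an illustration, not the RDU kernels): queries, keys, values, and the ALiBi bias are upcast so that attention scores and the softmax run in float32, and the output is cast back to bfloat16 for the rest of the network.

```python
# A minimal sketch of the precision policy: attention scores, ALiBi bias, and
# softmax in float32; inputs and outputs in bfloat16. Illustrative only.
import torch

def fp32_alibi_attention(q, k, v, alibi_bias, causal_mask):
    # q, k, v: (batch, heads, seq, head_dim) in bfloat16
    # alibi_bias: (heads, seq, seq) kept in float32
    # causal_mask: boolean (seq, seq), True where attention is disallowed
    q32, k32, v32 = q.float(), k.float(), v.float()
    scores = q32 @ k32.transpose(-1, -2) / (q32.shape[-1] ** 0.5)
    scores = scores + alibi_bias                  # ALiBi added in float32
    scores = scores.masked_fill(causal_mask, float("-inf"))
    probs = torch.softmax(scores, dim=-1)         # softmax in float32
    out = probs @ v32
    return out.to(torch.bfloat16)                 # back to bf16 for the rest of the network
```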
We evaluate BLOOM and BLOOM-LSS on SCROLLS [9], a collection of long-sequence natural language benchmarks including summarization and question-answering tasks. We use the EleutherAI lm-evaluation-harness to run our evaluations in a zero-shot manner, with a maximum sequence length of 8192. We do not report on ContractNLI, where we observed inconsistencies with publicly reported numbers. BLOOM-LSS outperforms BLOOM on every benchmark, achieving a 170% improvement on average.
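For reference, the snippet below sketches how such a zero-shot SCROLLS run can be launched through the harness's Python API. Task names and the exact API surface vary across harness versions, so treat the identifiers as illustrative rather than the exact commands we ran.

```python
# A hedged sketch of a zero-shot SCROLLS evaluation with lm-evaluation-harness.
# Task names below follow recent harness versions and may differ in older ones.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=bigscience/bloom,dtype=bfloat16,max_length=8192",
    tasks=["scrolls_govreport", "scrolls_qmsum", "scrolls_summscreenfd",
           "scrolls_narrativeqa", "scrolls_qasper", "scrolls_quality"],
    num_fewshot=0,
)
print(results["results"])
```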
We additionally evaluate BLOOM-LSS on SCROLLS with a minimum sequence length of 8,192 and a maximum sequence length of 32,768. In these evaluations, all input sequences are longer than those seen during pretraining for both models.
We show results both with and without positional interpolation (PI), as described in [10]. For results with PI, we interpolate from each model's maximum pretraining sequence length (2048 for BLOOM, 8192 for BLOOM-LSS) to 32,768.
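[10] formulates PI for rotary position embeddings; for an ALiBi model, the analogous operation rescales the relative distances that enter the bias so that a 32,768-token window spans the bias range seen during training. The sketch below shows one way to build such an interpolated bias; it is our illustration of the idea, not necessarily the exact implementation we used.

```python
# A hedged sketch of PI applied to ALiBi: shrink relative distances by
# (pretraining length / target length) before multiplying by the per-head slopes.
import torch

def interpolated_alibi_bias(slopes: torch.Tensor, seq_len: int,
                            train_len: int, target_len: int) -> torch.Tensor:
    # slopes: (num_heads,) per-head ALiBi slopes
    positions = torch.arange(seq_len)
    distances = (positions[None, :] - positions[:, None]).float()  # j - i
    scale = train_len / target_len            # e.g. 8192 / 32768 for BLOOM-LSS
    bias = slopes[:, None, None] * distances[None, :, :] * scale   # (H, S, S)
    return bias  # added to attention scores before the causal mask and softmax
```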
We observe in Figure 3 that BLOOM and BLOOM-LSS both underperform their counterparts with PI. This is particularly the case for summarization benchmarks like GovRep, QMSum, and SumScreenFD. Overall, BLOOM-LSS with PI attains the highest scores on all tasks.
We re-sample multilingual generations from BLOOMChat-v2 to verify that our long sequence finetuning procedure has maintained the multilingual chat abilities of BLOOMChat-v1. Note that these generations are sampled with the same <human>: and <bot>: tags that the model was finetuned with.
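For reference, the sketch below shows how a response can be sampled from BLOOMChat-v2 with these tags via Hugging Face transformers. The generation hyperparameters are illustrative rather than the ones used for the samples in this post, and loading the full 176B checkpoint requires a multi-device setup.

```python
# A minimal sketch of sampling with the <human>: / <bot>: chat tags.
# Sampling parameters here are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "sambanovasystems/BLOOMChat-176B-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype="auto", device_map="auto")

prompt = "<human>: Quelle est la capitale de la France ?\n<bot>:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs, max_new_tokens=256, do_sample=True,
    top_p=0.9, temperature=0.8, repetition_penalty=1.1,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```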
We ran continuous pretraining on top of BLOOM with 2 datasets, DS1 and DS2. We constructed DS1 first to quickly materialize a high-quality dataset in order to start training as soon as possible. While we trained on DS1, we worked on gathering data for DS2. After training on DS1 for 20B tokens, we switched to DS2 and trained for an additional 7B tokens. The pre-processing procedure described in Appendix A.4 was applied to both datasets.
DS1 is an 85B token corpus consisting of the following datasets:
DS2 is a 2.3T token corpus consisting of the following datasets:
DS1 and DS2 differed in languages selected from mC4 due to higher-resource languages taking longer to preprocess. The languages included in DS2 are a strict superset of those included in DS1. We outline the language selection of each dataset in Table 2 below.
Language | DS1 | DS2 |
--- | --- | --- |
Arabic | not included | included |
Bengali | not included | included |
Catalan | included | included |
Spanish | not included | included |
Basque | included | included |
French | not included | included |
Gujarati | included | included |
Hindi | not included | included |
Indonesian | included | included |
Igbo | included | included |
Kannada | included | included |
Malayalam | included | included |
Marathi | included | included |
Nepali | included | included |
Chewa | included | included |
Punjabi | included | included |
Portuguese | not included | included |
Shona | included | included |
Sotho | included | included |
Swahili | included | included |
Tamil | included | included |
Telugu | included | included |
Urdu | included | included |
Vietnamese | not included | included |
Xhosa | included | included |
Yoruba | included | included |
Chinese | not included | included |
Zulu | included | included |
We construct a filtered version of mC4-3.1.0, introduced by Chung et al. [2]. mC4-3.1.0 has already been cleaned and deduplicated to some extent, following procedures described in UniMax [2], mT5 [6], and T5 [7]; we perform the following additional preprocessing steps, largely inspired by those used for the RefinedWeb [4] and SlimPajama [5] corpora.
Our financial text data comes from 3 sources:
Long sequence pretraining took place across 256 sockets of SambaNova SN20 chips in an 8-way tensor-parallel, 32-way data-parallel configuration.
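As a quick sanity check on this layout and the per-replica workload it implies (using the batch size and sequence length from the first table below), purely illustrative arithmetic:

```python
# Illustrative arithmetic only: check the socket count and derive the
# per-replica batch and tokens per optimizer step from the values in this post.
TENSOR_PARALLEL = 8            # ways each model replica is sharded
DATA_PARALLEL = 32             # number of model replicas
assert TENSOR_PARALLEL * DATA_PARALLEL == 256   # sockets used

GLOBAL_BATCH = 2048            # sequences per optimizer step
SEQ_LEN = 8192                 # tokens per sequence
print(GLOBAL_BATCH // DATA_PARALLEL)  # 64 sequences per data-parallel replica
print(GLOBAL_BATCH * SEQ_LEN)         # 16,777,216 tokens per optimizer step
```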
Hyperparameter | Value |
--- | --- |
Learning Rate | 6e-6 |
Learning Rate Schedule | Flat |
Weight Decay | 0.1 |
Max gradient norm | 1.0 |
Sequence Length | 8192 |
Batch size | 2048 |
Hyperparameter | Value |
--- | --- |
Learning Rate | 6e-6 |
Learning Rate Schedule | Cosine Decay |
Learning Rate Warmup | 0 |
Weight Decay | 6e-7 |
Max gradient norm | 1.0 |
Sequence Length | 8192 |
Batch size | 128 |
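A minimal PyTorch sketch of an optimizer and schedule configured with the values in the table above is shown below. The optimizer family (AdamW) and the total number of steps are assumptions; only the learning rate, weight decay, cosine schedule, zero warmup, and gradient-norm clipping come from the table.

```python
# A minimal sketch of the listed settings in PyTorch. AdamW and total_steps
# are assumptions; lr, weight decay, cosine decay, no warmup, and max grad
# norm 1.0 come from the table above.
import torch

def build_optimizer(model: torch.nn.Module, total_steps: int):
    optimizer = torch.optim.AdamW(model.parameters(), lr=6e-6, weight_decay=6e-7)
    # Cosine decay over all steps, with no warmup.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
    return optimizer, scheduler

def clip_gradients(model: torch.nn.Module):
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```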
For the instruction-tuning and alignment stages, we apply completion-only loss masking. In both stages, we also prepend <human>: to the input text and append <bot>: after it, so that the model's completion follows the <bot>: tag. We lower the learning rate compared to BLOOMChat-v1 because extending the sequence length results in a much larger effective batch size. For the original scripts, see the original BLOOMChat data preparation repository here.
[1] Workshop, BigScience, et al. “Bloom: A 176b-parameter open-access multilingual language model.” arXiv preprint arXiv:2211.05100 (2022).
[2] Chung, Hyung Won, et al. “Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining.” arXiv preprint arXiv:2304.09151 (2023).
[3] Laurençon, Hugo, et al. “The bigscience roots corpus: A 1.6 tb composite multilingual dataset.” Advances in Neural Information Processing Systems 35 (2022): 31809-31826.
[4] Penedo, Guilherme, et al. “The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only.” arXiv preprint arXiv:2306.01116 (2023).
[5] Soboleva, Daria, et al. “SlimPajama: A 627B token, cleaned and deduplicated version of RedPajama.” URL https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama.
[6] Xue, Linting, et al. “mT5: A massively multilingual pre-trained text-to-text transformer.” arXiv preprint arXiv:2010.11934 (2020).
[7] Raffel, Colin, et al. “Exploring the limits of transfer learning with a unified text-to-text transformer.” The Journal of Machine Learning Research 21.1 (2020): 5485-5551.
[8] Press, Ofir, Noah A. Smith, and Mike Lewis. “Train short, test long: Attention with linear biases enables input length extrapolation.” arXiv preprint arXiv:2108.12409 (2021).
[9] Shaham, Uri, et al. “Scrolls: Standardized comparison over long language sequences.” arXiv preprint arXiv:2201.03533 (2022).
[10] Chen, Shouyuan, et al. “Extending context window of large language models via positional interpolation.” arXiv preprint arXiv:2306.15595 (2023).