While large language models such as Llama 2 have gained widespread popularity, there remains a wide gap in their capabilities between English and other languages. To close this gap, models like BLOOM [42], XGLM [43], and AYA [6] have been trained to be multilingual; however, their performance in other languages still falls short of state-of-the-art standards. Consequently, the majority of the world is left without access to high-quality open-source AI models in their native tongue. This work shows that English-centric language models such as Llama 2 can be adapted to any new language and outperform all existing open-source multilingual models on a majority of benchmarks. Additionally, we develop a recipe for aligning the adapted checkpoints to respond effectively to user queries in the adapted language, leveraging human preference data. Our results demonstrate a preference for our models' responses over open-source alternatives, and we welcome everyone to try these models by visiting SambaLingo-chat-space.
We measure the models' capability in the new languages with a mix of canonical multilingual NLP benchmarks, including perplexity, translation, question answering, text classification, and natural language understanding. We do not include Aya-101 [6] as a baseline in these benchmarks because it is an instruction-tuned checkpoint and many of these benchmarks appear in its training data (contamination). To test the English capability of the models after bilingual training, we also evaluate them on the OpenLLM Leaderboard [23]. Lastly, for the chat versions of the models, we test their ability on prompt datasets written in the native language, using GPT-4 as a judge.
We report perplexity on a holdout set of the training text, on Wikipedia [21], and on a random sample of MC4 [22]. Perplexity on all three datasets is approximately the same; below we show results on the held-out training data. All evaluation is done with EleutherAI's lm-eval-harness [13]. Our models achieve state-of-the-art perplexity compared to every open-source baseline.
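For readers who want to reproduce this style of measurement, here is a minimal sketch of held-out perplexity evaluation with Hugging Face transformers. The checkpoint name and the per-document chunking are illustrative assumptions; the reported numbers come from lm-eval-harness rather than this snippet.

```python
# Minimal perplexity sketch (illustrative; reported numbers use lm-eval-harness).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "sambanovasystems/SambaLingo-Arabic-Base"  # example checkpoint (assumption)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

def perplexity(documents, max_length=4096):
    """Token-weighted perplexity over a list of held-out documents."""
    total_nll, total_tokens = 0.0, 0
    for text in documents:
        ids = tokenizer(text, return_tensors="pt",
                        truncation=True, max_length=max_length).input_ids
        with torch.no_grad():
            # Passing labels=ids returns the mean next-token cross-entropy.
            loss = model(ids, labels=ids).loss
        total_nll += loss.item() * ids.shape[1]
        total_tokens += ids.shape[1]
    return math.exp(total_nll / total_tokens)
```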
All SambaLingo expert models are bilingual (English and the expert language), and they exhibit state-of-the-art performance on translation tasks. We evaluate our base pretrained checkpoints on the FLORES-200 dataset [15] with 8-shot evaluation, using the '{IN}={OUT}' prompt recommended by Zhu et al. [14].
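As an illustration of this prompt format (our reading of the '{IN}={OUT}' template; the exact exemplar selection and separators used for the reported scores are assumptions), an 8-shot translation prompt can be assembled like this:

```python
def build_translation_prompt(exemplars, source_sentence):
    """exemplars: list of (source, target) FLORES sentence pairs (8 for 8-shot).
    The model is expected to continue after the final '=' with its translation."""
    lines = [f"{src}={tgt}" for src, tgt in exemplars]
    lines.append(f"{source_sentence}=")
    return "\n".join(lines)
```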
We measure the models' ability to answer multiple-choice questions using the BELEBELE dataset [16]. We perform 3-shot evaluation on our base checkpoints with the default prompt from the BELEBELE GitHub repo [17] and select the answer with the lowest perplexity.
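A minimal sketch of "select the answer with the lowest perplexity": each candidate answer is appended to the few-shot prompt and scored by its average negative log-likelihood under the model, and the lowest-scoring candidate wins. lm-eval-harness implements its own variant of this scoring (and handles tokenization boundary effects more carefully); the helper below is illustrative.

```python
import torch

def pick_answer(model, tokenizer, few_shot_prompt, candidates):
    """Score each candidate continuation by its average NLL and return the best one."""
    scores = []
    for cand in candidates:
        prompt_ids = tokenizer(few_shot_prompt, return_tensors="pt").input_ids
        full_ids = tokenizer(few_shot_prompt + cand, return_tensors="pt").input_ids
        labels = full_ids.clone()
        labels[:, : prompt_ids.shape[1]] = -100  # ignore prompt tokens in the loss
        with torch.no_grad():
            loss = model(full_ids, labels=labels).loss  # mean NLL over answer tokens
        scores.append(loss.item())
    return candidates[scores.index(min(scores))]
```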
Text classification asks the model to categorize a piece of text. We evaluate the base pretrained language experts on the SIB-200 benchmark using the prompt recommended by Lin et al. [18] with a 3-shot context, and select the answer with the lowest perplexity.
Natural language understanding benchmarks show the model's ability to grasp the meaning, context, and nuances of human language. We evaluate zero-shot on the XNLI [47], XWinograd [49], PAWS-X [51], XCOPA [48], and XStoryCloze [43] multilingual benchmarks from lm-eval-harness [13].
| XStoryCloze | mGPT-13B | BLOOM | XGLM | Open Source Expert | SambaLingo Expert |
|---|---|---|---|---|---|
| Arabic | 0.516 | 0.585 | 0.562 | 0.633 | 0.662 |
| Russian | 0.594 | 0.527 | 0.562 | 0.690 | 0.717 |

| XWinograd | mGPT-13B | BLOOM | XGLM | Open Source Expert | SambaLingo Expert |
|---|---|---|---|---|---|
| Japanese | 0.578 | 0.589 | 0.650 | 0.776 | 0.766 |
| Russian | 0.600 | 0.571 | 0.632 | 0.667 | 0.692 |

| XCOPA | mGPT-13B | BLOOM | XGLM | Open Source Expert | SambaLingo Expert |
|---|---|---|---|---|---|
| Thai | 0.528 | 0.554 | 0.594 | 0.606 | 0.614 |
| Turkish | 0.568 | 0.512 | 0.584 | 0.558 | 0.694 |

| PAWS-X | mGPT-13B | BLOOM | XGLM | Open Source Expert | SambaLingo Expert |
|---|---|---|---|---|---|
| Japanese | 0.452 | 0.454 | 0.520 | 0.505 | 0.468 |

| XNLI | mGPT-13B | BLOOM | XGLM | Open Source Expert | SambaLingo Expert |
|---|---|---|---|---|---|
| Thai | 0.392 | 0.349 | 0.437 | 0.430 | 0.447 |
| Arabic | 0.334 | 0.338 | 0.334 | 0.363 | 0.336 |
| Russian | 0.454 | 0.426 | 0.334 | 0.498 | 0.353 |
| Turkish | 0.387 | 0.350 | 0.462 | 0.384 | 0.339 |
| Bulgarian | 0.458 | 0.394 | 0.449 | 0.338 | 0.428 |
Open Source Expert Baselines: Japanese: ELYZA-japanese-Llama-2-7b [33], Thai: typhoon-7b [34], Arabic: jais-13b [36], Hungarian: NYTK/PULI-GPTrio [35], Russian: saiga_mistral_7b_merged [37], Turkish: TURNA [38], Bulgarian: mGPT-1.3B-bulgarian [39], Serbian: sr-gpt2 [40], Slovenian: sl-gpt2 [41]
To test how well our models retain their English capabilities after being adapted to new languages, we evaluate them on the standard OpenLLM Leaderboard benchmarks [23], following the same few-shot settings as the OpenLLM Leaderboard repository. We find some regressions compared to base Llama 2, but our models still outperform existing multilingual model baselines in English.
To test SambaLingo-Chat's ability to generate high-quality responses to real user prompts, we measure the win rate with GPT-4 as a judge [19][20]. We test SambaLingo-Chat on Arabic, Japanese, and Turkish, and evaluate the win rate against the best open-source models for those languages. We did not cherry-pick the prompts or models we present in this section.
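A hedged sketch of the win-rate computation: for each prompt the judge sees both responses and picks a winner, and we report the fraction of comparisons our model wins. The position swapping and the tie-counts-as-half convention below are common practice and are assumptions here, not a claim about the exact protocol; the `judge` callable that queries GPT-4 is left abstract.

```python
def win_rate(prompts, ours, theirs, judge):
    """judge(prompt, response_a, response_b) -> "A", "B", or "tie".
    Each pair is judged twice with positions swapped to reduce position bias
    (an assumed convention, not necessarily the exact setup used)."""
    wins, total = 0.0, 0
    for prompt, a, b in zip(prompts, ours, theirs):
        for first, second, our_slot in [(a, b, "A"), (b, a, "B")]:
            verdict = judge(prompt, first, second)
            if verdict == our_slot:
                wins += 1.0
            elif verdict == "tie":
                wins += 0.5  # ties counted as half a win (assumption)
            total += 1
    return 100.0 * wins / total
```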
For Arabic, we compare against Aya-101 [6], Jais-13b-chat [7], and BLOOMChat-v1 [8]. We use the prompts from x-self-instruct-seed-32 [10] and xOA22 [11]. SambaLingo-Arabic-Chat achieves an 87.96% win rate against Jais-13b-chat, a 99.06% win rate against Aya-101, and a 68.52% win rate against BLOOMChat-v1.
For Japanese, we compare against ELYZA-japanese-Llama-2-7b-instruct [5]. We randomly sample 100 prompts from the training set of the aya_dataset [9]. SambaLingo-Japanese-Chat achieves a 53.5% win rate in this comparison.
All of our models are continuously pretrained from the Llama 2 base model [12]. We run continuous pre-training for a total of 400 billion tokens across all the language experts, accelerated by SambaNova’s RDUs [24].
The Llama tokenizer is English-centric, which means it does not efficiently tokenize text in other languages. Previous work [25, 27] has shown that continuously pretrained models can learn newly added tokens. A tokenizer with added tokens packs more text into fewer tokens, which gives our models improved training/inference efficiency and a longer effective sequence length. The plot below shows how the tokenizer fertility (average number of tokens per "word") [28] improves as more tokens are added in each language. In some languages, such as Thai, it improves by as much as 4x.
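A small sketch of how fertility can be measured. The whitespace word splitting below is an assumption and is a poor fit for unsegmented scripts such as Thai, where a proper word segmenter would be used instead.

```python
from transformers import AutoTokenizer

def fertility(tokenizer_name, documents):
    """Average number of tokens per whitespace-delimited word."""
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    n_tokens = sum(len(tok.tokenize(doc)) for doc in documents)
    n_words = sum(len(doc.split()) for doc in documents)
    return n_tokens / n_words

# Compare the original Llama 2 tokenizer against an expanded one, e.g.:
# fertility("meta-llama/Llama-2-7b-hf", corpus)
# fertility("sambanovasystems/SambaLingo-Thai-Base", corpus)
```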
We run a training ablation with the above methods for 20 million tokens and find that initializing each new token's embedding as the average of its sub-word embeddings gives the lowest training loss, so we initialize all of our new token embeddings using this method. We further find that it helps to initialize the LM head embeddings in the same fashion, as Llama 2 does not tie its token embedding and LM head weights.
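A minimal sketch of this initialization, assuming an expanded tokenizer has already been built (the tokenizer path is hypothetical, and the exact token-to-text round trip in our training code may differ):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

old_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
new_tok = AutoTokenizer.from_pretrained("path/to/expanded-tokenizer")  # hypothetical path
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

old_vocab = len(old_tok)
model.resize_token_embeddings(len(new_tok))
in_emb = model.get_input_embeddings().weight.data
out_emb = model.get_output_embeddings().weight.data  # Llama 2 does not tie these

with torch.no_grad():
    for tok_id in range(old_vocab, len(new_tok)):
        text = new_tok.decode([tok_id])
        sub_ids = old_tok(text, add_special_tokens=False).input_ids
        if sub_ids:  # average the original tokenizer's sub-word embeddings
            in_emb[tok_id] = in_emb[sub_ids].mean(dim=0)
            out_emb[tok_id] = out_emb[sub_ids].mean(dim=0)
```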
All pretraining is done on the CulturaX dataset [26]. We mix the data to be 75% from the language we are adapting to and 25% English, as suggested by Csaki et al. [25]. We pack the data into sequences of length 4096 and ensure that, when learning a token, we only attend to previous tokens within the same text document. We train with a global batch size of 1024, a maximum learning rate of 1e-4 with cosine decay, a warmup ratio of 0.01, and a weight decay of 0.1. We train each expert for up to 4 epochs [32], but as the amount of data varies across languages, most training runs do not reach 4 epochs.
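To make the packing constraint concrete, here is a hedged sketch of a block-diagonal causal mask for one packed sequence. It illustrates the idea rather than the kernel-level implementation used in training.

```python
import torch

def packed_causal_mask(doc_lengths):
    """Additive attention mask for documents packed into one sequence:
    a token can only attend to earlier tokens from its own document."""
    seq_len = sum(doc_lengths)
    mask = torch.full((seq_len, seq_len), float("-inf"))
    start = 0
    for n in doc_lengths:
        # causal (lower-triangular) attention inside this document's block
        mask[start:start + n, start:start + n] = torch.triu(
            torch.full((n, n), float("-inf")), diagonal=1)
        start += n
    return mask  # 0 where attention is allowed, -inf where it is blocked

# e.g. packed_causal_mask([3, 2]): tokens of the second document
# cannot attend to any token of the first document.
```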
The alignment phase follows the recipe of Zephyr-7B [1] and comprises two stages: supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) [2].
The SFT phase is done on the ultrachat_200k dataset [3] mixed with a Google-translated version of ultrachat_200k. We train for one epoch with a global batch size of 512 and a maximum sequence length of 2048 tokens, using a linear-decay learning rate of 2e-5 and 10% warmup.
The DPO phase was done on the ultrafeedback dataset [4] and cai-conversation-harmless dataset [50], mixed with 10% of the data Google translated. It was trained with global batch size 32 and for three epochs. We used a linear decay learning rate of 5e-7, 10% warmup and β=0.1 as the regularization factor for DPO.
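For reference, the DPO objective from Rafailov et al. [2], where β is the regularization strength set to 0.1 above, π_θ is the policy being trained, π_ref is the SFT reference model, and (x, y_w, y_l) is a prompt with its preferred and rejected responses:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
$$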
We extend our heartfelt gratitude to the open-source AI community; this endeavor would not have been achievable without open source. SambaNova embraces the open-source community and aspires to actively contribute to this initiative.
We would like to give special thanks to the following groups: