
SambaLingo hits 15,000+ downloads, now integrated with Samba-CoE-v0.2

by SambaNova ML Engineering

April 8, 2024

SambaLingo, our cutting-edge multilingual language expert series, has surpassed 15,000 downloads and is now integrated into Samba-CoE-v0.2 (https://fast.snova.ai/), achieving an inference speed of 280 tokens/s. SambaLingo provides high-quality, efficient multilingual experts to the open-source community.

One user commented:

"Thank you very much for providing such a great model. Could you consider training similar models for Macedonian and Albanian languages as well?"

Another user stated:

"I can even say that this is currently the best Russian model."

We have also received congratulations from renowned experts in the field of language technology:

"As a leading researcher in the field, please allow me to congratulate you on the extraordinary language models you have recently released. I am deeply touched by the efforts and dedication you have put into supporting multiple languages, including Hungarian."

These affirmations from esteemed members of the community are the strongest recognition of SambaLingo and the driving force behind our progress.

SambaLingo's innovation comes from our methodology. We have conducted in-depth research on how to efficiently adapt pre-trained language experts to low-resource languages, and the approach has worked well in practice. By optimizing tokenizer efficiency, adding new target-language vocabulary, and carefully designing data mixing schemes, we have overcome challenges such as catastrophic forgetting and achieved performance superior to existing open-source models.
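To make the tokenizer-efficiency point concrete, here is a minimal sketch of how added target-language vocabulary can reduce the number of tokens per word (often called fertility). It is an illustration only, not the SambaLingo tokenizer: the greedy longest-match segmenter, the vocabularies, and the helper names `tokenize` and `fertility` are all hypothetical.

```python
def tokenize(word, vocab):
    """Greedy longest-match segmentation (toy stand-in for a real tokenizer).

    Characters that match no vocabulary entry become single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens


def fertility(text, vocab):
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    total = sum(len(tokenize(w, vocab)) for w in words)
    return total / len(words)


# Hypothetical base vocabulary with no Hungarian-specific entries.
base_vocab = {"the", "ing", "er", "on"}
# Hypothetical target-language tokens added on top of the base vocabulary.
extended_vocab = base_vocab | {"köszön", "öm", "szép", "en"}

sample = "köszönöm szépen"
base_f = fertility(sample, base_vocab)
ext_f = fertility(sample, extended_vocab)
print(f"base: {base_f:.1f} tokens/word, extended: {ext_f:.1f} tokens/word")
# prints "base: 7.0 tokens/word, extended: 2.0 tokens/word"
```

In this toy setup the same text needs fewer tokens once target-language entries exist, which means more text fits into a fixed context window and fewer decoding steps are needed per word. A real adaptation would also have to initialize embeddings for the new tokens and mix in original-language data during continued training, which this sketch does not cover.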

We have also developed a set of 70B-parameter SambaLingo multilingual experts built on top of Llama 2 70B, which we expect to deliver significant gains in task performance. These experts will be released as part of the Samba-1 Composition of Experts, as well as standalone models. We’ll publish a detailed technical report soon to share our research results and learnings with the community. We hope this will advance SambaLingo's development and bring high-quality language model services to more users.

Acknowledgements

We extend our heartfelt gratitude to the open-source AI community; this endeavor would not have been achievable without open source. SambaNova embraces the open-source community and aspires to actively contribute to this initiative.

We would like to give a special thanks to the following groups:

Meta for open sourcing Llama 2 and the FLORES-200 dataset

Nguyen et al. for open sourcing the CulturaX dataset

CohereAI for releasing AYA-101 and open sourcing a multilingual instruction tuning dataset

EleutherAI for their open source evaluation framework

The Hugging Face H4 team for open sourcing the Zephyr training recipe and the alignment handbook repo
