Blog

SambaCoder-nsql-Llama-2-70B model

Written by SambaNova and Numbers Station ML | February 13, 2024

Using a Text-to-SQL generative AI solution can have a significant impact on enterprise organizations. This will transform how employees are able to extract information from SQL databases and will be pivotal in data analytics, simplifying query processes, empowering analysts with enhanced accessibility, facilitating collaboration among technical and non-technical teams, aligning seamlessly with self-service analytics trends, and elevating the overall efficiency of data analysis workflows.

Numbers Station and SambaNova have been looking into building LLMs that can help with these tasks to unlock developer productivity. We are excited to open source and release the artifacts of this collaboration - a SambaCoder-nsql-llama2-70B model that surpasses GPT-4! The model reaches 78.1% execution accuracy on the spider test set, which surpasses GPT-4 (76.2%). SambaCoder-nsql-Llama-2-70B was trained on RDU with mixed-precision bfloat16 with all open source datasets. We hope this work can encourage further exploration in the open source community of training models better fit for enterprise settings.

You can find a link to the model on Hugging Face and a live space for interacting with the model. If you have any further questions, feel free to ask via SambaNova’s public Discord server.

Training methodology

SambaCoder-nsql-Llama-2-70B followed a 2-stage fine tuning procedure

  1. Continuous pretraining
  2. Fine tuning

The continuous pretraining was done on top of llama2-70B model [7] . We used the SQL split of the stack dataset [8]. We trained with a max sequence length of 4096 and global batch size of 256. We used fixed learning rate of 1e-5 and trained the model for 2 epochs. 

The finetuning was done on the Numbers Station text2sql dataset [1]. We trained for 10 epochs with sequence length of 4096 and global batch size of 64. The learning rate is 1e-5 with cosine decay to 10% of the peak LR. Here, we only performed back-propagation on the output of each sample, but not the instruction part.

In both phases, we packed samples into the same min-batch if possible and dropped all the samples that don’t fit in the 4096 context window size (i.e. the greedy drop method here [2]). Also, we provided customized attention mask to avoid different samples packed in the same mini-batch to attend to each other. After data preprocessing, this results in 324k samples in phase 1 and 43k samples in phase 2.

Evaluation methodology

Quantitative

We evaluated the model on the three benchmarks: spider [5], bird [3], and sql-eva [4]. Note that though our model surpasses GPT-4 on the spider dataset, there is still a gap in the other two benchmarks. We left it for open questions to close the gap with open source effort.

  SambaCoder-nsql-Llama-2-70B GPT-4 GPT-3.5 Turbo
Spider 78.1% 76.2% 72.8%
Bird 28.16% 43.65% 37.22%
Text-sql
(average across tasks)
48.08% 81.8% 66.08%

Qualitative

To test our model’s performance in real-world use cases, we also hosted a human evaluation session where people can send requests to the model online. We used the SQLite Sample Database provided in [6]. Here we showcase some queries and responses we collected in the session.

Tables Queries Generations Comments
invoices What is the ratio of the date that has the most invoices to the date that has the least invoices SELECT (SELECT COUNT(*) FROM "invoices" GROUP BY [InvoiceDate] ORDER BY COUNT(*) DESC LIMIT 1) / (SELECT COUNT(*) FROM "invoices" GROUP BY [InvoiceDate] ORDER BY COUNT(*) LIMIT 1) Correct, it understands how to generate a ratio statement
customers Provide the customerID, First Name, Last name and "Domestic" if the country is "USA" else "Foreign" from customers table SELECT [CustomerId], [FirstName], [LastName], CASE WHEN [Country] = 'USA' THEN 'Domestic' ELSE 'Foreign' END AS "Location" FROM "customers" Correct
artists, album Provide the name of the artist who do not have any albums in the albums table SELECT [Name] FROM "artists" EXCEPT SELECT T2.Name FROM "albums" AS T1 JOIN "artists" AS T2 ON T1.ArtistId = T2.ArtistId Correct, it leverages less common operator EXCEPT
tracks Find the tracks whose names contain: zero or more characters, followed by "Br", followed by characters, followed by "wn", and followed by zero or more characters. SELECT [Name] FROM "tracks" WHERE [Name] LIKE '%Br%wn%' Correct, it generates the correct pattern based on instruction
artists, album, tracks Provide the name of the artist for any album that has a track longer than 5 minutes SELECT T2.Name FROM "albums" AS T1 JOIN "artists" AS T2 ON T1.ArtistId = T2.ArtistId JOIN "tracks" AS T3 ON T1.AlbumId = T3.AlbumId WHERE T3.Milliseconds > 5 Wrong, didn’t convert 5 minutes to 300000 milliseconds

 

Conclusion

This blogpost demonstrates that by following the outlined methodology, one can train an industry-grade SQL model utilizing an open source model and datasets. We present this methodology and resulting model as a resource, contributing to the open source community’s ongoing efforts in this domain.

Citations

[1] https://huggingface.co/datasets/NumbersStation/NSText2SQL

[2] https://github.com/sambanova/generative_data_prep

[3] https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/bird/llm

[4] https://github.com/defog-ai/sql-eval

[5] https://github.com/taoyds/test-suite-sql-eval/tree/master 

[6] https://www.sqlitetutorial.net/sqlite-sample-database/ 

[7] https://arxiv.org/abs/2307.09288 

[8] https://huggingface.co/datasets/bigcode/the-stack