Text2SQL fine-tuning and metadata

I am currently working on a text2sql project. To avoid sending a large block of metadata as context in every prompt, I created a set of question-and-answer pairs as fine-tuning training data. Each training example included the full SQL metadata alongside the question, with the correct SQL as the answer.
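As a rough illustration only (toy schema, question, and file name, not my actual data), each training example was shaped something like this chat-style JSONL record, with the metadata embedded in the prompt:

```python
import json

# Toy schema for illustration; in the real project this was the full
# 50K-60K-token semantic layer, repeated in every training example.
SCHEMA = (
    "orders(order_id, customer_id, order_date, total_amount); "
    "customers(customer_id, name, region)"
)

example = {
    "messages": [
        {"role": "system", "content": f"Translate the user's question into SQL. Schema: {SCHEMA}"},
        {"role": "user", "content": "What was the total revenue by region in 2023?"},
        {
            "role": "assistant",
            "content": (
                "SELECT c.region, SUM(o.total_amount) AS revenue "
                "FROM orders o JOIN customers c ON o.customer_id = c.customer_id "
                "WHERE o.order_date >= '2023-01-01' AND o.order_date < '2024-01-01' "
                "GROUP BY c.region;"
            ),
        },
    ]
}

# Fine-tuning files are typically one JSON object per line (JSONL).
with open("train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```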

My assumption was that, while learning the SQL patterns for the given questions and metadata, the model would “remember” the metadata, so that when I use the fine-tuned model for inference I would not have to send the metadata every time.

But that doesn’t seem to be the case. Unless I supply the SQL metadata, the generated SQL is not usable: it uses generic names for objects. The LLM appears to learn the patterns of SQL but doesn’t remember the metadata.

So the question is: is this the expected behavior? I tried fine-tuning Mistral and Llama 3, but neither of them gave the expected results.

Before trying OpenAI, I wanted to get a little more insight.

Hi there and welcome to the Forum!

Yes, what you are observing is expected behaviour. During fine-tuning the model does not actually learn facts or knowledge; it merely picks up certain patterns. Hence, when you consume your fine-tuned model you do indeed have to include the metadata every time, just like you did in the training data.


Thanks.

It is one of those cases where my full metadata, built as a detailed semantic layer, comes to about 50K-60K tokens. I am looking at millions of tokens per day per user, so it looks like I need to use some mini models to keep costs down. But the SQL generation capabilities of these mini models are not that great…


Do you mind sharing how you solved it in the end?

It turned out exactly as the experts here said: while fine-tuning was able to change the writing style of the SQL to a very small extent, it didn’t really help. So what I did was create a large library of questions and SQLs (a library of question-SQL pairs which I call cached knowledge), pre-process each new question using a vector model, and categorize the similarity scores into four groups:

  1. Near-100 similarity score: re-use the cached SQL as-is.
  2. Very high similarity: rewrite the SQL with the LLM by sending the matching question-SQL pair along with the new question.
  3. High similarity: rewrite. Extract the table names from the top 1-3 matching SQLs using the LLM (you could use a Python SQL-parser library for this, but my experience was that those libraries, while good, are not perfect), pull the metadata only for those tables (this ends up at 2K-3K tokens instead of 60K), and send that metadata along with the user question.
  4. Low similarity: send all of the metadata. With LLMs themselves now supporting prompt caching, the cost has come down even for this case, but the response takes around 20 seconds. The hope is that over time we will expand and optimize the question-SQL library…

The reason I am using vague terms like “high” and “very high” here is that you have to play around with these numbers; you may not even need all four thresholds that I mentioned. A rough sketch of the routing is below.
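In case it helps, here is a minimal sketch of the routing idea, not the actual project code. It assumes a sentence-transformers embedding model and uses sqlglot for table extraction (whereas my flow used the LLM for that step); the threshold values, model name, and library structure are illustrative placeholders, not tuned numbers:

```python
import sqlglot
from sqlglot import exp
from sentence_transformers import SentenceTransformer

# Illustrative thresholds; tune them against your own question library.
REUSE, REWRITE_FROM_PAIR, PARTIAL_METADATA = 0.98, 0.92, 0.80

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model will do

def extract_tables(sql: str) -> set[str]:
    """Pull table names out of a SQL statement with sqlglot."""
    return {t.name for t in sqlglot.parse_one(sql).find_all(exp.Table)}

def route(question: str, library: list[dict]) -> dict:
    """library entries look like {"question": str, "sql": str, "embedding": 1-D array}.

    Embeddings are assumed to be L2-normalized, so a dot product is cosine similarity.
    """
    q = model.encode(question, normalize_embeddings=True)
    ranked = sorted(library, key=lambda e: float(q @ e["embedding"]), reverse=True)
    best, score = ranked[0], float(q @ ranked[0]["embedding"])

    if score >= REUSE:               # 1. near-identical question: reuse the cached SQL
        return {"strategy": "reuse", "sql": best["sql"]}
    if score >= REWRITE_FROM_PAIR:   # 2. very similar: let the LLM adapt the matching pair
        return {"strategy": "rewrite_from_pair", "example": best}
    if score >= PARTIAL_METADATA:    # 3. similar: send metadata only for the tables involved
        tables = set().union(*(extract_tables(e["sql"]) for e in ranked[:3]))
        return {"strategy": "partial_metadata", "tables": tables}
    return {"strategy": "full_metadata"}  # 4. low similarity: fall back to the full semantic layer
```

The caller would then build the prompt according to the returned strategy: cached SQL straight back to the user, a one-shot rewrite prompt, a prompt with only the selected tables’ metadata, or the full semantic layer.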


Thanks for sharing this valuable experience; it is indeed super helpful for building SQL-related LLM applications. Hope you don’t mind me asking: what were the results of doing all this (I can imagine it is very time-consuming)? How much does the model’s response improve?

The LLM’s responses are much better when I send smaller amounts of metadata containing only what is relevant. Since the process changed, the overall project results (quality and accuracy of the generated SQL, cost, and performance) have improved a lot, provided the metadata we send is precise.

When we limit the metadata, or the number of tables whose metadata is sent to the LLM, accuracy is better and the cost is lower. At the same time, we are no longer sending all of our questions to the LLM.

Yes, that is exactly my experience too.

@kumarmb Thank you for this great suggestion. Would it be possible to share sample code? I am doing the same POC and would like to play around with it. Thanks for your help.

It’s going to be tough since it is on my work laptop. Let me see whether I can re-create it as a personal project and then share it.