Text2SQL fine-tuning and metadata

I am currently working on a text2sql project. To avoid sending a large block of metadata as context in every prompt, I created a set of question-and-answer pairs as fine-tuning training data. Each training example included the full SQL metadata alongside the question, with the correct SQL as the answer.
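As a rough illustration only (toy schema, question, and file name, not my actual data), each training example was shaped something like this chat-style JSONL record, with the metadata embedded in the prompt:

```python
import json

# Toy schema for illustration; in the real project this was the full
# 50K-60K-token semantic layer, repeated in every training example.
SCHEMA = (
    "orders(order_id, customer_id, order_date, total_amount); "
    "customers(customer_id, name, region)"
)

example = {
    "messages": [
        {"role": "system", "content": f"Translate the user's question into SQL. Schema: {SCHEMA}"},
        {"role": "user", "content": "What was the total revenue by region in 2023?"},
        {
            "role": "assistant",
            "content": (
                "SELECT c.region, SUM(o.total_amount) AS revenue "
                "FROM orders o JOIN customers c ON o.customer_id = c.customer_id "
                "WHERE o.order_date >= '2023-01-01' AND o.order_date < '2024-01-01' "
                "GROUP BY c.region;"
            ),
        },
    ]
}

# Fine-tuning files are typically one JSON object per line (JSONL).
with open("train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```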

My assumption was that, while learning the SQL patterns for the given questions and metadata, the model would “remember” the metadata, so that when I use the fine-tuned model for inference I would not have to send the metadata every time.

But that doesn’t seem to be the case. Unless I supply the SQL metadata, the generated SQL is not usable: it uses generic names for objects. The LLM appears to learn the patterns of SQL but doesn’t remember the metadata.

So the question is: is this the expected behavior? I tried fine-tuning Mistral and Llama 3, but neither of them gave the expected results.

Before trying OpenAI, I wanted to get a little more insight.

Hi there and welcome to the Forum!

Yes, what you are observing is expected behaviour. During fine-tuning the model does not actually learn facts or knowledge; it merely picks up certain patterns. Hence, when you consume your fine-tuned model you do indeed have to include the metadata every time, just like you did in the training data.


Thanks.

It is one of those cases where my full metadata, built as a detailed semantic layer, comes to about 50K-60K tokens. I am looking at millions of tokens per day per user, so it looks like I need to use some mini models to keep costs down. But the SQL generation capabilities of these mini models are not that great…


Do you mind sharing how you solved it in the end?

It turned out exactly as the experts here said: while fine-tuning was able to change the writing style of the SQL to a very small extent, it didn’t really help. So what I did was create a large library of questions and SQLs (a library of question-SQL pairs which I call cached knowledge), pre-process each new question using a vector model, and categorize the similarity scores into four groups:

  1. Near-100 similarity score: re-use the cached SQL as-is.
  2. Very high similarity: rewrite the SQL with the LLM by sending the matching question-SQL pair along with the new question.
  3. High similarity: rewrite. Extract the table names from the top 1-3 matching SQLs using the LLM (you could use a Python SQL-parser library for this, but my experience was that those libraries, while good, are not perfect), pull the metadata only for those tables (this ends up at 2K-3K tokens instead of 60K), and send that metadata along with the user question.
  4. Low similarity: send all of the metadata. With LLMs themselves now supporting prompt caching, the cost has come down even for this case, but the response takes around 20 seconds. The hope is that over time we will expand and optimize the question-SQL library…

The reason I am using vague terms like “high” and “very high” here is that you have to play around with these numbers; you may not even need all four thresholds that I mentioned. A rough sketch of the routing is below.
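In case it helps, here is a minimal sketch of the routing idea, not the actual project code. It assumes a sentence-transformers embedding model and uses sqlglot for table extraction (whereas my flow used the LLM for that step); the threshold values, model name, and library structure are illustrative placeholders, not tuned numbers:

```python
import sqlglot
from sqlglot import exp
from sentence_transformers import SentenceTransformer

# Illustrative thresholds; tune them against your own question library.
REUSE, REWRITE_FROM_PAIR, PARTIAL_METADATA = 0.98, 0.92, 0.80

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model will do

def extract_tables(sql: str) -> set[str]:
    """Pull table names out of a SQL statement with sqlglot."""
    return {t.name for t in sqlglot.parse_one(sql).find_all(exp.Table)}

def route(question: str, library: list[dict]) -> dict:
    """library entries look like {"question": str, "sql": str, "embedding": 1-D array}.

    Embeddings are assumed to be L2-normalized, so a dot product is cosine similarity.
    """
    q = model.encode(question, normalize_embeddings=True)
    ranked = sorted(library, key=lambda e: float(q @ e["embedding"]), reverse=True)
    best, score = ranked[0], float(q @ ranked[0]["embedding"])

    if score >= REUSE:               # 1. near-identical question: reuse the cached SQL
        return {"strategy": "reuse", "sql": best["sql"]}
    if score >= REWRITE_FROM_PAIR:   # 2. very similar: let the LLM adapt the matching pair
        return {"strategy": "rewrite_from_pair", "example": best}
    if score >= PARTIAL_METADATA:    # 3. similar: send metadata only for the tables involved
        tables = set().union(*(extract_tables(e["sql"]) for e in ranked[:3]))
        return {"strategy": "partial_metadata", "tables": tables}
    return {"strategy": "full_metadata"}  # 4. low similarity: fall back to the full semantic layer
```

The caller would then build the prompt according to the returned strategy: cached SQL straight back to the user, a one-shot rewrite prompt, a prompt with only the selected tables’ metadata, or the full semantic layer.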


Thanks for sharing this valuable experience; it is indeed super helpful for building SQL-related LLM applications. Hope you don’t mind me asking: what were the results of doing all this (I can imagine it is very time-consuming)? How much does the model’s response improve?

The LLM’s responses are much better when I send smaller amounts of metadata containing only what is relevant. Since the process changed, the overall project results (quality and accuracy of the generated SQL, cost, and performance) have improved a lot, provided the metadata we send is precise.

When we limit the metadata, or the number of tables whose metadata is sent to the LLM, accuracy is better and the cost is lower. At the same time, we are no longer sending all of our questions to the LLM.

Yes, that is exactly my experience too.

@kumarmb Thank you for this great suggestion. Would it be possible to share sample code? I am doing the same POC and would like to play around with it. Thanks for your help.

It’s going to be tough since it is on my work laptop. Let me see whether I can re-create it as a personal project and then share it.