Generate SQL queries combining prompt engineering and fine-tuning


My objective is to automate the generation of SQL queries when prompted with questions from business users.

To do so, I have started using ChatGPT (and, equivalently, the openai.ChatCompletion.create function) to provide information about the tables and the steps to follow when a business request arrives (for example: which columns are cumulative and which are not, how to decide which column to use when the business user does not specify one, etc.).

Once the instructions are clear, I also add some examples, each providing the question, the steps to follow to remove ambiguity, and then the SQL query.
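To make the setup concrete, here is a minimal sketch of how I assemble the prompt: system instructions describing the tables and disambiguation steps, followed by worked question/steps/SQL examples. The table names, column rules, and example queries below are hypothetical placeholders; substitute your own.

```python
# Hypothetical table notes and worked examples -- replace with your own.
SYSTEM_PROMPT = """You translate business questions into SQL.
Rules:
- revenue_cumulative is a running total; use revenue_daily for per-day figures.
- If no date range is given, default to the last 30 days.
"""

FEW_SHOT = [
    {"role": "user", "content": "What was revenue last week?"},
    {"role": "assistant", "content": (
        "Steps: 'revenue' without 'cumulative' -> revenue_daily; "
        "'last week' -> previous 7 days.\n"
        "SELECT SUM(revenue_daily) FROM sales "
        "WHERE sale_date >= CURRENT_DATE - INTERVAL '7 days';"
    )},
]

def build_messages(question: str) -> list:
    """Assemble system instructions, few-shot examples, and the new question."""
    return ([{"role": "system", "content": SYSTEM_PROMPT}]
            + FEW_SHOT
            + [{"role": "user", "content": question}])

def question_to_sql(question: str) -> str:
    """Send the assembled messages through the chat completion endpoint."""
    import openai  # imported here so the assembly logic is testable offline
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=build_messages(question),
    )
    return response["choices"][0]["message"]["content"]
```

Every example added to `FEW_SHOT` eats into the same token budget as the instructions, which is exactly the limitation described below.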

It is working pretty well, but I am limited in the number of examples I can give in the API call on top of the context and explanations.
On one side, I find the ability to explain which steps need to be followed when a business request arrives (through prompt engineering) very powerful.
On the other side, the API limitation (5,000 tokens) prevents me from providing many examples that could be very beneficial for the model to learn how to handle queries (through fine-tuning).

Question is: what is the best way to combine the two?
My initial thoughts were to:

  1. Provide questions and answers and to fine-tune a model
  2. Add custom prompts on top of this fine-tuned model
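For step 1, one way to keep your guidance from being lost is to bake the disambiguation steps into the completions themselves, so the fine-tuned model learns the reasoning and not just the final SQL. A sketch of preparing the JSONL training file (the question, steps, separators, and query are hypothetical; the `prompt`/`completion` JSONL shape is what the OpenAI fine-tuning tooling expects):

```python
import json

# Hypothetical training pairs: the completion includes the clarification
# steps followed by the SQL, so the model learns to reason the same way
# the prompt instructions do.
examples = [
    {
        "prompt": "Which products sold best last month?\n\n###\n\n",
        "completion": (
            " Steps: 'sold best' -> ORDER BY units DESC; "
            "'last month' -> previous calendar month.\n"
            "SELECT product, SUM(units) AS units\n"
            "FROM sales\n"
            "WHERE sale_month = DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '1 month'\n"
            "GROUP BY product ORDER BY units DESC;\nEND"
        ),
    },
]

def write_jsonl(rows, path):
    """Write fine-tuning examples, one JSON object per line."""
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

write_jsonl(examples, "sql_finetune.jsonl")
```

With the steps inside every completion, the fine-tuned model is trained on your guidance rather than just question-to-query pairs.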

But the drawback I see here is that the model would be fine-tuned without my guidance (what I put in the prompt about which steps should be followed to clarify the business input), which would be a shame.

What would be your advice / ideas?


Hey @cavadeos,
Have you tried embeddings?
Also, have you found a solution to your problem?
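One way embeddings could help here: embed all your worked examples once, then at query time retrieve only the few most similar to the incoming question, so the limited token budget is spent on relevant examples. A minimal sketch of the retrieval side (in practice the vectors would come from an embeddings endpoint such as `openai.Embedding.create`; the vectors here are placeholders):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, example_vecs, k=2):
    """Indices of the k stored example vectors closest to the query."""
    ranked = sorted(range(len(example_vecs)),
                    key=lambda i: cosine(query_vec, example_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

The selected examples are then pasted into the prompt as few-shot demonstrations, instead of a fixed set.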

I don’t have a perfect solution, but I have played with this idea as well. Assuming your tables are named at least somewhat coherently, I found it most effective to give lots of examples of how I join the tables. That already takes care of the most tedious part of writing SQL queries (for me).

I could imagine having multiple API calls:

  1. To determine how to join tables
  2. To extract which tables are needed from the request
  3. Match tables with schemas in code
  4. Combine all collected info into one query

Something like:

  1. “From the question, write SQL code joining the tables needed. Here are a bunch of examples of how I joined tables in the past.”
  2. “From the question extract which tables are needed to complete this request. (Here are available tables: XXXX) Return a list.”
  3. Run over the list you get in 2) with python and depending on which tables are needed find the relevant schema for those tables
  4. “You’re an assistant that helps write SQL queries. Given the question, answer {1}, and answer {3}, write a full SQL query.”
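A minimal orchestration of those four steps might look like the sketch below. Here `ask_llm` stands in for whatever chat-completion wrapper you use, and the table names, schemas, and prompt wording are hypothetical:

```python
# Hypothetical schema lookup table -- step 3 happens in code, no API call.
SCHEMAS = {
    "orders": "orders(order_id, customer_id, order_date, total)",
    "customers": "customers(customer_id, name, region)",
}

def pipeline(question, ask_llm):
    # 1) Propose joins, guided by past join examples in the prompt.
    joins = ask_llm(f"From the question, write SQL joining the tables needed. "
                    f"Question: {question}")
    # 2) Extract which tables the request needs, as a comma-separated list.
    tables_raw = ask_llm(f"List the tables needed (available: "
                         f"{', '.join(SCHEMAS)}). Question: {question}")
    tables = [t.strip() for t in tables_raw.split(",") if t.strip() in SCHEMAS]
    # 3) Look up only the relevant schemas in code.
    schemas = "\n".join(SCHEMAS[t] for t in tables)
    # 4) Combine everything into the final query request.
    return ask_llm(f"You're an assistant that writes SQL. Question: {question}\n"
                   f"Joins: {joins}\nSchemas:\n{schemas}\nWrite the full query.")
```

Passing `ask_llm` in as a parameter also makes the flow easy to test with a stubbed model before spending tokens.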

Obviously, the ideal would be to feed the entire schema every time, but that wouldn’t work with the current token limits; hopefully something like this could help you be more selective.


I think semantic search will be less accurate than SQL search, as it is probabilistic.