I am developing an AI chatbot application that generates SQL queries in response to natural-language inputs. The database schema I'm working with consists of 670 tables, totaling nearly 1,724,789 tokens. Due to the model's context-size limitation, I cannot pass the entire schema with every query. I'm looking for guidance on the best approach to handle this effectively.
Hi @vinchurkarp77 and welcome to the community! Text-to-SQL is a super active area of research. Just glancing at your problem, I would say you are dealing with too many tables at once for this to be effective. I would try to cluster the problem, so you have different clusters of tables; joins become a tricky thing here, of course. My advice is to start small with 2-3 tables and test it out first. And normally you have to do a lot more than just feed in the schemas, but as I said, it's a very active area of research.
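To make the clustering idea concrete, here is a minimal sketch, assuming the OpenAI embeddings API and scikit-learn are available; the table DDLs and cluster count are placeholders, not a prescription. The idea: embed each table's DDL, group tables with k-means, then route an incoming question to the nearest cluster and pass only that cluster's schemas to the model.

```python
from openai import OpenAI
from sklearn.cluster import KMeans

client = OpenAI()

# Toy stand-ins; in practice this would be the DDL for all 670 tables.
table_ddls = {
    "orders": "CREATE TABLE orders (id INT, customer_id INT, total DECIMAL)",
    "customers": "CREATE TABLE customers (id INT, name TEXT, region TEXT)",
    "payments": "CREATE TABLE payments (id INT, order_id INT, amount DECIMAL)",
    "regions": "CREATE TABLE regions (id INT, name TEXT)",
}
names = list(table_ddls)
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=[table_ddls[n] for n in names],
)
vectors = [d.embedding for d in resp.data]

# n_clusters is a tuning knob; with 670 tables you would try far more than 2.
kmeans = KMeans(n_clusters=2, random_state=0).fit(vectors)
clusters = {}
for name, label in zip(names, kmeans.labels_):
    clusters.setdefault(int(label), []).append(name)
print(clusters)  # e.g. {0: ["orders", "payments"], 1: ["customers", "regions"]}
```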
Thank you for your interest in this. I've successfully tested this approach with smaller database schemas of 4-6 tables, and it worked well. For a schema this large, though, I'm struggling to find an approach that handles it effectively.
Hello,
I am neither an engineer nor an LLM expert, but here is my recommendation based on the text-2-sql chatbot I am trying to create.
You need to construct a prompt with relevant contextual examples in addition to the chunk metadata. Say you have 1,000 training questions with validated SQL queries as answers. Ideally, before passing a question to the LLM, you would provide it with the most relevant examples rather than only dumping the schema/metadata information.
Here is how Vanna AI does it: Google "ai-sql-accuracy-2023-08-17 Vanna". I don't work for Vanna; I just created my own method based on their proposition.
The way I handle it: I use a cosine similarity (L2 distance) search to get the 3 most relevant chunks. I then do another search within the training dataset, which gives me the 3 most relevant question/query pairs as examples. I also cross-check whether the chunks used in the relevant context (the example dataset) are among the relevant chunks from the previous step; if not, I add them to the list of relevant chunks.
Note: if you believe you may need more than 3 chunks in some cases, change it to the 5 most relevant chunks. Your training dataset will tell you how many chunks are used in 99% of the cases; you should check that.
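Here is a minimal sketch of those two retrieval steps plus the cross-check, with random vectors standing in for real embeddings. The `chunk_ids` field on each example, recording which metadata chunks its validated query touched, is my assumption about how you would store that link.

```python
import numpy as np

def top_k(query_vec, matrix, k=3):
    # Rank by cosine similarity: normalize, take the k highest dot products.
    q = query_vec / np.linalg.norm(query_vec)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return np.argsort(m @ q)[::-1][:k].tolist()

# Toy stand-ins; in practice these come from your embedding model.
rng = np.random.default_rng(0)
chunk_vecs = rng.normal(size=(10, 8))   # one row per schema/metadata chunk
example_vecs = rng.normal(size=(6, 8))  # one row per training question
question_vec = rng.normal(size=8)       # embedding of the incoming question
examples = [
    {"question": f"q{i}", "sql": f"SELECT {i}", "chunk_ids": [i]}
    for i in range(6)
]

chunk_ids = top_k(question_vec, chunk_vecs, k=3)
example_ids = top_k(question_vec, example_vecs, k=3)

# Cross-check: any chunk a retrieved example relies on joins the chunk list.
for i in example_ids:
    for cid in examples[i]["chunk_ids"]:
        if cid not in chunk_ids:
            chunk_ids.append(cid)
```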
Then I construct the prompt as usual:
Your task is to answer {QUESTION}
{SYSTEM INSTRUCTIONS}
Here are relevant metadata {METADATA}
Here are some example queries for similar questions {RELEVANT_CONTEXT}
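Filled in with the placeholders from the sketch above (`chunks` and `question` are illustrative stand-ins), the template might be assembled like this:

```python
# Continuing the retrieval sketch; chunks and question are placeholders.
chunks = [f"CREATE TABLE t{i} (col_a INT, col_b TEXT)" for i in range(10)]
question = "What was the total revenue per region last month?"
system_instructions = "Return a single SQLite-compatible SQL query and nothing else."

metadata = "\n\n".join(chunks[cid] for cid in chunk_ids)
relevant_context = "\n\n".join(
    f"Q: {examples[i]['question']}\nSQL: {examples[i]['sql']}"
    for i in example_ids
)

prompt = (
    f"Your task is to answer {question}\n"
    f"{system_instructions}\n"
    f"Here are relevant metadata:\n{metadata}\n"
    f"Here are some example queries for similar questions:\n{relevant_context}"
)
```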
You can specify the return format within the system instructions, but that is not the most reliable method, so I use a function call to a text2sql function I created. Asking the LLM to return its answer in a specified format ensures a consistent response; you simply pass this function in your LLM call as a tool.
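A sketch of that tool-call route, assuming the OpenAI Python SDK; the `text2sql` schema here is illustrative, not the author's exact function. Forcing the tool choice means the SQL always arrives as a structured argument rather than free-form text.

```python
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "text2sql",
        "description": "Execute a SQL query that answers the user's question.",
        "parameters": {
            "type": "object",
            "properties": {
                "sql": {"type": "string", "description": "The SQL query."},
            },
            "required": ["sql"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: any tool-calling model works here
    messages=[{"role": "user", "content": prompt}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "text2sql"}},
)
sql = json.loads(resp.choices[0].message.tool_calls[0].function.arguments)["sql"]
```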
There are also other methods such as:
- asking the AI whether it can answer the query with the relevant metadata (1 or 0), looping the similarity search until the AI returns 1, and only then passing the successful METADATA to the LLM (see the sketch after this list)
- rerankers
- using the ChatGPT vector store to hold table information and relying on its proprietary file-search algorithms
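For the first option, here is a rough sketch of that sufficiency loop, reusing `top_k`, `chunks`, and the `client` from the sketches above; the 1/0 wording, the model name, and the widening schedule are my assumptions.

```python
def gather_sufficient_metadata(question, question_vec, max_k=10):
    # Widen the retrieval until the model judges the metadata sufficient.
    metadata = ""
    for k in range(3, max_k + 1):
        ids = top_k(question_vec, chunk_vecs, k=k)
        metadata = "\n\n".join(chunks[i] for i in ids)
        check = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    "Can the question below be answered using only this "
                    f"metadata? Reply with 1 or 0 only.\n\n"
                    f"Question: {question}\n\nMetadata:\n{metadata}"
                ),
            }],
        )
        if check.choices[0].message.content.strip().startswith("1"):
            return metadata
    return metadata  # fall back to the widest set tried
```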