I am developing an AI chatbot application that generates SQL queries in response to natural-language inputs. The database schema I'm working with consists of 670 tables, totaling nearly 1,724,789 tokens. Due to the model's context-size limitation, I cannot pass the entire schema with every query. I'm looking for guidance on the best approach to handle this effectively.
Hi @vinchurkarp77 and welcome to the community! Text-to-SQL is a super active area of research. Just glancing at your problem, I would say you are dealing with too many tables at once for this to be effective. I would try to cluster the problem, so you have different clusters of tables; joins become a tricky thing here, of course. My advice is to start small with 2-3 tables and test it out first. And normally you have to do a lot more than just feed in the schemas, but as I said, it's a very active area of research.
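To make the clustering idea concrete, here is a minimal sketch, assuming the OpenAI embeddings API and scikit-learn are available; the table DDLs and cluster count are placeholders, not a prescription. The idea: embed each table's DDL, group tables with k-means, then route an incoming question to the nearest cluster and pass only that cluster's schemas to the model.

```python
from openai import OpenAI
from sklearn.cluster import KMeans

client = OpenAI()

# Toy stand-ins; in practice this would be the DDL for all 670 tables.
table_ddls = {
    "orders": "CREATE TABLE orders (id INT, customer_id INT, total DECIMAL)",
    "customers": "CREATE TABLE customers (id INT, name TEXT, region TEXT)",
    "payments": "CREATE TABLE payments (id INT, order_id INT, amount DECIMAL)",
    "regions": "CREATE TABLE regions (id INT, name TEXT)",
}
names = list(table_ddls)
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=[table_ddls[n] for n in names],
)
vectors = [d.embedding for d in resp.data]

# n_clusters is a tuning knob; with 670 tables you would try far more than 2.
kmeans = KMeans(n_clusters=2, random_state=0).fit(vectors)
clusters = {}
for name, label in zip(names, kmeans.labels_):
    clusters.setdefault(int(label), []).append(name)
print(clusters)  # e.g. {0: ["orders", "payments"], 1: ["customers", "regions"]}
```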
Thank you for your interest in this. I've successfully tested this approach with smaller database schemas of 4-6 tables, and it worked well. For a schema this large, though, I'm struggling to find an approach that handles it effectively.
Hello,
I am neither an engineer nor an LLM expert, but here is my recommendation based on the text-2-sql chatbot I am trying to create.
You need to construct a prompt with relevant contextual examples in addition to the chunk metadata. Say you have 1,000 training questions with validated SQL queries as answers. Ideally, before passing a question to the LLM, you would provide it with the most relevant examples rather than only dumping the schema/metadata information.
Here is how Vanna AI does it: Google "ai-sql-accuracy-2023-08-17 Vanna". I don't work for Vanna; I just created my own method based on their proposition.
The way I handle it: I use a cosine similarity (L2 distance) search to get the 3 most relevant chunks. I then do another search within the training dataset, which gives me the 3 most relevant question/query pairs as examples. I also cross-check whether the chunks used in the relevant context (the example dataset) are among the relevant chunks from the previous step; if not, I add them to the list of relevant chunks.
Note: if you believe you may need more than 3 chunks in some cases, change it to the 5 most relevant chunks. Your training dataset will tell you how many chunks are used in 99% of the cases; you should check that.
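Here is a minimal sketch of those two retrieval steps plus the cross-check, with random vectors standing in for real embeddings. The `chunk_ids` field on each example, recording which metadata chunks its validated query touched, is my assumption about how you would store that link.

```python
import numpy as np

def top_k(query_vec, matrix, k=3):
    # Rank by cosine similarity: normalize, take the k highest dot products.
    q = query_vec / np.linalg.norm(query_vec)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return np.argsort(m @ q)[::-1][:k].tolist()

# Toy stand-ins; in practice these come from your embedding model.
rng = np.random.default_rng(0)
chunk_vecs = rng.normal(size=(10, 8))   # one row per schema/metadata chunk
example_vecs = rng.normal(size=(6, 8))  # one row per training question
question_vec = rng.normal(size=8)       # embedding of the incoming question
examples = [
    {"question": f"q{i}", "sql": f"SELECT {i}", "chunk_ids": [i]}
    for i in range(6)
]

chunk_ids = top_k(question_vec, chunk_vecs, k=3)
example_ids = top_k(question_vec, example_vecs, k=3)

# Cross-check: any chunk a retrieved example relies on joins the chunk list.
for i in example_ids:
    for cid in examples[i]["chunk_ids"]:
        if cid not in chunk_ids:
            chunk_ids.append(cid)
```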
Then I construct the prompt as usual:
Your task is to answer {QUESTION}
{SYSTEM INSTRUCTIONS}
Here are relevant metadata {METADATA}
Here are some example queries for similar questions {RELEVANT_CONTEXT}
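Filled in with the placeholders from the sketch above (`chunks` and `question` are illustrative stand-ins), the template might be assembled like this:

```python
# Continuing the retrieval sketch; chunks and question are placeholders.
chunks = [f"CREATE TABLE t{i} (col_a INT, col_b TEXT)" for i in range(10)]
question = "What was the total revenue per region last month?"
system_instructions = "Return a single SQLite-compatible SQL query and nothing else."

metadata = "\n\n".join(chunks[cid] for cid in chunk_ids)
relevant_context = "\n\n".join(
    f"Q: {examples[i]['question']}\nSQL: {examples[i]['sql']}"
    for i in example_ids
)

prompt = (
    f"Your task is to answer {question}\n"
    f"{system_instructions}\n"
    f"Here are relevant metadata:\n{metadata}\n"
    f"Here are some example queries for similar questions:\n{relevant_context}"
)
```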
You can specify the return format within the system instructions, but that is not the most reliable method, so I use a function call to a text2sql function I created. Asking the LLM to return its answer in a specified format ensures a consistent response; you simply pass this function in your LLM call as a tool.
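A sketch of that tool-call route, assuming the OpenAI Python SDK; the `text2sql` schema here is illustrative, not the author's exact function. Forcing the tool choice means the SQL always arrives as a structured argument rather than free-form text.

```python
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "text2sql",
        "description": "Execute a SQL query that answers the user's question.",
        "parameters": {
            "type": "object",
            "properties": {
                "sql": {"type": "string", "description": "The SQL query."},
            },
            "required": ["sql"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: any tool-calling model works here
    messages=[{"role": "user", "content": prompt}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "text2sql"}},
)
sql = json.loads(resp.choices[0].message.tool_calls[0].function.arguments)["sql"]
```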
There are also other methods such as:
- asking the AI whether it can answer the query with the relevant metadata (1 or 0), looping the similarity search until the AI returns 1, and only then passing the successful METADATA to the LLM (see the sketch after this list)
- rerankers
- using the ChatGPT vector store to hold table information and relying on its proprietary file-search algorithms
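For the first option, here is a rough sketch of that sufficiency loop, reusing `top_k`, `chunks`, and the `client` from the sketches above; the 1/0 wording, the model name, and the widening schedule are my assumptions.

```python
def gather_sufficient_metadata(question, question_vec, max_k=10):
    # Widen the retrieval until the model judges the metadata sufficient.
    metadata = ""
    for k in range(3, max_k + 1):
        ids = top_k(question_vec, chunk_vecs, k=k)
        metadata = "\n\n".join(chunks[i] for i in ids)
        check = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    "Can the question below be answered using only this "
                    f"metadata? Reply with 1 or 0 only.\n\n"
                    f"Question: {question}\n\nMetadata:\n{metadata}"
                ),
            }],
        )
        if check.choices[0].message.content.strip().startswith("1"):
            return metadata
    return metadata  # fall back to the widest set tried
```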