There’s not enough information to assess your approach. Are you using LangChain? If you aren’t, I recommend you do, because it will help you understand whether you are doing this right. Any deeper inspection would require putting you on a consulting clock, and you don’t want that.
I am actually doing something similar myself, but because I only have a couple of tables, I am able to send them directly in the system message.
Your approach is sound. Beyond the schema, I believe you will need to describe the overall functionality of each table as well as the specific functionality of each field. And, since you are using embeddings, I would be as detailed about this as possible. Why? Because it’s the vector store that will need to retrieve the correct tables based upon the user’s request. The better your table descriptions, the more likely it is to retrieve the correct table schemas. Generating the SQL is the easy part for the LLM, provided it has the requisite table info.
Also, I would keep tables that you expect to be used together frequently as close together in your embeddings as you can (same document). That will help things along as well.
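To make the grouping idea concrete, here is a rough sketch of what I mean, as an illustration rather than my actual code: each document in the vector store holds the schemas and descriptions of tables that are usually queried together, and retrieval is a plain cosine-similarity lookup. The table groupings, helper names, and embedding model below are placeholders.

```python
# Sketch only: group tables that are usually queried together into one
# document, embed each document, and retrieve by cosine similarity.
# Table names, groupings, and the embedding model are illustrative.
from openai import OpenAI
import numpy as np

client = OpenAI()

schema_docs = {
    "logging": (
        "solrai_log: one row per request. Fields: logId (unique id), "
        "logUser (account that generated the record), logIp (caller IP)...\n"
        "solrai_user: one row per account. Fields: username, ...\n"
        "These tables join on solrai_log.logUser = solrai_user.username."
    ),
    # "billing": "schemas + descriptions of the tables used together for billing ...",
}

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [np.array(d.embedding) for d in resp.data]

doc_names = list(schema_docs)
doc_vecs = embed([schema_docs[n] for n in doc_names])

def retrieve_schema(question, top_k=1):
    """Return the schema document(s) most similar to the user's question."""
    q = embed([question])[0]
    scores = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v))) for v in doc_vecs]
    ranked = sorted(zip(scores, doc_names), reverse=True)
    return [schema_docs[name] for _, name in ranked[:top_k]]

# The retrieved document(s) then go into the system message for the SQL call.
print(retrieve_schema("How many queries did each user run last week?"))
```

The point is that when two tables live in the same document, a question that needs both of them pulls in both schemas with a single retrieval hit.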
Thank you for your reply. Yes, a consulting clock won’t be affordable for me as of now.
But can you please help me validate the approach? I am using LlamaIndex for this. If LangChain has better functionality, then I don’t mind switching to LangChain.
So, in each call to the model, I send a system message like this:
Given the following SQL tables, your job is to write an SQL query given the user’s request. You may not write queries that contain CREATE, INSERT, UPDATE, DROP, DELETE, ALTER, RENAME, TRUNCATE or any command that alters the database structure or data in any way. Only return one SQL statement. No comments. There are only two tables: solrai_log uses the logUser field and solrai_user uses the username field. Only these two tables are to be used. Also remember that a ‘query’ is a logType. So, the term ‘queries’ always refers to logType. Use the field titles for headers instead of the field names, but use the field names for the commands.
solrai_log table
  logId: int unsigned, 10
    title: Log ID
    description: The unique integer that identifies this log record.
  logUser: varchar, 25
    title: Username
    description: The user account associated with this log record. “system” means this record was created by a system process.
  logIp: varchar, 25
    title: IP
    description: The IP address of the person or entity generating this log record. Generally not applicable to “system” processes.
  etc…
Only return an SQL statement. No comments. Generate response ONLY in SQL. Return no text other than the executable SQL.
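I am going through LlamaIndex, but stripped down, the call looks roughly like this. This is a simplified sketch shown with the raw OpenAI client for clarity, not my exact code; the model name and variable names are placeholders.

```python
# Simplified sketch of the SQL-generation call; the real system message is the
# full schema text shown above, plus the rules about allowed statements.
from openai import OpenAI

client = OpenAI()

SYSTEM_MESSAGE = """Given the following SQL tables, your job is to write an SQL query
given the user's request. You may not write queries that contain CREATE, INSERT,
UPDATE, DROP, DELETE, ALTER, RENAME, TRUNCATE ...
<table schemas with titles and descriptions go here>
Only return an SQL statement. No comments."""

def generate_sql(user_question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",   # placeholder; use whichever model you call
        temperature=0,           # deterministic output helps for SQL generation
        messages=[
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": user_question},
        ],
    )
    return resp.choices[0].message.content.strip()

print(generate_sql("How many queries did each user run yesterday?"))
```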
Totally not true. Well, in my world, totally not true. Can’t speak for anyone else.
These calls are only made through a read-only connection. I also did not mention that I have a second system message that is applied after the SQL statement is created; I just wanted to give a simple example here.
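For anyone worried about safety: besides the read-only connection, you can also put a hard check in front of execution. The sketch below is illustrative only (keyword list, file name, and SQLite are stand-ins for whatever driver you use), and it is in addition to, not instead of, a database user that only has SELECT privileges.

```python
# Sketch of an application-side guard. The real protection is the read-only
# database user/connection; this is just a second line of defense.
import re
import sqlite3  # stand-in for whatever driver you actually use

FORBIDDEN = re.compile(
    r"\b(CREATE|INSERT|UPDATE|DROP|DELETE|ALTER|RENAME|TRUNCATE|GRANT|REVOKE)\b",
    re.IGNORECASE,
)

def run_readonly(sql: str):
    statement = sql.strip().rstrip(";")
    if ";" in statement:
        raise ValueError("Only one statement is allowed.")
    if not statement.upper().startswith(("SELECT", "WITH")):
        raise ValueError("Only SELECT queries are allowed.")
    if FORBIDDEN.search(statement):
        raise ValueError("Query contains a forbidden keyword.")
    # SQLite read-only mode shown for a self-contained example; with MySQL or
    # Postgres you would instead connect as a SELECT-only user.
    conn = sqlite3.connect("file:solrai.db?mode=ro", uri=True)
    try:
        return conn.execute(statement).fetchall()
    finally:
        conn.close()
```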
I built a Bot using GPT-3.5 for the same purpose. For a small number of tables, it provided very high accuracy. In your case, I would work on the data side. If your data comes from an OLTP system, ETL it into a data warehouse (DWH) that is better suited to analysis. Build data marts for business function areas. The Bot then queries the data marts, with the business logic handled in the back end, rather than the entire set of tables. This approach reduces your number of tables dramatically. To cut down the number of tables further, use views to join some tables in a data mart and have the Bot query the views.
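As a rough illustration of the view idea (the table, view, and column names here are invented, and SQLite is used only to keep the example self-contained): collapse a few joined tables in a data mart into one view, and expose only that view’s schema to the Bot.

```python
# Illustration only: fold a small data mart into one view so the Bot only ever
# sees the view's schema, not the underlying tables. Names are invented.
import sqlite3

conn = sqlite3.connect("sales_mart.db")

conn.executescript("""
CREATE VIEW IF NOT EXISTS v_daily_sales AS
SELECT o.order_date,
       c.region,
       p.category,
       SUM(o.quantity * p.unit_price) AS revenue
FROM   orders o
JOIN   customers c ON c.customer_id = o.customer_id
JOIN   products  p ON p.product_id  = o.product_id
GROUP  BY o.order_date, c.region, p.category;
""")

# The prompt then only has to describe v_daily_sales (four columns) instead of
# three base tables, which keeps the schema the LLM sees small and focused.
```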