NL to SQL, very large database: new tools?

Hi!

I am building an NL-to-SQL bot for my company that should be able to query our database. The problem is that the database is huge: the section I have to query spans over 100 tables.

With the old gpt-3.5 context window, I managed to get a working version by working in layers: I first have the bot select the relevant tables from the schema, and then I provide the schema of only those selected tables in context. Without fine-tuning, I got decent accuracy, which looked promising as a starting point for fine-tuning.
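
Roughly, the first (table-selection) layer looks something like the sketch below, written with the openai Python SDK. It is a simplified version, not my exact code; the model name, the prompt wording, and `TABLE_DESCRIPTIONS` are placeholders.

```python
from openai import OpenAI

client = OpenAI()

# Placeholder: in my setup this comes from a file describing the schema,
# one line per table with a short description.
TABLE_DESCRIPTIONS = """
customers: one row per customer, with contact and billing info
orders: one row per order, customer_id links back to customers
order_items: line items per order, linking orders to products
... (100+ more tables)
"""

def select_tables(question: str) -> list[str]:
    """Layer 1: given only table names + descriptions, pick the relevant tables."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are given a list of database tables with short descriptions.\n"
                    "Return a comma-separated list of the table names needed to answer\n"
                    "the user's question. Return table names only, nothing else.\n\n"
                    + TABLE_DESCRIPTIONS
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return [name.strip() for name in response.choices[0].message.content.split(",")]
```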

But then gpt-4-turbo was released, together with GPTs and similar tools. This is quite exciting, but also very confusing. I am wondering how to leverage these new tools to improve my bot (better accuracy or fewer tokens used; both would be improvements).

I've seen that GPTs and Assistants are presented as an 'alternative' to fine-tuning a model toward a desired behaviour. In particular, you can pass documents to them, and they will take their information from those documents.

Naturally, my first instinct was to ask myself: "Should I now pass the database schema as a file to an Assistant/GPT?" But then I did some searching, and it looks like file contents are sent in the context anyway, so I do not know whether that would help.

Other topics suggested vectorizing the DB schema and giving it to OpenAI as an embedded file. Apart from the fact that I don't know how to do that (though I can research it), I was under the impression that an embedded file allows more "freeform" interpretation by the AI. For example, I doubt the bot would get table and field names right, and it would be a headache to correct it. Am I right?
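
(From what I gather, the suggestion amounts to something like the sketch below: embed one short description per table, retrieve the closest ones for a given question, and then paste the exact schema text of those tables into the prompt, so the table and field names come from your own schema rather than from the model's memory. I have not tested this; the embedding model name and the helper names are just placeholders.)

```python
from openai import OpenAI

client = OpenAI()

# Placeholder: one short description per table, keyed by table name.
TABLE_DOCS = {
    "customers": "customers: contact and billing info, one row per customer",
    "orders": "orders: one row per order, customer_id links back to customers",
    # ... 100+ more tables
}

def embed(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return [item.embedding for item in resp.data]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

# Embed the descriptions once and store them (a file or a vector store would do).
table_names = list(TABLE_DOCS)
table_vectors = embed([TABLE_DOCS[name] for name in table_names])

def retrieve_tables(question: str, k: int = 10) -> list[str]:
    """Return the k tables whose descriptions are most similar to the question."""
    question_vector = embed([question])[0]
    ranked = sorted(
        zip(table_names, table_vectors),
        key=lambda pair: cosine(question_vector, pair[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]
```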

I'd like to know whether you have already applied these new tools to an NL-to-SQL problem, and with what results.

Thanks!


I was wondering if you ever figured out how to embed the DB schema; I'm looking into doing that myself. I was also wondering whether doing so turned out to be worth it. I'm exploring options for feeding the model DB schema information when the structure is too large to send all at once. (Not to mention that the cost per call is ridiculous if you're providing the entire schema every time you want it to generate SQL for you.)


Hi,

Sorry for answering this late.

I had to work around the schema problem in quite a roundabout way. I think that, at the end of the day, I did something like what LangChain does, only less refined (but also less costly).

What I did was define a file with the database schema and write essentially two LLM layers: one receives only the table names and their descriptions and selects the tables that will be used in the query; the other receives only the schemas of the selected tables and generates the query. This way, I never have to send the entire schema (which does not fit in context).
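
Schematically, the second layer just pastes the schemas of the selected tables into the prompt and asks for the query. Again, this is a simplified sketch and not my exact code; the model name, the prompt wording, and `SCHEMA_BY_TABLE` are placeholders.

```python
from openai import OpenAI

client = OpenAI()

# Placeholder: table name -> CREATE TABLE statement, loaded from the schema file.
SCHEMA_BY_TABLE = {
    "customers": "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT);",
    "orders": "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);",
    # ... 100+ more tables
}

def write_sql(question: str, selected_tables: list[str]) -> str:
    """Layer 2: given only the schemas of the selected tables, write the query."""
    schema_snippet = "\n\n".join(SCHEMA_BY_TABLE[name] for name in selected_tables)
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder model name
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": (
                    "Write a single SQL query that answers the user's question, "
                    "using only these tables:\n\n" + schema_snippet
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```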

This bot is usually quite successful. You can add layers ad libitum: as an example, I have a layer that sends the generated SQL to a fake, empty database containing only the schema, and if it comes back with errors, a correction layer of the LLM fixes the query.
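
The validation layer can be as simple as the sketch below: build an empty in-memory SQLite database from the DDL, try the query, and pass any error to a correction prompt. This is a simplified illustration rather than my exact code, and SQLite's dialect may not match your production database.

```python
import sqlite3

from openai import OpenAI

client = OpenAI()

def validate_sql(sql: str, ddl_statements: list[str]) -> str | None:
    """Run the query against an empty database built from the schema only.
    Returns None if the query runs, otherwise the error message."""
    conn = sqlite3.connect(":memory:")
    try:
        for ddl in ddl_statements:
            conn.execute(ddl)
        conn.execute(sql)  # tables are empty, so this only checks that the SQL is valid
        return None
    except sqlite3.Error as exc:
        return str(exc)
    finally:
        conn.close()

def correct_sql(question: str, sql: str, error: str) -> str:
    """Correction layer: feed the failing query and the error back to the LLM."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder model name
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": "Fix the SQL query so that it runs without errors. "
                           "Return only the corrected SQL.",
            },
            {
                "role": "user",
                "content": f"Question: {question}\n\nQuery:\n{sql}\n\nError: {error}",
            },
        ],
    )
    return response.choices[0].message.content
```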