NL to SQL, very large database: new tools?

Hi!

I am building an NL-to-SQL bot for my company that should be able to query our database. The problem is that the database is huge: the section I have to query spans over 100 tables.

With the old gpt-3.5 context window, I managed to get a working version by working in layers: I first have the bot select the relevant tables from the schema, and then I provide the schema of only those selected tables in context. Without fine-tuning, I got decent accuracy, which looked promising as a starting point for fine-tuning.
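
Roughly, the first (table-selection) layer looks something like the sketch below, written with the openai Python SDK. It is a simplified version, not my exact code; the model name, the prompt wording, and `TABLE_DESCRIPTIONS` are placeholders.

```python
from openai import OpenAI

client = OpenAI()

# Placeholder: in my setup this comes from a file describing the schema,
# one line per table with a short description.
TABLE_DESCRIPTIONS = """
customers: one row per customer, with contact and billing info
orders: one row per order, customer_id links back to customers
order_items: line items per order, linking orders to products
... (100+ more tables)
"""

def select_tables(question: str) -> list[str]:
    """Layer 1: given only table names + descriptions, pick the relevant tables."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are given a list of database tables with short descriptions.\n"
                    "Return a comma-separated list of the table names needed to answer\n"
                    "the user's question. Return table names only, nothing else.\n\n"
                    + TABLE_DESCRIPTIONS
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return [name.strip() for name in response.choices[0].message.content.split(",")]
```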

But then gpt-4-turbo was released, together with GPTs and similar tools. This is quite exciting, but also very confusing. I am wondering how to leverage these new tools to improve my bot (better accuracy or fewer tokens used; both would be improvements).

I've seen that GPTs and Assistants are presented as an 'alternative' to fine-tuning a model toward a desired behaviour. In particular, you can pass documents to them, and they will take their information from those documents.

Naturally, my first instinct was to ask myself: "Should I now pass the database schema as a file to an Assistant/GPT?" But then I did some searching, and it looks like file contents are sent in the context anyway, so I do not know whether that would help.

Other topics suggested vectorizing the DB schema and giving it to OpenAI as an embedded file. Apart from the fact that I don't know how to do that (though I can research it), I was under the impression that an embedded file allows more "freeform" interpretation by the AI. For example, I doubt the bot would get table and field names right, and it would be a headache to correct it. Am I right?
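
(From what I gather, the suggestion amounts to something like the sketch below: embed one short description per table, retrieve the closest ones for a given question, and then paste the exact schema text of those tables into the prompt, so the table and field names come from your own schema rather than from the model's memory. I have not tested this; the embedding model name and the helper names are just placeholders.)

```python
from openai import OpenAI

client = OpenAI()

# Placeholder: one short description per table, keyed by table name.
TABLE_DOCS = {
    "customers": "customers: contact and billing info, one row per customer",
    "orders": "orders: one row per order, customer_id links back to customers",
    # ... 100+ more tables
}

def embed(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return [item.embedding for item in resp.data]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

# Embed the descriptions once and store them (a file or a vector store would do).
table_names = list(TABLE_DOCS)
table_vectors = embed([TABLE_DOCS[name] for name in table_names])

def retrieve_tables(question: str, k: int = 10) -> list[str]:
    """Return the k tables whose descriptions are most similar to the question."""
    question_vector = embed([question])[0]
    ranked = sorted(
        zip(table_names, table_vectors),
        key=lambda pair: cosine(question_vector, pair[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]
```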

I'd like to know whether you have already applied these new tools to an NL-to-SQL problem, and with what results.

Thanks!


I was wondering if you ever figured out how to embed the DB schema; I'm looking into doing that myself. I was also wondering whether doing so turned out to be worth it. I'm exploring options for feeding the model DB schema information when the structure is too large to send all at once. (Not to mention that the cost per call is ridiculous if you're providing the entire schema every time you want it to generate SQL for you.)


Hi,

Sorry for answering this late.

I had to work around the schema problem in quite a roundabout way. I think that, at the end of the day, I did something like what LangChain does, only less refined (but also less costly).

What I did was define a file with the database schema and write essentially two LLM layers: one receives only the table names and their descriptions and selects the tables that will be used in the query; the other receives only the schemas of the selected tables and generates the query. This way, I never have to send the entire schema (which does not fit in context).
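
Schematically, the second layer just pastes the schemas of the selected tables into the prompt and asks for the query. Again, this is a simplified sketch and not my exact code; the model name, the prompt wording, and `SCHEMA_BY_TABLE` are placeholders.

```python
from openai import OpenAI

client = OpenAI()

# Placeholder: table name -> CREATE TABLE statement, loaded from the schema file.
SCHEMA_BY_TABLE = {
    "customers": "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT);",
    "orders": "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);",
    # ... 100+ more tables
}

def write_sql(question: str, selected_tables: list[str]) -> str:
    """Layer 2: given only the schemas of the selected tables, write the query."""
    schema_snippet = "\n\n".join(SCHEMA_BY_TABLE[name] for name in selected_tables)
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder model name
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": (
                    "Write a single SQL query that answers the user's question, "
                    "using only these tables:\n\n" + schema_snippet
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```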

This bot is usually quite successful. You can add layers ad libitum: as an example, I have a layer that sends the generated SQL to a fake, empty database containing only the schema, and if it comes back with errors, a correction layer of the LLM fixes the query.
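
The validation layer can be as simple as the sketch below: build an empty in-memory SQLite database from the DDL, try the query, and pass any error to a correction prompt. This is a simplified illustration rather than my exact code, and SQLite's dialect may not match your production database.

```python
import sqlite3

from openai import OpenAI

client = OpenAI()

def validate_sql(sql: str, ddl_statements: list[str]) -> str | None:
    """Run the query against an empty database built from the schema only.
    Returns None if the query runs, otherwise the error message."""
    conn = sqlite3.connect(":memory:")
    try:
        for ddl in ddl_statements:
            conn.execute(ddl)
        conn.execute(sql)  # tables are empty, so this only checks that the SQL is valid
        return None
    except sqlite3.Error as exc:
        return str(exc)
    finally:
        conn.close()

def correct_sql(question: str, sql: str, error: str) -> str:
    """Correction layer: feed the failing query and the error back to the LLM."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder model name
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": "Fix the SQL query so that it runs without errors. "
                           "Return only the corrected SQL.",
            },
            {
                "role": "user",
                "content": f"Question: {question}\n\nQuery:\n{sql}\n\nError: {error}",
            },
        ],
    )
    return response.choices[0].message.content
```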