Natural Language to SQL with a huge table schema

Hi there, I have been researching natural language to SQL. As seen in the SQL translate example, we can pass the table schema and get the query output, and it works pretty well. But what if there are more than 100 tables? I cannot pass the whole schema with every request. I have also seen that it’s currently not possible to fine-tune Codex. If we can do this with davinci, how should I fine-tune the model so it remembers my schema? Has anyone ever done this?

2 Likes

I’m keen to know the best practice for this too. Still exploring this myself.
I stumbled across this article, Natural Language to SQL from Scratch with Tensorflow | by Eileen Pangu | Towards Data Science

and I have also seen this service, which has implemented it: NLSQL RPA BI | Natural Language to SQL

I’m experimenting and have had some success, but I’m new to machine learning/AI and would appreciate some guidance as well. The databases I would be looking at have hundreds of tables and thousands of columns, so I have the same issue as you with prompt lengths at the moment.

2 Likes

Hi @Tim007, thanks for sharing the links. If you only have access to a REST API, are you able to extract the desired data?

Has anyone in the OpenAI community or elsewhere found a scalable solution for natural language to SQL conversion with a large number of tables (more than 100) that doesn’t require passing the entire table schema in each API request? Additionally, are there any best practices or techniques for using text-to-SQL services with large database schemas that still ensure accurate, reliable SQL query generation without hitting token-limit errors?

1 Like

Hi and welcome to the developer forum!

The GPT models are stateless, meaning they retain no memory of previous requests, so you must include all relevant context with every message.

Have you tried the GPT-3.5-Turbo-16K model, with its increased context size, for generating SQL queries over larger schemas?
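Something along these lines, if the whole schema fits in the 16k window (an untested sketch using the openai Python package; the file name and question are placeholders):

```python
import openai  # pip install openai

# The model keeps no state between calls, so the schema has to be sent every time.
schema = open("schema.sql").read()  # placeholder: your CREATE TABLE statements
question = "Total revenue per customer in 2023"  # placeholder user question

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-16k",
    messages=[
        {
            "role": "system",
            "content": "Translate the user's question into a SQL query "
                       "for the following schema:\n" + schema,
        },
        {"role": "user", "content": question},
    ],
    temperature=0,  # keep the output deterministic for query generation
)
print(response.choices[0].message.content)
```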

Thank you for reaching out. In my situation, the database comprises a substantial number of tables, 466 in total. Furthermore, the schema itself spans an extensive 122,651 tokens (counting each character as an individual token).

While investigating potential strategies to effectively harness the Text to SQL service for my use case, I came across examples that advocate the approach of splitting prompts into smaller chunks. This method is commonly employed in text summarization tasks. However, I recognize that this approach may not seamlessly translate to the nuanced complexities of generating SQL queries. SQL queries often require a coherent and comprehensive understanding of the entire schema, including relationships and inter-dependencies between tables, which may not be optimally preserved through a simple chunking process.

Given these considerations, I am seeking guidance on how to best leverage the Text to SQL service to accomplish my goal of generating accurate and contextually coherent SQL queries for my database with a large schema. Specifically, I am interested in understanding how to effectively structure and provide input to the model while ensuring that it comprehends the complete schema contextually.

Your expertise in this matter would be greatly appreciated. If you could share any recommended strategies, techniques, or best practices for utilizing the GPT-3.5-Turbo-16K model’s extended context for generating SQL queries in a complex schema environment, it would significantly aid me in achieving optimal results.

Thank you!

Maybe you can get lucky by embedding the database structure and then making calls against the embedded data?
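Roughly like this (a rough, untested sketch; the embedding model choice and helper names are just assumptions): embed each table’s DDL once, then at query time pick only the tables most similar to the question and send just that subset in the prompt.

```python
import numpy as np
import openai  # pip install openai numpy

def embed(texts):
    # One embedding vector per input string.
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([d["embedding"] for d in resp["data"]])

# Placeholder: one DDL string per table, e.g. pulled from information_schema.
table_ddls = {
    "orders": "CREATE TABLE orders (id INT, customer_id INT, total NUMERIC)",
    "customers": "CREATE TABLE customers (id INT, name TEXT)",
}

names = list(table_ddls)
table_vecs = embed([table_ddls[n] for n in names])  # compute once and cache

def relevant_schema(question, k=10):
    # Cosine similarity between the question and every table description.
    q = embed([question])[0]
    sims = table_vecs @ q / (np.linalg.norm(table_vecs, axis=1) * np.linalg.norm(q))
    top = np.argsort(sims)[::-1][:k]
    return "\n\n".join(table_ddls[names[i]] for i in top)

# relevant_schema("revenue per customer last year") is what goes into the prompt
# instead of all 466 tables (plus any tables they reference via foreign keys).
```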

I absolutely agree that you won’t be able to build advanced CTEs with window functions that run in sub-millisecond time over 50 tables by feeding the model chunks.

But I also don’t think that automatic SQL generation with a GPT model is the way to go. Just imagine someone prompt-injecting a DROP TABLE…

Maybe using it to create hundreds of endpoints, each with a specific task, and checking the SQL manually is a way to go. But never let a model create SQL that is run directly against a database.

I completely agree with your statement. I think embedding the database structure would be a good place to start. Will try that. Thank you!

SQL is structured enough that you can pre-check it before running it.
You could also run it on a connection with read-only permissions.
These are general recommendations that are a good idea even if you’re NOT using a model, btw 🙂
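For example, a crude pre-check plus a read-only connection might look like this (a sketch assuming Postgres with psycopg2 and sqlparse; the role name and query are placeholders, and it only guards against the obvious cases):

```python
import sqlparse   # pip install sqlparse
import psycopg2   # pip install psycopg2-binary

def is_single_select(sql: str) -> bool:
    """Allow exactly one statement, and only if it is a SELECT."""
    statements = [s for s in sqlparse.parse(sql) if s.token_first() is not None]
    return len(statements) == 1 and statements[0].get_type() == "SELECT"

generated_sql = "SELECT customer_id, SUM(total) FROM orders GROUP BY 1"  # from the model

if not is_single_select(generated_sql):
    raise ValueError("Refusing to run generated SQL: not a single SELECT")

# 'nl2sql_readonly' is a hypothetical role that only has SELECT grants.
conn = psycopg2.connect("dbname=mydb user=nl2sql_readonly")
with conn, conn.cursor() as cur:
    cur.execute(generated_sql)
    rows = cur.fetchall()
```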

2 Likes

Where is this example, please?

Update: Never mind. Found it: https://platform.openai.com/examples/default-sql-translate

Yeah, but you can take down a DB with a SELECT as well.

Multiple levels of subselects with Cartesian products and your database might be gone for good… You can also let it run some evil calculations and such.

Somewhat off topic at this point, but if you’re generating SQL using an LLM and are concerned with management of the queries generated:

  • Most databases let you specify how many resources (memory, CPU, locks) a particular query gets to use, so it can’t lock out all other queries.
  • If you’re using a database in production for OLTP, you probably already have timeouts for queries, and a separate analytics replica for long-running queries and OLAP (a minimal sketch follows this list).
  • And if you’re really modern, you use something like Snowflake that lets you spin up an entirely separate database service (warehouse) running on the same data, so you can totally isolate your workload.
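To make the timeout and resource points concrete, here’s a rough sketch with Postgres and psycopg2 (the role name and limits are placeholders; other databases have their own equivalents):

```python
import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect("dbname=mydb user=nl2sql_readonly")  # hypothetical read-only role
conn.set_session(readonly=True, autocommit=True)  # belt and braces on top of the grants

generated_sql = "SELECT customer_id, SUM(total) FROM orders GROUP BY 1"  # pre-checked query

with conn.cursor() as cur:
    cur.execute("SET statement_timeout = '5s'")  # abort anything slower than 5 seconds
    cur.execute("SET work_mem = '64MB'")         # cap per-operation sort/hash memory
    cur.execute(generated_sql)
    rows = cur.fetchall()
```
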
2 Likes