How to train Codex on a complex SQL legacy database model

ceesmeuleman · January 2, 2023, 10:39pm

My goal is to create an interface whereby a non technical user can input a normal language question and receive the matching SQL query for a specific legacy database. The challenge is that the database is not created with a consistent model.
F.e. The table humres (Human resources) , columns (res_id, fullname) is joined with a persons access roles, table hrroles columns (empid,roleid).

If I input the following into the sandbox, I get a valid query.

#SQL

# table = humres, columns = [res_id, fullname]
# table = hrroles, columns = [empid, roleid]
The table humres stores employee information, The table hrroles stores the access roles and is linked through empid

### a query displaying the fullname  for employees with role 14
SELECT

However the database has approximately 250 tables with 50 columns per table.

What would be the approach be to get all tables and linking information into an API, to accommodate a wide array of user questions?

nick.yang · January 3, 2023, 12:51am

First of all, at the moment, it is not possible to train or fine-tune Codex. However, it may be possible to find a workaround by using multiple calls with a combination of the Davinci and Codex models.

One approach to tackling this type of challenge could be to categorize the 250 tables into smaller groups based on their purpose, functionality, data, or other criteria. Then, you could use multiple steps to narrow down the required tables by giving users options to select or by using API calls to extract their intention. This could also help you save money by using fewer tokens in the requests.

Keep in mind that the Davinci model is a very powerful language model and is capable of understanding users’ requests and generating SQL statements.

Lastly, it is important to take precautions against so-called “prompt injection attacks”. You may find this blog (https://harishgarg.com/writing/protecting-against-gpt-3-prompt-injection-attack/) helpful for a quick understanding of how to protect against these types of attacks.

jacketinsf · February 14, 2023, 9:42pm

Ni Nick, can you elaborate on “giving users options to select or by using API calls to extract their intention.”?

chirag.shah285 · March 4, 2023, 2:06pm

Hey @ceesmeuleman did you figure out a solution to this? I’m in the same boat looking to load a lot of tables but obvious it’s a lot. Curious to see if you figured something out

ceesmeuleman · March 24, 2023, 2:34pm

I’m still looking into fine tuning a model on example questions and see how far we can train the model in this aspect. However providing enough valid training data (ak. Questions & correct queries) takes a lot of work.

Topic		Replies	Views
Fine-tune model based on mysql schema API	3	2513	January 6, 2023
Natural Language to SQL with huge table schema API	12	9238	December 19, 2023
How to fine tune text to sql? API	19	9160	April 26, 2024
Generate SQL queries combining prompt engineering and fine-tuning API	4	8835	December 24, 2023
Text 2 SQL - Schema too big to fit into completion API request API	2	1287	February 6, 2025

How to train Codex on a complex SQL legacy database model

Related topics