How to train Codex on a complex SQL legacy database model

My goal is to create an interface whereby a non technical user can input a normal language question and receive the matching SQL query for a specific legacy database. The challenge is that the database is not created with a consistent model.
F.e. The table humres (Human resources) , columns (res_id, fullname) is joined with a persons access roles, table hrroles columns (empid,roleid).

If I input the following into the sandbox, I get a valid query.


# table = humres, columns = [res_id, fullname]
# table = hrroles, columns = [empid, roleid]
The table humres stores employee information, The table hrroles stores the access roles and is linked through empid

### a query displaying the fullname  for employees with role 14

However the database has approximately 250 tables with 50 columns per table.

What would be the approach be to get all tables and linking information into an API, to accommodate a wide array of user questions?

First of all, at the moment, it is not possible to train or fine-tune Codex. However, it may be possible to find a workaround by using multiple calls with a combination of the Davinci and Codex models.

One approach to tackling this type of challenge could be to categorize the 250 tables into smaller groups based on their purpose, functionality, data, or other criteria. Then, you could use multiple steps to narrow down the required tables by giving users options to select or by using API calls to extract their intention. This could also help you save money by using fewer tokens in the requests.

Keep in mind that the Davinci model is a very powerful language model and is capable of understanding users’ requests and generating SQL statements.

Lastly, it is important to take precautions against so-called “prompt injection attacks”. You may find this blog ( helpful for a quick understanding of how to protect against these types of attacks.

1 Like

Ni Nick, can you elaborate on “giving users options to select or by using API calls to extract their intention.”?


Hey @ceesmeuleman did you figure out a solution to this? I’m in the same boat looking to load a lot of tables but obvious it’s a lot. Curious to see if you figured something out

I’m still looking into fine tuning a model on example questions and see how far we can train the model in this aspect. However providing enough valid training data (ak. Questions & correct queries) takes a lot of work.