How to fine-tune text-to-SQL?

I’m trying to fine-tune a davinci (non-Codex) model to improve text-to-SQL, but am having trouble doing so. In the example (OpenAI API), the SQL being generated uses the schema defined at the beginning of the prompt. I want to ask the model to generate a SQL query against a known schema, but fine-tune the model on the schema beforehand so I don’t have to pass the schema in the prompt. How do I do this via fine-tuning?

I might be misunderstanding you, but if not, I think you might be putting the cart before the horse. The key thing here is that you need to design a fine-tuning dataset that suits your intention. There are two fundamental parts here: what is the INPUT and what is the OUTPUT? If you can give a few examples of the INPUT/OUTPUT pairs you want, I can comment further on how to achieve what you want.

So in the playground, if I put

### Postgres SQL tables, with their properties:
# company_employees_table(id, name, department_id)
###

give me all employees:

the SQL that the model gives back is
SELECT * FROM company_employees_table

which is great as it took into account the schema I provided in the prompt. But if I only have

### Postgres SQL tables, with their properties:
###

give me all employees:

The resulting SQL is
SELECT * FROM employees

Which I guess makes sense, since the model doesn’t have a schema to go off of. However, I don’t want to have to pass the schema in as part of the prompt every time I ask the model to complete a SQL query.

Can I fine-tune a model to learn/understand a schema such that I can ask for SQL query completions against that schema without having to provide the schema on every request?

You could… but then you’d have a fine-tuned model that understands only a single schema. If you’re going to make the investment of time/energy, you would be better off designing something that can take input that includes the schema, as well as what you want, and then generate the appropriate query. Basically, I would recommend aiming for a more universal query generator.
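
The "universal" approach can be sketched in a few lines of Python: serialize the schema plus the natural-language request into one prompt, so a single model handles any database. The helper name and the schema dictionary below are my own illustrative assumptions; the `###` context format mirrors the playground example in this thread.

```python
def build_prompt(tables: dict[str, list[str]], request: str) -> str:
    """Serialize a schema and a natural-language request into one prompt.

    `tables` maps table names to column lists, e.g.
    {"company_employees_table": ["id", "name", "department_id"]}.
    """
    lines = ["### Postgres SQL tables, with their properties:"]
    for name, columns in tables.items():
        lines.append(f"# {name}({', '.join(columns)})")
    lines.append("###")
    lines.append(f"{request}:")
    return "\n".join(lines)

schema = {"company_employees_table": ["id", "name", "department_id"]}
print(build_prompt(schema, "give me all employees"))
```

The resulting string is what you would send as the prompt, regardless of which database the schema came from.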

Yes, I understand that it would only be good for one specific schema. But that is exactly what I want. So far my fine-tuning efforts have not succeeded. Can you give me some suggestions on what the prompt and completion values would need to be to achieve this?

To your point though - I do like that strategy, but wouldn’t it only work for schemas up to a certain size, since passing in the schema counts toward my token total on every completion request?

You would just need hundreds of examples of a natural-language query on the prompt side with the desired SQL query on the completion side. Here’s how I would format it:

Give me a SQL query that does X, Y, Z 
(natural language as long and complex as you need)

SQL QUERY:

And the completion:

SELECT * FROM * WHERE * [etc etc etc]
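
Pairs in that format can be turned into a JSONL training file with a short script. A sketch, assuming the usual fine-tuning conventions (a fixed suffix like "SQL QUERY:" marking the end of the prompt, and a leading space plus a newline stop token on the completion); the example pairs are hypothetical:

```python
import json

# Hypothetical example pairs; you would need hundreds of these.
pairs = [
    ("Give me all employees", "SELECT * FROM company_employees_table"),
    ("Count employees per department",
     "SELECT department_id, COUNT(*) FROM company_employees_table GROUP BY department_id"),
]

with open("train.jsonl", "w") as f:
    for question, sql in pairs:
        record = {
            "prompt": f"{question}\n\nSQL QUERY:",  # fixed suffix marks end of prompt
            "completion": f" {sql}\n",              # leading space, newline as stop sequence
        }
        f.write(json.dumps(record) + "\n")
```

The resulting file is what you would upload for fine-tuning.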

That’s really interesting. There’s no way to fine-tune the model for memorization?

For example, passing in multiple prompts that ask for a table in the schema and return a table in that schema. Such that the model would then recognize all the tables that belong in the schema. So when I prompt for a SQL query in that schema, the model would know what subset of tables I’m referring to?

That “memorization” would be implicit in the samples you give. They would need to be particular to your environment. However, it’s important to know that fine-tuning is not reliable in this respect. It is liable to continue confabulating unless you use many thousands of examples of query/SQL pairs that are specific to your environment.

You’d be better off doing prompt chains. First prompt/SQL would be to understand the environment, then pipe that into the next prompt, etc.
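
A prompt chain in this sense just feeds one completion’s output into the next prompt. A toy sketch of the wiring, with the model call abstracted behind a callable (`llm` is a stand-in for whatever completion API you use, not a real library function):

```python
def chain(llm, schema_question: str, sql_request: str) -> str:
    """Two-step chain: first ask about the environment, then pipe the
    answer into a second prompt that asks for the SQL itself."""
    schema_summary = llm(f"Describe the relevant tables for: {schema_question}")
    return llm(f"Schema:\n{schema_summary}\n\nWrite a SQL query: {sql_request}\nSQL QUERY:")

# Usage with a fake model, just to show the data flow:
fake = lambda prompt: "employees(id, name)" if "Describe" in prompt else "SELECT * FROM employees"
print(chain(fake, "employee data", "give me all employees"))  # prints: SELECT * FROM employees
```

Swapping `fake` for a real completion call gives the chain described above.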

Is there a way to save a prompt chain and continue from that point later?

So in this example, I’d do a prompt chain to have the model understand my schema, save that point, and then always ask for a SQL query from that point in the future?

Edit:
@daveshapautomator I might be misunderstanding what you mean by “prompt chains”. Is a prompt chain simply appending the last response and an additional prompt to the last prompt and sending that as the prompt of a new completion request? If so, doesn’t this run into the same issue as sending in the entire schema on every request (too much token usage)?

These links will do more to clarify prompt chaining than I can type quickly 🙂

Thanks for the info - from what I understand, prompt chaining allows the user to break a larger problem down into a series of smaller problems by using the output of a previous prompt request.

I don’t think it would help in my situation, as SQL query generation requires understanding the entire database schema before a single SQL prompt can be asked.

So if I take the fine-tuning approach and have JSONL objects like

{"prompt":"describe a table in schema A", "completion":"employees(id int, name string)"}

If I prompt for a SQL query like “give me a SQL query of all employees using schema A”, will the model understand to only use tables from schema A?
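
To make the memorization idea concrete, the training file would need many paraphrased prompts per table. A sketch with a hypothetical "schema A" (as noted earlier in the thread, fine-tuning is unreliable for this kind of recall unless you have very large numbers of examples):

```python
import json

# Hypothetical schema for "schema A"; real data would cover every table
# with many paraphrased prompts each.
tables = {
    "employees": "employees(id int, name string)",
    "departments": "departments(id int, title string)",
}

templates = [
    "describe the {name} table in schema A",
    "what columns does {name} have in schema A?",
]

with open("schema_a_memorization.jsonl", "w") as f:
    for name, ddl in tables.items():
        for t in templates:
            record = {"prompt": t.format(name=name), "completion": f" {ddl}"}
            f.write(json.dumps(record) + "\n")
```

Even so, whether the model then restricts itself to schema A’s tables at query time is exactly the reliability question raised above.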

Here’s a simple text-to-SQL generator I made: Words_to_sql - a Hugging Face Space by Gradio-Blocks

If you want to specialize it to a specific table with specific attributes, I would first get that data read in and then supply it to the model as context in a standardized way.
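
For example, reading the header row of a CSV and rendering it in a standardized context block could look like this (a sketch; the helper name is mine, and the `###` format follows the playground example earlier in the thread):

```python
import csv
import io

def csv_to_context(table_name: str, csv_text: str) -> str:
    """Read a CSV header row and render it as a schema-context block."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return (
        "### Postgres SQL tables, with their properties:\n"
        f"# {table_name}({', '.join(header)})\n"
        "###"
    )

sample = "id,name,department_id\n1,Alice,10\n"
print(csv_to_context("company_employees_table", sample))
```

The returned block would then be prepended to each natural-language request before sending it to the model.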

This is great. I assume you did not train this with any dataset.

Also, could you elaborate on specialization? Say I have several tables, each containing 10 columns. How would I feed this data into GPT fine-tuning? The objective is the same as your application, but database-specific and without specifying the coded table names.

Hi @mert1, were you able to fine-tune GPT with a specific database schema? I have the same use case. Thanks.

Hello @curranjanssens, I find the application that also reads an Excel or CSV file (“Natural Language to SQL”) extremely interesting. It is not clear to me what format the table must have; where can I find an example?

Hi @daveshapautomator,

Even if I have a large dataset of prompt/query pairs and fine-tune a model, would it still make sense to include some specification and rules in the prompt to safeguard the SQL output? Potentially still passing the schema, or parts of it?

Thank you,
Daniel

It is completely possible to use a model to create safe database queries. You would need to be comfortable with fine-tuning, or you may get away with a generic model. You’d probably also want something to validate that the output is a safe, valid SQL query.

I have seen some very promising open-source text-to-SQL models. I wish I could find them again; if I do, I’ll post them here.
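
As a naive illustration of that validation step, here is a deliberately strict allow-list check (a sketch, not a substitute for real SQL parsing or parameterized database access; the forbidden-keyword list is my own assumption):

```python
import re

# Write/DDL keywords a read-only query generator should never emit.
FORBIDDEN = {"insert", "update", "delete", "drop", "alter",
             "create", "truncate", "grant"}

def is_safe_select(sql: str) -> bool:
    """Accept only a single SELECT statement with no write/DDL keywords."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:  # reject multi-statement input
        return False
    if not stripped.lower().startswith("select"):
        return False
    words = set(re.findall(r"[a-z_]+", stripped.lower()))
    return not (words & FORBIDDEN)

print(is_safe_select("SELECT * FROM employees"))          # True
print(is_safe_select("DROP TABLE employees"))             # False
print(is_safe_select("SELECT 1; DELETE FROM employees"))  # False
```

A real deployment would also execute queries with a read-only database role, so validation is defense in depth rather than the only safeguard.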

Thank you @RonaldGRuckus for the input. I will check those out.

Hello sir, I saw you mentioned a universal text-to-SQL generator. I am trying to build one, but I wanted to understand: should I also pass the schema in from the user? I need some clarification on how to proceed with this task.

Hi, could anyone please help me? I am also trying to fine-tune my OpenAI model, and I have passed in my database schema. I want to make the AI interactive.

For example: if the user enters the prompt “create or add xyz as a company”, the AI assistant should check whether xyz has already been created. If it has, it should ask the user: “This company already exists, do you still want to create it?”

If yes, the new company is created; otherwise not.

In the above case, how should I fine-tune my model so it gives correct SQL queries?

Note: I have to handle more cases like this.

Thanks.