How to fine-tune text-to-SQL?

I’m trying to fine-tune a davinci (non-Codex) model to improve text-to-SQL, but am having trouble doing so. In the example (OpenAI API), the SQL being generated uses the schema defined at the beginning of the prompt. I want to ask the model to generate a SQL query against that schema, but fine-tune the model on the schema beforehand so I don’t have to pass the schema in the prompt every time. How do I do this via fine-tuning?


I might be misunderstanding you, but if not, I think you might be putting the cart before the horse. The key thing here is that you need to design a fine-tuning dataset that suits your intention. There are two fundamental parts: what is the INPUT and what is the OUTPUT? If you can give a few examples of the INPUT/OUTPUT pairs you want, I can comment further on how to achieve what you want.


So in the playground, if I put

### Postgres SQL tables, with their properties:
# company_employees_table(id, name, department_id)
###

give me all employees:

the SQL that the model gives back is
SELECT * FROM company_employees_table

which is great as it took into account the schema I provided in the prompt. But if I only have

### Postgres SQL tables, with their properties:
###

give me all employees:

The resulting SQL is
SELECT * FROM employees

Which I guess makes sense, since the model doesn’t have a schema to go off of. However, I don’t want to have to pass the schema in as part of the prompt every time I ask the model to complete a SQL query.

Can I fine-tune a model to learn/understand a schema, such that I can ask for SQL query completions against that schema without having to provide the schema on every request?

You could… but then you’d have a fine-tuned model that understands only a single schema. If you’re going to make the investment of time and energy, you would be better off designing something that can take input that includes the schema as well as what you want, and then generate the appropriate query. Basically, I would recommend aiming for a more universal query generator.

Yes, I understand that it would only be good for one specific schema. But that is exactly what I want. So far my fine-tuning efforts have not succeeded. Can you give me some suggestions on what the prompt and completion values would need to be to achieve this?

To your point, though: I do like that strategy, but wouldn’t it only work for schemas up to a certain size, since passing in the schema counts toward my token total on every completion request?

You would just need hundreds of examples of a natural-language query on the prompt side with the corresponding SQL query on the completion side. Here’s how I would format it:

Give me a SQL query that does X, Y, Z 
(natural language as long and complex as you need)

SQL QUERY:

And the completion:

SELECT * FROM * WHERE * [etc etc etc]
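Assuming that prompt/completion format, a training file can be assembled as plain JSONL with the standard library. The example pairs, the `SQL QUERY:` separator, and the filename `train.jsonl` are illustrative, not anything the API requires by that exact name:

```python
import json

# Hypothetical natural-language/SQL pairs in the format suggested above.
# "\n\nSQL QUERY:" marks the end of the prompt; the completion starts with
# a leading space and ends with a newline acting as a stop sequence.
examples = [
    ("Give me all employees",
     "SELECT * FROM company_employees_table;"),
    ("Give me the names of employees in department 7",
     "SELECT name FROM company_employees_table WHERE department_id = 7;"),
]

def to_record(question, sql):
    return {
        "prompt": f"{question}\n\nSQL QUERY:",
        "completion": f" {sql}\n",
    }

with open("train.jsonl", "w") as f:
    for q, sql in examples:
        f.write(json.dumps(to_record(q, sql)) + "\n")
```

The resulting file is one JSON object per line, which is the shape the fine-tuning endpoint expects.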

That’s really interesting. There’s no way to fine-tune the model for memorization?

For example, passing in multiple prompts that each ask about a table in the schema and return that table’s definition, such that the model would then recognize all the tables that belong to the schema. Then, when I prompt for a SQL query in that schema, the model would know which subset of tables I’m referring to?

That “memorization” would be implicit in the samples you give. They would need to be particular to your environment. However, it’s important to know that fine-tuning is not reliable in this respect. It is liable to continue confabulating unless you use many thousands of examples of query/SQL pairs that are specific to your environment.

You’d be better off doing prompt chains. First prompt/SQL would be to understand the environment, then pipe that into the next prompt, etc.

Is there a way to save a prompt chain and continue off at that point?

So in this example, I’d do a prompt chain to have the model understand my schema, save that point, and then always ask for a SQL query from that point in the future?

Edit:
@daveshapautomator I might be misunderstanding what you mean by “prompt chains”. Is a prompt chain simply appending the last response and an additional prompt to the previous prompt, and sending that as the prompt of a new completion request? If so, doesn’t this run into the same issue as sending the entire schema on every request (too much token usage)?

These links will do more to clarify prompt chaining than I can type quickly 🙂

Thanks for the info. From what I understand, prompt chaining allows the user to break a larger problem down into a series of smaller problems by using the output of a previous prompt request.

I don’t think it would help in my situation, as SQL query generation requires understanding the entirety of a database schema before a single SQL prompt can be asked.
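For what it’s worth, the mechanics described earlier in the thread can be sketched as a loop where each step’s output is folded into the next prompt. The `complete` function below is a stand-in for a real completion call, not an actual API method; it just echoes for illustration:

```python
# Minimal prompt-chain sketch. `complete` is a placeholder for a real
# model call (e.g. an API request) and simply echoes the last prompt line.
def complete(prompt):
    return f"[model output for: {prompt.splitlines()[-1]}]"

def chain(steps):
    """Run prompts in sequence, feeding each output into the next prompt."""
    context = ""
    for step in steps:
        prompt = (context + "\n" + step).strip()
        # Only the latest output is carried forward, which bounds token
        # growth compared with resending the full transcript each time.
        context = complete(prompt)
    return context

result = chain([
    "Describe the tables available in schema A.",          # step 1: recall schema
    "Using those tables, write SQL for: all employees.",   # step 2: generate query
])
```

Carrying only the previous step’s output forward, rather than the whole conversation, is one way to address the token-usage concern raised above.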

So if I take the fine-tuning approach and have JSONL objects like

{"prompt":"describe a table in schema A", "completion":"employees(id int, name string)"}

If I prompt for a SQL query like “give me a SQL query of all employees using schema A”, will the model understand to only use tables from schema A?

Here’s a simple text-to-sql generator I made: Words_to_sql - a Hugging Face Space by Gradio-Blocks

If you want to specialize it to a specific table with specific attributes, I would first get that data read in and then supply it to the model as context in a standardized way.
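One way to standardize that context, assuming the schema is kept in a simple dict on your side, is to rebuild the schema header from earlier in the thread on every request. The `SCHEMA` contents and `build_prompt` helper here are hypothetical:

```python
# Hypothetical schema store; the header format mirrors the playground
# prompt shown earlier in the thread.
SCHEMA = {
    "company_employees_table": ["id", "name", "department_id"],
}

def build_prompt(question, schema=SCHEMA):
    """Prepend the schema header to a natural-language question."""
    lines = ["### Postgres SQL tables, with their properties:"]
    for table, cols in schema.items():
        lines.append(f"# {table}({', '.join(cols)})")
    lines.append("###")
    lines.append("")
    lines.append(f"{question}:")
    return "\n".join(lines)

prompt = build_prompt("give me all employees")
```

This keeps the schema in one place, so every completion request sees it in the same format without you retyping it.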