Fine‑tuning GPT‑4.1 for Text→Cypher on a Neo4j Schema

Summary

I have fine‑tuned the new GPT‑4.1 model to translate natural‑language questions into Cypher queries for a two‑node Neo4j schema. My dataset (98 examples) interleaves schema descriptions for each node with NL→Cypher pairs. I observed a final training loss of ~1.33, and I am seeking guidance on dataset size, schema‑instruction design, and best practices for uploading to and operating within the OpenAI platform using GPT‑4.1.

Use case

  • Example: “Find all restaurants in Dubai with rating > 4.”
  • System prompt enforces “Use only the provided properties; do not invent relationships or fields.”

Dataset format:
{"messages": [
  {"role": "system", "content": "Task: Generate Cypher queries to query a Neo4j graph database based on the schema and examples you have been finetuned on.\n Instructions:\n Use only the provided properties.\n Do not use any other relationship types or properties that are not provided.\n If you cannot generate a Cypher statement based on the provided schema, explain the reason to the user.\n "},
  {"role": "user", "content": "Which restaurants had more than 100 orders in 'Apr 2025'?"},
  {"role": "assistant", "content": "MATCH (n:AccountManagerReport)\nWHERE n.am_month = 'Apr 2025' AND toFloat(n.order_nos_curr) > 100\nRETURN n.name AS restaurant, toFloat(n.order_nos_curr) AS order_count\n"}
]}
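One gotcha worth checking before upload: curly "smart" quotes (like the ones a word processor inserts) make a line invalid JSON, and the file API will reject it. A small validation sketch, assuming the training file is named as shown (the path and the expected role order are illustrative):

```python
import json

# The role sequence each training record is expected to contain
REQUIRED_ROLES = ["system", "user", "assistant"]

def validate_jsonl(path):
    """Return a list of (line_number, problem) tuples for a JSONL training file."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                # Smart quotes or truncated lines end up here
                problems.append((lineno, f"invalid JSON: {e}"))
                continue
            roles = [m.get("role") for m in record.get("messages", [])]
            if roles != REQUIRED_ROLES:
                problems.append((lineno, f"unexpected roles: {roles}"))
    return problems
```

Running this over the dataset before uploading catches both malformed lines and records missing one of the three messages.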

Fine‑tuning Configuration
{
  "model": "gpt-4.1",
  "training_file": "file-XXXXXXXX",
  "hyperparameters": {
    "n_epochs": 10,
    "batch_size": 8,
    "learning_rate_multiplier": 0.1
  }
}

The first four instances contain the properties and a description of each node; the rest are all examples like the one above. Could you share any suggestions or experiences you have had with this?

Is your concern just that the fine-tuned model is underperforming? How does its accuracy compare to the base models?

Something important here is that models aren't "aware" of the information they're fine-tuned on. They don't read or recall it the same way as your input messages. So, if things like n.am_month are specific to your database, you'll almost certainly get better results placing this kind of information in your system prompt instead of attempting to fine-tune it into the model.

If base models are failing to write these queries with perfectly good context, only then would I personally consider any finetuning. Good luck!
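The advice above can be sketched concretely: render the schema into the system prompt so the model sees it at inference time rather than relying on fine-tuned weights. The schema dictionary below is hypothetical, with property names borrowed from the examples in this thread:

```python
# Hypothetical schema; labels and properties are illustrative,
# loosely based on the examples earlier in this thread.
SCHEMA = {
    "AccountManagerReport": ["name", "am_month", "order_nos_curr"],
    "Restaurant": ["name", "city", "rating"],
}

def build_system_prompt(schema):
    """Render node labels and properties into a system prompt for text-to-Cypher."""
    lines = [
        "Task: Generate Cypher queries for a Neo4j database using ONLY this schema.",
        "Node labels and properties:",
    ]
    for label, props in schema.items():
        lines.append(f"  (:{label}) {{{', '.join(props)}}}")
    lines.append("Do not use any relationship types or properties that are not listed.")
    lines.append("If you cannot answer from this schema, explain why to the user.")
    return "\n".join(lines)
```

With this approach, a schema change means regenerating one string instead of rebuilding a training set and rerunning a fine-tune.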


Thank you so much for taking the time to reply!
You are right, actually. I am now trying the system-prompt approach first, and it has been working really well. My company and I were looking for something we could truly own, so we tried fine-tuning to see how it would work, but honestly the results have not been good so far.