Summary
I have fine‑tuned the new GPT‑4.1 model to translate natural‑language questions into Cypher queries for a two‑node Neo4j schema. My dataset (98 examples) interleaves schema descriptions for each node with NL→Cypher pairs. I observed a final training loss of ~1.33, and I am seeking guidance on dataset size, schema‑instruction design, and best practices for uploading to and operating within the OpenAI platform using GPT‑4.1.
Use case
- Example: “Find all restaurants in Dubai with rating > 4.”
- System prompt enforces “Use only the provided properties; do not invent relationships or fields.”
Dataset format:
{
  "messages": [
    {
      "role": "system",
      "content": "Task: Generate Cypher queries to query a Neo4j graph database based on the schema and examples you have been finetuned on.\nInstructions:\nUse only the provided properties.\nDo not use any other relationship types or properties that are not provided.\nIf you cannot generate a Cypher statement based on the provided schema, explain the reason to the user.\n"
    },
    {
      "role": "user",
      "content": "Which restaurants had more than 100 orders in 'Apr 2025'?"
    },
    {
      "role": "assistant",
      "content": "MATCH (n:AccountManagerReport)\nWHERE n.am_month = 'Apr 2025' AND toFloat(n.order_nos_curr) > 100\nRETURN n.name AS restaurant, toFloat(n.order_nos_curr) AS order_count\n"
    }
  ]
}
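Before uploading, I run a quick structural check over the JSONL file to catch formatting problems (each example on one line, the expected role sequence, no smart quotes leaking into the Cypher). This is only a minimal sketch; the file name cypher_train.jsonl is a placeholder for my actual training file.

import json

TRAIN_PATH = "cypher_train.jsonl"  # placeholder path

with open(TRAIN_PATH, encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        # Every line must be a standalone JSON object.
        try:
            example = json.loads(line)
        except json.JSONDecodeError as e:
            raise SystemExit(f"Line {i}: invalid JSON ({e})")

        messages = example.get("messages")
        if not messages:
            raise SystemExit(f"Line {i}: missing 'messages' list")

        # Each example should contain a system prompt, a user question,
        # and the assistant's Cypher answer, in that order.
        roles = [m.get("role") for m in messages]
        if roles != ["system", "user", "assistant"]:
            raise SystemExit(f"Line {i}: unexpected role sequence {roles}")

        # Curly quotes sneak in easily when pasting from editors and would
        # be learned as part of the Cypher string literals.
        for m in messages:
            if any(ch in m.get("content", "") for ch in "“”‘’"):
                raise SystemExit(f"Line {i}: smart quotes found in content")

print("Training file looks structurally valid.")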
Fine‑tuning Configuration
{
  "model": "gpt-4.1",
  "training_file": "file-XXXXXXXX",
  "n_epochs": 10,
  "batch_size": 8,
  "learning_rate_multiplier": 0.1
}
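For the upload-and-train step I use the openai Python SDK. The sketch below assumes SDK v1.x, an OPENAI_API_KEY environment variable, and the placeholder file name from the validation snippet; the gpt-4.1-2025-04-14 snapshot name is what I pass today, but check the current model list in the docs.

import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the JSONL training file.
train_file = client.files.create(
    file=open("cypher_train.jsonl", "rb"),
    purpose="fine-tune",
)

# Create the fine-tuning job with the hyperparameters shown above.
job = client.fine_tuning.jobs.create(
    model="gpt-4.1-2025-04-14",  # snapshot name; plain "gpt-4.1" may also resolve
    training_file=train_file.id,
    hyperparameters={
        "n_epochs": 10,
        "batch_size": 8,
        "learning_rate_multiplier": 0.1,
    },
)

# Poll until the job finishes; the fine-tuned model name is then available.
while True:
    job = client.fine_tuning.jobs.retrieve(job.id)
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)

print(job.status, job.fine_tuned_model)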
The first four instances in the dataset contain the properties and a description of each node; the rest are NL→Cypher examples like the one above. Any suggestions, or experience you have had with a similar setup, would be much appreciated.