Hey everyone. I’ve spent the past few weeks trying to get GPT to work for my use case: turning a natural language search query into a set of complex JSON search filters. The filter format is extremely specific and there are numerous edge cases that require special handling. Prompt engineering with 4o gets me a fairly decent working prototype, but each query is ~8,000 tokens long (primarily due to the extensive system prompt), which is unsustainable from a cost perspective.
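To give a concrete (and heavily simplified) picture, the mapping looks something like this. The field names here are placeholders; the real schema has far more fields and conditional rules:

```python
# Hypothetical example only -- the real filter schema is much larger and has many special cases.
query = "remote senior backend roles posted in the last week, excluding agencies"

filters = {
    "keywords": ["backend"],
    "seniority": ["senior"],
    "remote": True,
    "posted_within_days": 7,
    "exclude": {"company_type": ["agency"]},
}
```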
After reading the docs, it seemed like fine-tuning would be the ideal approach to get consistent results, reduce latency, and cut down on prompt length, saving costs. To see if fine-tuning would work, I generated ~50 training examples plus a validation set of another 10, as per the docs. Each example maps a search query to its search filters. The training data was generated programmatically with a Selenium script that emulates a user clicking through the search filters, and the validation data was created manually.
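In case it helps, this is roughly how I’m building the training file. It’s a simplified sketch (the schema fields are the placeholders from above, and `build_filters_for()` stands in for the Selenium step), but the JSONL chat format follows the fine-tuning docs:

```python
import json

# In practice this is the same long system prompt from my 4o setup, which is
# a big part of why even ~50 examples were expensive to train on.
SYSTEM_PROMPT = "Convert the user's search query into the JSON filter format."

def build_filters_for(query: str) -> dict:
    """Placeholder for the Selenium script that clicks through the UI and records the filters."""
    return {"keywords": ["backend"], "seniority": ["senior"], "remote": True}

queries = ["remote senior backend roles posted in the last week"]  # ~50 of these in the real set

with open("train.jsonl", "w") as f:
    for query in queries:
        example = {
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": query},
                {"role": "assistant", "content": json.dumps(build_filters_for(query))},
            ]
        }
        f.write(json.dumps(example) + "\n")
```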
Unfortunately, the fine-tuned model did not perform well at all and would not conform to the strict structure the search filters require. Looking at the model metrics graph, I noticed the validation curve seemed to be missing. When I dug into the raw metrics, I only had individual data points for the validation loss to compare against the training loss, so it’s difficult for me to determine what went wrong.
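For reference, this is roughly how I pulled the raw metrics. It’s a rough sketch assuming the v1 `openai` Python client and the usual result-file columns (`step`, `train_loss`, `valid_loss`); the job ID is a placeholder and the exact column names may differ:

```python
import csv
import io

from openai import OpenAI

client = OpenAI()

# Placeholder job ID.
job = client.fine_tuning.jobs.retrieve("ftjob-XXXXXXXX")
raw_csv = client.files.content(job.result_files[0]).text

rows = list(csv.DictReader(io.StringIO(raw_csv)))

train = [(int(r["step"]), float(r["train_loss"])) for r in rows if r.get("train_loss")]
# valid_loss only appears for a subset of steps, which is why I see scattered
# points instead of a curve.
valid = [(int(r["step"]), float(r["valid_loss"])) for r in rows if r.get("valid_loss")]

print(f"{len(train)} training-loss points, {len(valid)} validation-loss points")
```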
I’m wondering if anyone has ideas on how I can figure out what went wrong and improve my fine-tuned model in future iterations. Also, even this small dataset was already fairly costly to train on, since I included the full system prompt in every example (again, per the docs). If the problem is a lack of data, are there other ways I can make this more cost-effective?