Hey everyone. I’ve spent the past few weeks trying to get GPT to work for my use case: turning a natural language search query into a set of complex JSON search filters. The filter format is extremely specific and there are numerous edge cases that require special handling. Prompt engineering with 4o gets me a fairly decent working prototype, but each query is ~8,000 tokens long (primarily due to the extensive system prompt), which is unsustainable from a cost perspective.
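To give a concrete (and heavily simplified) picture, the mapping looks something like this. The field names here are placeholders; the real schema has far more fields and conditional rules:

```python
# Hypothetical example only -- the real filter schema is much larger and has many special cases.
query = "remote senior backend roles posted in the last week, excluding agencies"

filters = {
    "keywords": ["backend"],
    "seniority": ["senior"],
    "remote": True,
    "posted_within_days": 7,
    "exclude": {"company_type": ["agency"]},
}
```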
After reading the docs, it seemed like fine-tuning would be the ideal approach to get consistent results, reduce latency, and cut down on prompt length, saving costs. To see if fine-tuning would work, I generated ~50 training examples plus a validation set of another 10, as per the docs. Each example maps a search query to its search filters. The training data was generated programmatically with a Selenium script that emulates a user clicking through the search filters, and the validation data was created manually.
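In case it helps, this is roughly how I’m building the training file. It’s a simplified sketch (the schema fields are the placeholders from above, and `build_filters_for()` stands in for the Selenium step), but the JSONL chat format follows the fine-tuning docs:

```python
import json

# In practice this is the same long system prompt from my 4o setup, which is
# a big part of why even ~50 examples were expensive to train on.
SYSTEM_PROMPT = "Convert the user's search query into the JSON filter format."

def build_filters_for(query: str) -> dict:
    """Placeholder for the Selenium script that clicks through the UI and records the filters."""
    return {"keywords": ["backend"], "seniority": ["senior"], "remote": True}

queries = ["remote senior backend roles posted in the last week"]  # ~50 of these in the real set

with open("train.jsonl", "w") as f:
    for query in queries:
        example = {
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": query},
                {"role": "assistant", "content": json.dumps(build_filters_for(query))},
            ]
        }
        f.write(json.dumps(example) + "\n")
```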
Unfortunately, the fine-tuned model did not perform well at all and would not conform to the strict structure the search filters require. Looking at the model metrics graph, I noticed the validation curve seemed to be missing. When I dug into the raw metrics, I only had individual data points for the validation loss to compare against the training loss, so it’s difficult for me to determine what went wrong.
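For reference, this is roughly how I pulled the raw metrics. It’s a rough sketch assuming the v1 `openai` Python client and the usual result-file columns (`step`, `train_loss`, `valid_loss`); the job ID is a placeholder and the exact column names may differ:

```python
import csv
import io

from openai import OpenAI

client = OpenAI()

# Placeholder job ID.
job = client.fine_tuning.jobs.retrieve("ftjob-XXXXXXXX")
raw_csv = client.files.content(job.result_files[0]).text

rows = list(csv.DictReader(io.StringIO(raw_csv)))

train = [(int(r["step"]), float(r["train_loss"])) for r in rows if r.get("train_loss")]
# valid_loss only appears for a subset of steps, which is why I see scattered
# points instead of a curve.
valid = [(int(r["step"]), float(r["valid_loss"])) for r in rows if r.get("valid_loss")]

print(f"{len(train)} training-loss points, {len(valid)} validation-loss points")
```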
I’m wondering if anyone has ideas on how I can figure out what went wrong and improve my fine-tuned model in future iterations. Also, even this small dataset was already fairly costly to train on, since I included the full system prompt in every example (again, per the docs). If the problem is a lack of data, are there other ways I can make this more cost-effective?