Fine-tuning sometimes fails

I have been successfully using fine-tuning for many months to solve a complex multi-class classification task, where I tune the LLM to output a classification code for a text input. (I was initially on Curie and migrated to GPT-3.5-Turbo after Curie was deprecated.)
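For context, a single training example for this kind of classification fine-tune looks roughly like the sketch below, using the chat-format JSONL that gpt-3.5-turbo fine-tuning expects. The system prompt wording and the label "C17" are made up for illustration; my real codes and prompts differ.

```python
import json

# One hypothetical training record in the chat fine-tuning JSONL format.
# The assistant message holds the classification code the model should emit.
record = {
    "messages": [
        {"role": "system", "content": "Classify the text into one of the predefined codes."},
        {"role": "user", "content": "Customer reports the invoice total is wrong."},
        {"role": "assistant", "content": "C17"},  # illustrative label
    ]
}

# Each record is serialized as one line of the training JSONL file.
line = json.dumps(record)
print(line)
```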

I carefully measure the accuracy of the resulting model, both against the training data and a hold-out validation set.
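The accuracy check itself is simple; a minimal sketch is below. Here `classify` is a stand-in for a call to the fine-tuned model (e.g. via the Chat Completions API) and is stubbed out so the scoring logic runs on its own; the example texts and codes are invented.

```python
def accuracy(examples, classify):
    """Fraction of (text, expected_code) pairs the model labels correctly."""
    correct = sum(1 for text, expected in examples if classify(text) == expected)
    return correct / len(examples)

# Stub model that always predicts "C17" -- replace with a real API call
# to the fine-tuned model in practice.
stub = lambda text: "C17"

holdout = [
    ("Customer reports the invoice total is wrong.", "C17"),
    ("User cannot log in to the portal.", "C03"),
]
print(accuracy(holdout, stub))  # 0.5 with this stub
```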

Although I have successfully fine-tuned many models, there have been two periods during which repeated runs produced degraded models, despite my using the same hyperparameters and largely similar training data. When this happens, classification accuracy on the training data drops from around 98% to 80%, and on the validation data from 91% to 75%.

My question: although this could somehow be my fault (grateful for any tips!), I'm wondering whether OpenAI makes changes to the fine-tuning engine behind the API that might explain these problems. If so, are such changes announced and documented anywhere? Consistent fine-tuning results are critical for my application.
