Has fine-tuning gotten worse since last year?

Hi everyone,

In August 2023 we ran some experiments using GPT-3.5-0613. We fine-tuned a model via the OpenAI Python API to perform a text-classification task, i.e. predicting a label given a sequence of text. We evaluated the model’s performance and captured the results in a research paper - the micro F1 score was ~0.8, meaning the LLM outperformed an existing baseline on this task.
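
For context, the fine-tuning step is essentially the following (a minimal sketch, shown with the current openai v1 SDK rather than the pre-1.0 client we originally used; the file name is a placeholder, not our actual dataset):

```python
from openai import OpenAI

client = OpenAI()

# Upload the chat-format JSONL training file ("train.jsonl" is a placeholder).
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# Start a fine-tuning job against the pinned base model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo-0613",
)
print(job.id)
```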

We recently attempted to replicate our results and have not been able to achieve the same performance. We used the exact same source code to fine-tune the same GPT model (GPT-3.5-0613) with the same parameters, on the same prompts and dataset, and the newly fine-tuned model performs significantly worse than the model we fine-tuned back in August 2023. The micro F1 score is now ~0.5, largely because many of the LLM’s responses are poorly formed. We cannot understand why this is happening - it is as if the fine-tuning process (or the underlying model) has gotten worse since we first ran our experiments.
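
To make concrete why poorly formed responses tank the score: any completion we cannot map back onto the label set is necessarily counted as wrong, which drags micro F1 straight down. A simplified version of our scoring step (the label set, fallback token, and toy data below are illustrative, not our actual classes):

```python
from sklearn.metrics import f1_score

# Illustrative label set and fallback token, not our actual classes.
LABELS = {"positive", "negative", "neutral"}
INVALID = "__invalid__"

def parse_label(completion: str) -> str:
    """Map a raw completion to a label; off-vocabulary output counts as wrong."""
    text = completion.strip().lower()
    return text if text in LABELS else INVALID

# Toy example: two of four responses are malformed, so micro F1 drops to 0.5.
gold = ["positive", "negative", "neutral", "positive"]
raw = ["positive", "Sure! The label is:", "neutral", "I think this text..."]
pred = [parse_label(r) for r in raw]
print(f1_score(gold, pred, average="micro"))  # 0.5
```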

Interestingly, if we load the exact model we fine-tuned back in August and run inference against the same test set, the results are the same as they were in August - a micro F1 score of ~0.8. As far as we can tell, this points to something having changed in OpenAI’s fine-tuning process rather than in our code, prompts, or data.
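
The replication check amounts to pointing the same inference code at the stored August model ID (sketch below; the model ID is a placeholder - real fine-tuned IDs look like `ft:gpt-3.5-turbo-0613:<org>::<suffix>` - and the prompts are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# Re-run inference with the original August fine-tune (hypothetical model ID).
resp = client.chat.completions.create(
    model="ft:gpt-3.5-turbo-0613:my-org::abc123",
    messages=[
        {"role": "system", "content": "Classify the text into one label."},
        {"role": "user", "content": "Example input text"},
    ],
    temperature=0,  # deterministic decoding for evaluation
)
print(resp.choices[0].message.content)
```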

After exhaustively going through our code and confirming that everything is exactly as it was back in August, we decided to post here in the hope that we aren’t the only ones who have observed a degradation in the quality of fine-tuned models since last year. If anyone has any ideas about what is going on, please let us know - it would be much appreciated. Has anyone else experienced this issue?