Issues with GPT-3.5-Turbo-0125 Fine-Tuning Results

Hi everyone,

We have been fine-tuning GPT-3.5-Turbo-0125 for function calling. Up until June 10th, we achieved excellent results, with very low loss and high accuracy on our evaluation set. However, since June 13th, we have observed significantly worse performance (higher loss and reduced accuracy), even though we are using the same base model and dataset.

Initially, we suspected the issue might be on our end, but we have tried several approaches to troubleshoot:

  • Setting the seed to the one used in our previous successful runs.
  • Re-running with the exact same dataset and hyperparameters (see the sketch after this list).

Despite these efforts, the problem persists.
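
For context, this is roughly how we create the jobs. A minimal sketch using the openai Python SDK; the training file ID, seed, and hyperparameter values below are placeholders, not our actual config:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder values -- substitute the file ID, seed, and hyperparameters
# from your own earlier runs.
job = client.fine_tuning.jobs.create(
    model="gpt-3.5-turbo-0125",
    training_file="file-abc123",   # hypothetical uploaded JSONL file ID
    seed=42,                       # pin the seed from the earlier successful runs
    hyperparameters={
        "n_epochs": 3,             # assumed values for illustration
        "batch_size": 8,
        "learning_rate_multiplier": 2,
    },
)
print(job.id, job.status)
```

Even with the seed and hyperparameters pinned this way, the newer jobs come out noticeably worse than the pre-June-10th ones.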

Additionally, we attempted fine-tuning GPT-3.5-Turbo-1106 and GPT-4:

  • GPT-3.5-Turbo-1106 exhibited similarly poor results to GPT-3.5-Turbo-0125.
  • GPT-4 showed better evaluation scores but a high loss, which might be attributable to the superior performance of the base model rather than to the fine-tuning itself.