Questions about fine-tuning GPT-3.5-turbo

Hello esteemed community,
I am working on a project for a digital health company, and I need to choose between “clever prompting” and finetuning. While I have been able to reach quite good prompting quality, there is still some room for improvement, and I played around with finetuning.

I’ve created a dataset of roughly 1M tokens for training and another 1M for validation, from high quality datasets related to the business.

When I ran 10 epochs, the validation loss followed a typical “sweet spot” trend: consistent improvement over the first half of training, then degradation thereafter. So I figured that I had overshot the number of epochs and that about 5 would be optimal. I ran another training job on the same data, this time requesting only 5 epochs. Weirdly, the resulting validation loss graph followed a similar “sweet spot” trend, improving until roughly the halfway point (2.5 epochs) and degrading thereafter.

Did anyone experience a similar phenomenon?

This made me suspect that the learning rate is higher when fewer epochs are requested, but that doesn’t make much sense. Also, I am very confused by the plot of training and validation loss on the fine-tuning dashboard. What is the x axis? When I ran 10 epochs, it reached roughly 2000, and when I ran 5 epochs, it also reached roughly 2000. Does anyone know what units are measured there?

Finally, can I assume that the output model corresponds to the epoch with the lowest validation loss, or is it simply the model from the last epoch?

Any ideas?

Thanks :slight_smile:


OpenAI has taken away some of the other machine learning hyperparameters that were previously used, and we know that they adjust these based on the training data size. That you specify a different number of epochs and don’t see the change is unexpected.
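One possible reading of the identical x-axis range, offered only as a sketch: assuming the dashboard's x axis counts optimizer steps, and assuming batch size is among the hyperparameters chosen automatically, two runs with different epoch counts can end at the same step number. The example counts and batch sizes below are hypothetical and were picked only to show the arithmetic; they are not values reported by the API.

```python
import math


def total_steps(n_examples: int, n_epochs: int, batch_size: int) -> int:
    """Optimizer steps for a run: batches per epoch times number of epochs."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * n_epochs


# Hypothetical numbers, chosen only to illustrate the arithmetic.
print(total_steps(n_examples=4000, n_epochs=10, batch_size=20))  # 2000
print(total_steps(n_examples=4000, n_epochs=5, batch_size=10))   # 2000
```

If this is what is happening, the hyperparameters reported for each finished job should show different batch sizes for the two runs.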

Here is the exact code where I ask for and receive one epoch:

import os
import openai

# Read the API key from the environment
openai.api_key = os.environ["OPENAI_API_KEY"]

try:
    created = openai.FineTuningJob.create(
        training_file="file-abc123",  # placeholder: ID of your uploaded training file
        model="gpt-3.5-turbo",
        hyperparameters={"n_epochs": 1},
    )
except openai.error.APIError as err:
    # Handle API error, e.g. retry or log
    print(f"OpenAI API returned an API Error: {err}")
# more error handling omitted
except Exception as err:
    error_message = f"Error: {str(err)}"

Fine-tuning can now be continued, meaning you can train for one epoch and then specify your fine-tuned model as the base for a second job, producing a model equivalent to two epochs.
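A rough sketch of what that continuation could look like with the same legacy `openai` 0.x SDK. The helper names, the model ID, and the file ID in the usage note are hypothetical placeholders; the only real point is that the earlier fine-tuned model's name goes in the `model` field of the new job.

```python
def continuation_job_params(previous_model: str, training_file: str, n_epochs: int = 1) -> dict:
    """Build the arguments for a job that resumes from an existing fine-tuned model."""
    return {
        "model": previous_model,        # the fine-tuned model name, not the base model
        "training_file": training_file,
        "hyperparameters": {"n_epochs": n_epochs},
    }


def continue_fine_tune(previous_model: str, training_file: str, n_epochs: int = 1):
    """Submit the continuation job (needs network access and an API key)."""
    import openai  # legacy 0.x SDK; picks up OPENAI_API_KEY from the environment

    params = continuation_job_params(previous_model, training_file, n_epochs)
    return openai.FineTuningJob.create(**params)
```

Usage would be something like `continue_fine_tune("ft:gpt-3.5-turbo-0613:my-org::abc123", "file-abc123")`, with both IDs replaced by your own.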

Also, you don’t need to hold out half of your questions for validation; you’ll get better inference with more varied coverage from including more of those you have prepared. I would shuffle around the questions and just validate on 10-20%.
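The shuffle-and-hold-out step could be sketched like this. It assumes the examples are already loaded as a list (for instance, one parsed JSONL record per item); the function name and the 15% default are my own choices within the 10-20% range, not anything prescribed by the API.

```python
import random


def split_examples(examples, validation_fraction=0.15, seed=0):
    """Shuffle the examples and hold out a fraction of them for validation."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(examples)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_validation = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


train, validation = split_examples(list(range(100)), validation_fraction=0.15)
print(len(train), len(validation))  # 85 15
```

Each half would then be written back out as its own JSONL file and uploaded separately.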

The reported validation score only measures how well token sequences in the unseen data are predicted. You should also try out the model to judge the actual quality of its responses, both to the validation questions and to inputs outside the trained domain, using your application identity.
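To make "how well token sequences are predicted" concrete, here is a small sketch that assumes the reported loss is the usual mean per-token cross-entropy. OpenAI does not spell out the exact computation in this thread, so treat that as an assumption; the function name is hypothetical.

```python
import math


def mean_token_loss(token_probabilities):
    """Average negative log-likelihood the model assigned to the reference tokens."""
    if not token_probabilities:
        raise ValueError("need at least one token probability")
    return -sum(math.log(p) for p in token_probabilities) / len(token_probabilities)


# A model that gives every reference token probability 0.5 scores ln(2).
print(mean_token_loss([0.5, 0.5, 0.5]))
```

A low value only says the reference wording was predictable; it says nothing directly about whether a freely generated answer is useful, which is why trying the model by hand still matters.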
