Davinci not learning new patterns after fine-tuning, forgetting to answer questions

I’m trying to fine-tune davinci to write multiple choice questions following a certain structure, tone and style. But after fine-tuning on the training data, the model fails to consistently generate new multiple choice questions matching even the basic structure of the training samples: a problem statement and 5 response options.

The dataset is 400 samples, each of the form:

"prompt": "Write a multiple choice question about ... \n\n###\n\n", 
"completion": " <full problem statement> ###"
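(For anyone reproducing this: a small script like the sketch below, assuming the JSONL layout above, can check that every example follows the separator and stop-sequence conventions. The file path is a placeholder.)

```python
import json

# Conventions from the legacy fine-tuning guide: the prompt ends with a
# fixed separator, and the completion starts with a space and ends with
# a stop sequence. "###" is the separator used in the example above.
SEPARATOR = "\n\n###\n\n"
STOP = " ###"

def validate_example(example: dict) -> list[str]:
    """Return a list of problems found in one prompt/completion pair."""
    problems = []
    if not example["prompt"].endswith(SEPARATOR):
        problems.append("prompt missing trailing separator")
    if not example["completion"].startswith(" "):
        problems.append("completion should start with a space")
    if not example["completion"].endswith(STOP):
        problems.append("completion missing stop sequence")
    return problems

def validate_file(path: str) -> dict[int, list[str]]:
    """Map line number -> problems for every bad example in a JSONL file."""
    bad = {}
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            problems = validate_example(json.loads(line))
            if problems:
                bad[i] = problems
    return bad
```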

I fine-tuned for 2 epochs with no learning rate specified. I have also tried learning rate multipliers between 0.2 and 1, and all of the resulting models fail to write questions following the structure of the training data.

The model also seems to have catastrophically forgotten how to answer questions. For instance, if you ask text-davinci-003 “What is the capital of France?” you typically get the completion “Paris.” The fine-tuned models instead complete:

What is the capital of France?

What is the capital of France?

or something similarly repetitive.

I have experimented with different temperatures, max_tokens, and frequency/presence penalties with no improvement. Are there any best practices for fine-tuning the model so that it learns to complete following its training examples when prompted, without catastrophically forgetting other things it used to be good at? I have already read through the entirety of the documentation here. Thanks!


You may need to fine-tune longer—four or more epochs.

Also, 400 examples is low. You should have at least 500, but more (1000+) is usually better.
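One more thing worth double-checking: on the legacy fine-tunes endpoint, the parameter is a learning_rate_multiplier (a multiplier on the pretraining rate, with the docs suggesting roughly 0.02 to 0.2), not an absolute learning rate, so 0.2 to 1 may be on the high side. A rough sketch of the request parameters, with a hypothetical file ID:

```python
def fine_tune_params(training_file_id: str,
                     n_epochs: int = 4,
                     learning_rate_multiplier: float = 0.1) -> dict:
    """Build keyword arguments for the legacy fine-tunes endpoint.

    learning_rate_multiplier scales the pretraining learning rate; the
    docs suggest experimenting in the 0.02-0.2 range, so values between
    0.2 and 1.0 can destabilize training.
    """
    return {
        "training_file": training_file_id,
        "model": "davinci",
        "n_epochs": n_epochs,
        "learning_rate_multiplier": learning_rate_multiplier,
    }

# With the pre-1.0 openai-python library these kwargs would be passed
# as openai.FineTune.create(**fine_tune_params("file-abc123")), where
# "file-abc123" is a placeholder for your uploaded training file ID.
```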

Do all of your examples follow the same format precisely?

Do you have an appropriate separator between the prompt and response in every example?

Are you including that same separator in the prompt you are passing to the fine-tuned model?
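To make that last point concrete, here is a sketch of how the inference request should mirror the training format (the model name is a placeholder, and the payload shape follows the legacy completions endpoint):

```python
SEPARATOR = "\n\n###\n\n"

def completion_request(fine_tuned_model: str, instruction: str) -> dict:
    """Build a completions request that mirrors the training format:
    the prompt ends with the same separator used in training, and the
    completion's trailing " ###" is registered as a stop sequence so
    generation ends cleanly after one question."""
    return {
        "model": fine_tuned_model,
        "prompt": instruction + SEPARATOR,
        "stop": [" ###"],
        "max_tokens": 400,
        "temperature": 0.7,
    }

# Placeholder model name; yours will look like "davinci:ft-org-...".
req = completion_request(
    "davinci:ft-your-org-2023-01-01",
    "Write a multiple choice question about photosynthesis",
)
```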

Hi elmstedt! Yes, all the training completions have the same basic structure of a question statement followed by 5 response options, all prompts and completions in the dataset have been preprocessed with separators as in the example, and I do use that separator when asking the fine-tuned model for completions. I can try more epochs though!

elmstedt is right: 400 is not enough. Try more than 750.

This link might also be helpful: Fine-tuning a Classifier to Improve Truthfulness | OpenAI Help Center


Building off of this, once you’re able to get the fine-tuning to work mostly correctly, you can often use the fine-tuned model to generate additional training examples that may just need a little bit of tweaking to work correctly.

Great minds! I’m trying GPT4 to do data augmentation and it’s looking pretty good
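For what it's worth, the augmentation prompt can be sketched like this; the instruction wording and the two-message structure are made up for illustration, not taken from anyone's actual setup:

```python
def augmentation_messages(seed_question: str, topic: str) -> list[dict]:
    """Build a chat-format request asking GPT-4 to produce a new
    training sample in the same structure as an existing one."""
    return [
        {"role": "system",
         "content": "You write multiple choice questions: a problem "
                    "statement followed by exactly 5 response options."},
        {"role": "user",
         "content": f"Here is an example question:\n\n{seed_question}\n\n"
                    f"Write a new question about {topic} in the same "
                    "structure, tone, and style."},
    ]

# The returned list would be passed as the `messages` argument of a
# chat completions call; the reply still needs human review before it
# goes into the training set.
```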

If you're using synthetic GPT-4-generated samples, you may also want to read Aligning language models to follow instructions; what you are doing is supervised fine-tuning.


After checking the documentation again I noticed that the models currently available for fine-tuning are pre-InstructGPT base models. So they weren't catastrophically forgetting QA; they had never learned it in the first place! Thanks for all the help.