Fine-tuned models overfit to the fine-tuning data during text generation?

MrRobot · December 7, 2021, 3:42pm

Overview

I recently fine-tuned a curie model on around 6000 reviews of airlines and restaurants. I would like the model to generate reviews of other products and services (like clothes, consumer electronics, etc.) to augment the training data for domain adaptation.

I spent quite some effort on filtering the fine-tuning data, and I believe the examples are simple enough for the curie engine to comprehend. My fine-tuning data looks like the following:

[{'prompt': 'A positive review on seat\n\n###\n\n',
  'completion': ' This seat was fantastic. ###'},
 {'prompt': 'A positive review on service\n\n###\n\n',
  'completion': ' That service is adorable. ###'},
 {'prompt': 'A positive review on airline\n\n###\n\n',
  'completion': ' That airline was awesome. ###'}]
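For reference, here is a minimal sketch of how records in this shape could be built and serialized as JSONL (one JSON object per line). The helper names `make_example` and `to_jsonl` are hypothetical, not part of any library; the separator and end marker are copied from the examples above.

```python
import json

SEPARATOR = "\n\n###\n\n"  # prompt separator, as in the examples above
END_MARKER = " ###"        # completion end marker, as in the examples above


def make_example(sentiment, topic, review):
    """Build one prompt/completion pair (hypothetical helper)."""
    return {
        "prompt": f"A {sentiment} review on {topic}{SEPARATOR}",
        "completion": f" {review}{END_MARKER}",
    }


def to_jsonl(examples):
    """Serialize examples as JSONL text, one JSON object per line."""
    return "\n".join(json.dumps(e) for e in examples) + "\n"


examples = [
    make_example("positive", "seat", "This seat was fantastic."),
    make_example("positive", "service", "That service is adorable."),
    make_example("positive", "airline", "That airline was awesome."),
]
jsonl_text = to_jsonl(examples)
```

Building every record through one helper keeps the separator and end marker identical across all examples, which makes it easier to reuse the exact same format at generation time.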

Issues

When I try to generate reviews on other products, the topics stay in the same domain as the fine-tuning data. For example, for the input prompt "A positive review on laptop." (where I explicitly asked the model to generate reviews on laptops), the model returns the outputs listed below.

To avoid repetition, I set both Temperature and Top P to 1. I also set Frequency penalty to 0.5.

 That service was wonderful. ###

 That was an exciting pilot. ###

 We like that flight. ###

 I hate that pilot. ###

 I appreciate the customer service. ###
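One thing that may be worth checking, although the post does not say how the prompt was actually sent: the quoted prompt ends with a period, while the training prompts end with the `\n\n###\n\n` separator. Below is a minimal sketch that builds request parameters matching the training format and the sampling settings described above. The parameter names follow the legacy Completions API; the model ID is a placeholder, `max_tokens` is an assumed value, and `build_request` and `clean_completion` are hypothetical helpers. No API call is made here.

```python
SEPARATOR = "\n\n###\n\n"  # same separator as in the fine-tuning prompts
END_MARKER = " ###"        # same end marker as in the fine-tuning completions


def build_request(sentiment, topic, model="YOUR_FINE_TUNED_MODEL_ID"):
    """Build completion parameters in the training format (hypothetical helper)."""
    return {
        "model": model,  # placeholder, replace with the real fine-tuned model ID
        "prompt": f"A {sentiment} review on {topic}{SEPARATOR}",
        "temperature": 1,
        "top_p": 1,
        "frequency_penalty": 0.5,
        "max_tokens": 30,      # assumed value, not from the post
        "stop": [END_MARKER],  # stop generating at the end marker
    }


def clean_completion(text):
    """Strip a trailing end marker and surrounding whitespace from raw output."""
    text = text.strip()
    if text.endswith(END_MARKER.strip()):
        text = text[: -len(END_MARKER.strip())]
    return text.strip()


request = build_request("positive", "laptop")
```

Matching the prompt format rules out one possible cause, but it would not by itself make the model cover topics that never appeared in the fine-tuning data.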

Questions

I suspect the model overfits to the fine-tuning data and suffers from catastrophic forgetting, but I am not sure how to resolve this. One straightforward way would be to fine-tune another model on more diverse data, but that is not really feasible in my case.

glavin001 · December 23, 2022, 6:58pm

Any updates/learnings about this @MrRobot ? I’m new to using OpenAI’s models and am concerned about overfitting. Hope you were able to figure it out!

PaulBellow · December 23, 2022, 7:02pm

It could be due to a low number of examples in the dataset…however, 6,000 seems like it should be enough.

Try lowering your frequency penalty to .03 or even 0…

It could also be that the Curie model isn’t good enough for what you’re trying to do?

Hope this helps!

ETA: Back when I fine-tuned GPT-2, I realized that 1.6 to 2 epochs was the best to avoid over-fitting in a lot of cases… That might be something else to look at as you do have 6,000 examples…
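If you do experiment with the number of epochs, one common way to judge the result is to hold out part of the data and compare runs on it. A minimal sketch of such a split, assuming a 10% validation fraction (an arbitrary choice); `split_examples` is a hypothetical helper and the records below are placeholders, not real data:

```python
import random


def split_examples(examples, validation_fraction=0.1, seed=0):
    """Shuffle a copy of the examples and split it into (train, validation)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_val = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Placeholder records standing in for the ~6,000 real prompt/completion pairs.
examples = [{"prompt": f"prompt {i}", "completion": f" completion {i}"} for i in range(6000)]
train, validation = split_examples(examples)
```

If held-out loss starts rising while training loss keeps falling, that would be a sign that fewer epochs might work better.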
