Fine-tuned models overfit to the fine-tuning data during text generation?

Overview

I recently fine-tuned a curie model on around 6,000 reviews of airlines and restaurants. I would like the model to generate reviews of other products and services (such as clothes and consumer electronics) to augment the training data for domain adaptation.

I put considerable effort into filtering the fine-tuning data, and I believe the examples are simple enough for the curie engine to comprehend. My fine-tuning data looks like the following:

[{'prompt': 'A positive review on seat\n\n###\n\n',
  'completion': ' This seat was fantastic. ###'},
 {'prompt': 'A positive review on service\n\n###\n\n',
  'completion': ' That service is adorable. ###'},
 {'prompt': 'A positive review on airline\n\n###\n\n',
  'completion': ' That airline was awesome. ###'}]
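For reference, here is a minimal sketch of how records in this shape could be built and serialized as JSONL, one record per line. The helper names `make_example` and `to_jsonl` are hypothetical, and the separator and end marker are simply copied from the examples above.

```python
import json

SEPARATOR = "\n\n###\n\n"  # prompt separator, copied from the examples above
STOP = " ###"              # end marker appended to every completion


def make_example(sentiment, topic, review):
    """Build one prompt/completion pair in the format shown above."""
    return {
        "prompt": f"A {sentiment} review on {topic}{SEPARATOR}",
        "completion": f" {review}{STOP}",
    }


def to_jsonl(examples):
    """Serialize the records as JSONL (one JSON object per line)."""
    return "\n".join(json.dumps(e) for e in examples)


examples = [
    make_example("positive", "seat", "This seat was fantastic."),
    make_example("positive", "service", "That service is adorable."),
    make_example("positive", "airline", "That airline was awesome."),
]
jsonl = to_jsonl(examples)
print(jsonl)
```

Keeping the separator and end marker in one place makes it easier to reuse exactly the same strings at generation time.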

Issues

When I try to generate reviews of other products, the topics stay in the same domain as the fine-tuning data. For example, I used the input prompt "A positive review on laptop." (explicitly asking the model for a review of a laptop).

To avoid repetition, I set both Temperature and Top P to 1 and Frequency penalty to 0.5. The model returned the following:

 That service was wonderful. ###

 That was an exciting pilot. ###

 We like that flight. ###

 I hate that pilot. ###

 I appreciate the customer service. ###
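A rough way to quantify this is to count how many generations mention the requested topic at all. The sketch below is only a keyword check, not a real topic classifier; the function name `topic_hit_rate` and the keyword lists are my own assumptions.

```python
def topic_hit_rate(outputs, keywords):
    """Fraction of outputs that mention at least one keyword (case-insensitive)."""
    if not outputs:
        return 0.0
    hits = sum(any(k.lower() in o.lower() for k in keywords) for o in outputs)
    return hits / len(outputs)


# The five generations listed above.
outputs = [
    " That service was wonderful. ###",
    " That was an exciting pilot. ###",
    " We like that flight. ###",
    " I hate that pilot. ###",
    " I appreciate the customer service. ###",
]

requested = topic_hit_rate(outputs, ["laptop"])                   # requested topic
airline = topic_hit_rate(outputs, ["service", "pilot", "flight"])  # fine-tuning domain
print(requested, airline)
```

On these five samples, none mention the laptop and all of them mention an airline-domain keyword.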

Questions

I suspect the model has overfit to the fine-tuning data and that there is catastrophic forgetting. I am not sure how to resolve this. One straightforward fix would be to fine-tune another model on more diverse data, but that is not really feasible in my case.

Any updates or learnings on this, @MrRobot? I'm new to using OpenAI's models and am concerned about overfitting. Hope you were able to figure it out!

It could be due to a low number of examples in the dataset. However, 6,000 seems like it should be enough.

Try lowering your frequency penalty to .03 or even 0…
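If it helps, here is a minimal sketch of the request parameters with the frequency penalty set to 0. It only builds a dictionary and does not call any API. The prompt quoted in the question ends with a period, while the training prompts end with the "\n\n###\n\n" separator; it isn't clear whether that is how the prompt was actually sent, so as an assumption this sketch reuses the training separator and stops on the " ###" end marker. The helper `build_request`, the placeholder model name, and the `max_tokens` value are all hypothetical.

```python
SEPARATOR = "\n\n###\n\n"  # same separator as in the fine-tuning prompts
STOP = " ###"              # same end marker as in the fine-tuning completions


def build_request(model, sentiment, topic, frequency_penalty=0.0):
    """Assemble completion parameters that mirror the training format."""
    return {
        "model": model,
        "prompt": f"A {sentiment} review on {topic}{SEPARATOR}",
        "temperature": 1.0,
        "top_p": 1.0,
        "frequency_penalty": frequency_penalty,
        "max_tokens": 30,  # assumed cap; the reviews above are one short sentence
        "stop": [STOP],    # stop once the end marker is generated
    }


# "your-fine-tuned-curie" is a placeholder, not a real model ID.
params = build_request("your-fine-tuned-curie", "positive", "laptop")
print(params)
```

Trying a few values of `frequency_penalty` with everything else fixed should show whether the penalty is what's affecting the output.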

It could also be that the Curie model isn't capable enough for what you're trying to do.

Hope this helps!

ETA: Back when I fine-tuned GPT-2, I found that 1.6 to 2 epochs worked best for avoiding overfitting in a lot of cases. That might be something else to look at, since you do have 6,000 examples.
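One generic way to choose the epoch count, assuming you hold out some validation examples and can read a validation loss after each epoch, is to pick the epoch where that loss is lowest. A minimal sketch follows; the loss values are made up purely for illustration and are not from any real run, and `best_epoch` is a hypothetical helper.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch whose validation loss is lowest."""
    if not val_losses:
        raise ValueError("need at least one validation loss")
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1


# Made-up per-epoch validation losses, for illustration only.
val_losses = [2.10, 1.85, 1.92, 2.05]
chosen = best_epoch(val_losses)
print(chosen)
```

If the validation loss starts rising after an early epoch while the training loss keeps falling, that's the usual sign of overfitting.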