Fine-tuning with unbalanced data

Hi everyone,

When fine-tuning a model for classification with an unbalanced data set (say 10:1), is it useful to simply duplicate examples of the underrepresented class to balance the number of examples per class?

In other ML systems, I’d typically achieve this by weighting the training examples. However, that’s not an option here, and I’m not sure exactly what happens behind the scenes when fine-tuning a GPT-3 model. I’ve also noticed that the CLI data preparation tool flags duplicate examples.

Are there any GPT-3-specific reasons I can expect simple duplication to be a bad strategy?

I have limited data, so I don’t want to go the other way and discard examples from the more abundant class(es).

You should get good performance if you fine-tune with a 10:1 unbalanced dataset, as long as you have a sufficient number of examples. If needed, you can set the logit_bias parameter at inference time to boost the probability of the underrepresented class.
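
To illustrate the logit_bias idea, here is a rough sketch that runs locally and does not call the API. It assumes each class label is a single token. The token IDs (1001, 2002), the raw logits, and the helper names are made up for the example; in practice you would look up the real token IDs with the tokenizer for your model and pass the resulting mapping as the logit_bias request parameter, which takes token IDs mapped to values between -100 and 100.

```python
import math

def build_logit_bias(token_ids, bias):
    """Build a mapping of token ID (as a string) to bias value."""
    if not -100 <= bias <= 100:
        raise ValueError("bias must be between -100 and 100")
    return {str(token_id): bias for token_id in token_ids}

def apply_bias(logits, logit_bias):
    """Add the bias to the logit of each listed token; leave others unchanged."""
    return {tok: value + logit_bias.get(tok, 0.0) for tok, value in logits.items()}

def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    peak = max(logits.values())
    exps = {tok: math.exp(value - peak) for tok, value in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical token IDs for the two class labels (assumptions, not real IDs).
MAJORITY_TOKEN = "1001"
MINORITY_TOKEN = "2002"

# Made-up logits for one input where the model leans toward the majority class.
logits = {MAJORITY_TOKEN: 2.0, MINORITY_TOKEN: 1.0}

# Boost only the underrepresented class.
bias = build_logit_bias([2002], 2.0)

before = softmax(logits)
after = softmax(apply_bias(logits, bias))

print("minority probability before bias:", round(before[MINORITY_TOKEN], 3))
print("minority probability after bias: ", round(after[MINORITY_TOKEN], 3))
```

The bias is added to the logit before sampling, so a small positive value shifts borderline cases toward the minority class without retraining. A reasonable way to pick the value is to try a few on a held-out set and compare per-class precision and recall.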


Thanks, Boris! That’s helpful.