Hi everyone,
When fine-tuning a model for classification with an unbalanced data set (say 10:1), is it useful to simply duplicate examples of the underrepresented class to balance the number of examples per class?
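Concretely, this is the kind of duplication I have in mind — a minimal sketch that oversamples the minority class in a JSONL fine-tuning file (the file names and the assumption that the label lives in the `completion` field are just placeholders for my setup):

```python
import json
import random

def oversample(path_in, path_out, label_key="completion"):
    """Duplicate minority-class examples until every class matches the largest one."""
    with open(path_in) as f:
        examples = [json.loads(line) for line in f]

    # Group examples by class label
    by_label = {}
    for ex in examples:
        by_label.setdefault(ex[label_key], []).append(ex)

    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Add extra copies (sampled with replacement) for underrepresented classes
        balanced.extend(random.choices(group, k=target - len(group)))

    random.shuffle(balanced)
    with open(path_out, "w") as f:
        for ex in balanced:
            f.write(json.dumps(ex) + "\n")
```

So for a 10:1 split, each minority example would appear roughly ten times in the output file.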
In other ML systems, I’d typically achieve this by weighting the training examples. However, that’s not an option here, and I’m not sure exactly what’s happening behind the scenes when fine-tuning a GPT-3 model. I’ve also noticed that the CLI data preparation tool flags duplicate examples.
Are there any GPT-3-specific reasons I can expect simple duplication to be a bad strategy?
I have limited data, so I don’t want to go the other way and discard examples from the more abundant class(es).