Quality vs. Quantity for Fine-Tuning

I am in the midst of training a model and am wondering how the balance of quality versus quantity affects fine-tuning performance.

It takes a long time to gather high-quality fine-tuning examples, which effectively caps quantity if you want every example to be excellent.

However, it is easy to gather a lot of mediocre data.

Which would you say is more important?

To quote from the latest OpenAI fine-tuning guidance:

> In general, if you have to make a trade-off, a smaller amount of high-quality data is generally more effective than a larger amount of low-quality data.

They’ve recently expanded the guidance with more detailed considerations for data quality and quantity that you may find helpful for your use case.
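In practice, one way to lean toward quality is to curate an existing dataset rather than collect more of it. Below is a minimal sketch of that idea: it drops exact duplicates and low-effort completions from a chat-style example list. The field names and thresholds are illustrative assumptions, not OpenAI's recommendations, and real quality filtering would use much stronger signals (human review, model grading, etc.).

```python
# Hypothetical sketch: trim a fine-tuning dataset to its better examples
# using two crude heuristics — exact-duplicate removal and a minimum
# completion length. Thresholds and field names are assumptions.

examples = [
    {"prompt": "Summarize: ...", "completion": "A clear, specific summary."},
    {"prompt": "Summarize: ...", "completion": "A clear, specific summary."},  # exact duplicate
    {"prompt": "Summarize: ...", "completion": "ok"},  # low-effort completion
]

def keep(example, min_completion_chars=10):
    """Keep only examples whose completion is non-trivial in length."""
    return len(example["completion"].strip()) >= min_completion_chars

seen = set()
curated = []
for ex in examples:
    key = (ex["prompt"], ex["completion"])
    if key in seen:      # drop exact duplicates
        continue
    seen.add(key)
    if keep(ex):         # drop very short completions
        curated.append(ex)

print(f"kept {len(curated)} of {len(examples)} examples")
```

A smaller, curated set like this is usually a better starting point than the raw dump, per the guidance quoted above.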
