Is it advisable to fine-tune with a large dataset?

I’m looking to fine-tune a GPT / LLM for a language editing task. I have a massive dataset of texts from before and after editing (millions of words / tokens). From my research, however, it seems like most fine-tuning guides recommend relatively small datasets, and some studies, like the “Less Is More” paper, argue for even smaller ones.

Is it general practice to fine-tune a model with millions of examples? Are there alternative approaches that would be a better fit?

I also tried to find out whether there’s a maximum dataset size for fine-tuning, but couldn’t find anything. Are there any limitations I need to be aware of if I’m trying to fine-tune with a dataset this large?

Since you already have the dataset, it may be worth just trying a subset (let’s say 10%?) and seeing how well the model adapts. You could even distill the full set down later. Run some checks on the data first: see what stands out and which examples actually move the results. Are there bad data points? Ambiguity?
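If it helps, here’s a minimal sketch of what that 10% pass could look like. It assumes your pairs live in a JSONL file with `before` and `after` fields (the filename and field names are just placeholders) and that you’re targeting a chat-style fine-tuning format; adjust the output shape to whatever your provider expects.

```python
import json
import random

# Placeholder paths/fields: one JSON object per line with "before" and "after" texts.
INPUT_PATH = "edits.jsonl"
OUTPUT_PATH = "edits_sample_10pct.jsonl"
SAMPLE_FRACTION = 0.10
random.seed(42)  # reproducible sample

kept, skipped = 0, 0
with open(INPUT_PATH, encoding="utf-8") as src, \
        open(OUTPUT_PATH, "w", encoding="utf-8") as dst:
    for line in src:
        record = json.loads(line)
        before, after = record["before"].strip(), record["after"].strip()

        # Basic sanity checks: drop empty pairs and no-op edits,
        # which add training cost without teaching the model anything.
        if not before or not after or before == after:
            skipped += 1
            continue

        # Keep roughly 10% of the remaining examples.
        if random.random() > SAMPLE_FRACTION:
            continue

        # Chat-style supervised fine-tuning example.
        dst.write(json.dumps({
            "messages": [
                {"role": "system", "content": "Edit the text for grammar, clarity, and style."},
                {"role": "user", "content": before},
                {"role": "assistant", "content": after},
            ]
        }, ensure_ascii=False) + "\n")
        kept += 1

print(f"kept {kept} examples, skipped {skipped} bad/no-op pairs")
```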

Starting small is nice because you can observe, test, and maneuver quickly, rather than throwing everything at the model in one hail mary.

That way you’ll build an intimate understanding of the process and get a feel for what kind of data the model is saying “feed me next!”
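If you want to make that “feed me next” signal concrete, one option is to score a held-out slice after each training round and bucket the misses, so you can see which kinds of examples the model still struggles with. A rough sketch, assuming the same JSONL format as above and a hypothetical `model_edit()` wrapper around your fine-tuned model:

```python
import json
from collections import Counter

def model_edit(text: str) -> str:
    # Hypothetical helper: call your fine-tuned model on a "before" text
    # and return its edited version (wire this up to your provider's API).
    raise NotImplementedError

def bucket(example: dict) -> str:
    """Rough label for what kind of edit this pair represents."""
    n = len(example["before"].split())
    return "short (<50 words)" if n < 50 else "long (>=50 words)"

misses = Counter()
with open("heldout.jsonl", encoding="utf-8") as f:  # held-out pairs, same format as training
    for line in f:
        ex = json.loads(line)
        prediction = model_edit(ex["before"])
        # Exact match is a crude proxy; swap in whatever quality metric you trust.
        if prediction.strip() != ex["after"].strip():
            misses[bucket(ex)] += 1

# The buckets with the most misses hint at which data to add in the next round.
for label, count in misses.most_common():
    print(f"{label}: {count} misses")
```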
