I’m looking to fine-tune a GPT / LLM for a language editing task. I have a massive dataset of texts before and after editing (millions of words / tokens). From my research, however, it seems like most fine-tuning guides recommend relatively small datasets, and some studies, such as the “Less Is More” paper, even argue that a small set of carefully curated examples is sufficient.
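For context, my data is essentially pairs of (original text, edited text). I was picturing converting it into the kind of JSONL instruction format many fine-tuning pipelines expect; here is a minimal sketch of what I mean (the chat-style schema, system prompt, and file name are just illustrative assumptions, not something from a specific provider's docs):

```python
import json

# Hypothetical sample: parallel (before, after) editing pairs.
pairs = [
    ("Their going to the store tomorow.", "They're going to the store tomorrow."),
    ("The results was significant.", "The results were significant."),
]

# Write one JSONL record per editing example, in a chat-style
# prompt/response format that instruction fine-tuning pipelines
# commonly consume.
with open("editing_finetune.jsonl", "w", encoding="utf-8") as f:
    for original, edited in pairs:
        record = {
            "messages": [
                {"role": "system", "content": "Edit the text for grammar and style."},
                {"role": "user", "content": original},
                {"role": "assistant", "content": edited},
            ]
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

My real dataset would produce millions of records like these, which is what prompts the questions below.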
Is it common practice to fine-tune a model on millions of examples? Are there alternative methods that would be better suited?
I also tried researching the maximum dataset size for fine-tuning, but couldn’t find one documented anywhere. Are there any limitations I need to be aware of when fine-tuning with a dataset this large?