GPT-3.5 Fine-Tuning Dataset Size

Was wondering if there is a limit on the fine-tuning dataset size. Can we potentially have a million or more examples for “messages” in our dataset to fine-tune GPT-3.5? Don’t worry, I’m aware of the costs associated with this; I just want to confirm.

Docs: Each file is currently limited to 50 MB.

I recently read “50,000 examples” as a maximum, but I can’t find it again, so it is perhaps not applicable.


I think you are referring to this:

Token limits
Each training example is limited to 4096 tokens. Examples longer than this will be truncated to the first 4096 tokens when training. To be sure that your entire training example fits in context, consider checking that the total token counts in the message contents are under 4,000. Each file is currently limited to 50 MB.

But I might be wrong.
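If it helps, here is a rough sketch for sanity-checking a JSONL training file against the two limits quoted above (50 MB per file, ~4,096 tokens per example). It uses only the standard library and a crude characters-per-token heuristic rather than a real tokenizer, so treat the token counts as ballpark estimates; the sample data at the bottom is just a stand-in for a real file.

```python
import json

# Limits from the fine-tuning docs quoted above.
MAX_FILE_BYTES = 50 * 1024 * 1024      # 50 MB per file
MAX_TOKENS_PER_EXAMPLE = 4096          # per training example

# Crude heuristic: roughly 4 characters per token for English text.
CHARS_PER_TOKEN = 4


def estimate_tokens(example: dict) -> int:
    """Roughly estimate the token count of one example's message contents."""
    text = "".join(m.get("content", "") for m in example.get("messages", []))
    return len(text) // CHARS_PER_TOKEN


def check_dataset(lines: list[str]) -> dict:
    """Summarize a JSONL fine-tuning dataset against the documented limits."""
    total_bytes = sum(len(line.encode("utf-8")) for line in lines)
    examples = [json.loads(line) for line in lines if line.strip()]
    over_limit = sum(
        1 for ex in examples if estimate_tokens(ex) > MAX_TOKENS_PER_EXAMPLE
    )
    return {
        "num_examples": len(examples),
        "file_bytes": total_bytes,
        "under_50mb": total_bytes <= MAX_FILE_BYTES,
        "examples_over_token_limit": over_limit,
    }


# Tiny in-memory sample standing in for the lines of a real .jsonl file.
sample = [
    json.dumps({"messages": [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi there!"},
    ]}),
]
report = check_dataset(sample)
print(report)
```

In practice you would also want a real tokenizer for the per-example check, since examples over the limit get silently truncated during training.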