I was wondering about the max token limit of the different models that are fine-tunable. Here is what I found in the documentation: "Token limits depend on the model you select. For gpt-3.5-turbo-0125, the maximum context length is 16,385, so each training example is also limited to 16,385 tokens. For gpt-3.5-turbo-0613, each training example is limited to 4,096 tokens." My examples are each around 23,000 tokens. I was wondering: 1) What is the max token limit on the other models? 2) Are there any tricks or tips anyone can suggest to make the fine-tuning work without shortening the examples too much?
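For reference, here is a minimal sketch of how one might count tokens per example with tiktoken (assuming the chat-format JSONL used for fine-tuning and the cl100k_base encoding used by gpt-3.5-turbo; the file name is a placeholder, and exact counts add a few tokens of per-message overhead this ignores):

```python
import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Count the approximate tokens in each training example.
with open("training_data.jsonl") as f:  # placeholder file name
    for i, line in enumerate(f):
        example = json.loads(line)
        n_tokens = sum(len(enc.encode(m["content"])) for m in example["messages"])
        print(f"example {i}: ~{n_tokens} tokens")
```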
Hi!
Unless someone corrects me, I believe that for the latest versions of babbage-002 and davinci-002, the max would also be 16,384 tokens. For fine-tuning gpt-4, if you do have or receive access, it would presumably be 8,192 (both limits could, and should, be stated more clearly in the fine-tuning documentation).
So under either scenario, you’d have to shorten your examples to fit those limits.
There's no real workaround: if the examples are too long they will get truncated, which risks negatively impacting your fine-tuning results.
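If it helps, a quick pre-flight check along these lines can flag oversize examples before upload so nothing gets silently truncated (a sketch only; 16,385 is the documented per-example limit for gpt-3.5-turbo-0125, and the file names are placeholders):

```python
import json
import tiktoken

MAX_TOKENS = 16_385  # documented per-example limit for gpt-3.5-turbo-0125
enc = tiktoken.get_encoding("cl100k_base")

kept, dropped = [], 0
with open("training_data.jsonl") as f:  # placeholder file name
    for line in f:
        example = json.loads(line)
        n = sum(len(enc.encode(m["content"])) for m in example["messages"])
        if n <= MAX_TOKENS:
            kept.append(line)
        else:
            dropped += 1

# Write only the examples that fit within the limit.
with open("training_data.filtered.jsonl", "w") as f:
    f.writelines(kept)

print(f"kept {len(kept)} examples, dropped {dropped} oversize ones")
```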
Thank you for the info.
I see. I was hoping someone knew of a fine-tunable model with a max token limit above 16k. Seems like that's it.
It’s on my wishlist, too
I know. It's unfortunate that the fine-tuning limits are smaller than the context you get when prompting the same model. You would think they could make it possible to fine-tune with larger token counts.
Question regarding this same topic of max token limits. Say I have this structure in each example in my training data: "Would you do x given Y? To do x, you should identify and rate A, B, C, D, E, and F in Y." Now, to get around the token limitation, if I break each of A, B, C, D, … into an individual example, would the model learn beyond the individual examples and understand that it needs to find all of them in one large Y? I'm afraid it won't get from the pieces to the whole if I train it on the pieces. Any suggestions? Ideas? Insights? Thank you.
I personally think it would not pick up the pattern of identifying multiple of A, B, C, D, … at once if the training only involves individual examples.
This is common behaviour in other training sets too. If a data set is dominated by too many similar examples, without balancing them against other kinds, the model will likely just replicate the pattern that dominates.
Rather than excluding material, do you have any way to shorten each example? For instance, you could split the long context into token-budgeted chunks, as in the sketch below.
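This is only a sketch of that idea: the budget, the prompt template, and the answer_for labeling function are illustrative assumptions, not anything from the docs, and whether per-chunk examples teach the whole-Y behaviour is exactly the open question above.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
BUDGET = 15_000  # headroom under the 16,385-token limit; adjust to taste

def chunk_by_tokens(text: str, budget: int) -> list[str]:
    """Split text into pieces of at most `budget` tokens each."""
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + budget]) for i in range(0, len(tokens), budget)]

def make_examples(y: str, answer_for) -> list[dict]:
    """Build one chat-format training example per chunk of Y.

    `answer_for` is a hypothetical function supplying the A, B, C, …
    labels for each chunk; you would plug in your own labeling here.
    """
    return [
        {
            "messages": [
                {"role": "user", "content": f"Would you do x given this part of Y?\n\n{part}"},
                {"role": "assistant", "content": answer_for(part)},
            ]
        }
        for part in chunk_by_tokens(y, BUDGET)
    ]
```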