Token Limit vs Minimum number of examples in JSONL

Hi,
I have a JSONL file with 15 messages that I can use to train a model (gpt-3.5-turbo-1106).
However, the total number of tokens used for training is 31,000, and the model has a limit of 16k tokens.
If I break the JSONL file into two, the number of tokens is within the model's limit; however, the number of messages per JSONL file drops below 10, whereas the model expects a minimum of 10 messages to train.
I don't intend to alter the content of each message that I use for training, as that defeats the whole purpose of training.
Any suggestions?

Perhaps you misunderstand:

An individual line is a complete context session: a system message, user messages, and then an example of how the assistant should respond to that stimulus, all formed into one line of JSON. JSONL = JSON Lines.
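Purely as an illustration (the message contents and file name here are made up), one such line could be built in Python like this:

```python
import json

# One training example: a complete "conversation" that ends with the
# assistant reply you want the model to learn. Each example becomes
# exactly one line of the JSONL training file.
example = {
    "messages": [
        {"role": "system", "content": "You are a concise support assistant."},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Open Settings > Account > Reset password, then follow the email link."},
    ]
}

# Appending several such examples, one per line, builds a valid JSONL file.
with open("training_data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")
```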

The training context of a single example cannot exceed the model context length.

The number of example lines can and should be quite extensive. 10 is just a minimum imposed to prevent completely fruitless uses of resources.

Yes, your explanation is correct, but with 10 messages (the minimum) the token count comes to 17k, which won't be accepted by the system since the limit is 16k. With 9 messages it is 15k, but 9 messages won't be accepted either, as the minimum requirement is 10.

Token limits are the maximum for one example: one line, simulating a question or conversation plus the final assistant reply.

Token limits depend on the model you select. For gpt-3.5-turbo-0125, the maximum context length is 16,385, so each training example is also limited to 16,385 tokens. For gpt-3.5-turbo-0613, each training example is limited to 4,096 tokens. Examples longer than the limit will be truncated to the maximum context length, which removes tokens from the end of the training example. To be sure that your entire training example fits in context, consider checking that the total token count of the message contents is under the limit.
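As a rough way to check, you can count tokens per example with the tiktoken library. This is only a sketch: the per-message overhead added below is an approximation, and the file name is a placeholder.

```python
import json
import tiktoken

# Tokenizer used by gpt-3.5-turbo models.
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def approx_example_tokens(example: dict) -> int:
    """Rough token count for one training example (one JSONL line).

    Counts each message's content plus a few tokens of per-message
    formatting overhead; the exact overhead may differ slightly.
    """
    total = 0
    for message in example["messages"]:
        total += len(enc.encode(message["content"])) + 4  # assumed per-message overhead
    return total

with open("training_data.jsonl", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        tokens = approx_example_tokens(json.loads(line))
        if tokens > 16385:
            print(f"Example {line_no}: ~{tokens} tokens, will be truncated during training.")
```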

A file can have thousands of such examples. The maximum file upload size is 1 GB.


So you mean each message has a limit of 16,385 tokens, and not the entire file?

Not each “message” per se, but each set of messages in a total session that represents how an AI should respond after seeing all the previous chat.

You likely do not have a need to pass context anywhere near the maximum per example.

Most fine-tuning is not about making a chatbot, but about having an AI produce one output for one input, without needing a large system prompt (because the fine-tune training has already shaped the AI's behavior). Training on a long contextual chat is not needed.
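For example, a single-turn training example in that style might look like the following sketch (the task and wording are invented just to show the shape):

```python
import json

# A typical single-turn training example: one short input, one target output,
# and only a minimal system prompt, since the tuned behavior is meant to come
# from the training itself rather than from long instructions.
short_example = {
    "messages": [
        {"role": "system", "content": "Classify the sentiment."},
        {"role": "user", "content": "The delivery was late and the box arrived damaged."},
        {"role": "assistant", "content": "negative"},
    ]
}
print(json.dumps(short_example))
```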

The goal is to show the AI how it should respond differently from its “chat” pretraining when using gpt-3.5 models.

Developing a training set is a painstaking endeavor. It is not “I made my 10 chat examples, why doesn’t the AI talk about only cats”.

Thanks, this helps :). The context of our discussion helped me solve the problem.