Chat Completions output cutting off without hitting max_tokens limit

I’m having trouble generating long responses. My input is about 2,286 prompt_tokens, but when I ask the model to generate 10 examples, the output cuts off well before the 4,096 output limit (or the total limit), sometimes without even reaching 3,000 total_tokens. For example, after generating examples 1-5 correctly, it might just stop in the middle of example 6, with a finish_reason of “stop” instead of “length”.
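
For reference, this is roughly how I’m calling the API and checking the stop reason (simplified sketch; the model name and prompt string are just placeholders for my actual values, and I’m assuming the current openai Python package with the v1-style client):

from openai import OpenAI

client = OpenAI()

long_prompt = "...(roughly 2,286 tokens of instructions and context)..."  # placeholder

response = client.chat.completions.create(
    model="gpt-4",  # placeholder for the model I'm actually using
    messages=[{"role": "user", "content": long_prompt}],
    max_tokens=4096,
)

print(response.choices[0].finish_reason)   # comes back as "stop", not "length"
print(response.usage.total_tokens)         # often under 3,000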

Because the input is fairly long, I want to limit the number of calls I make, so ideally I’d rather not generate only 5 at a time, even if that is possible.

Is this intended behavior? Does anyone have any suggestions?

Welcome to the Forum!

It is not unusual for the model to return significantly fewer tokens than the defined output token limit of 4,096. Typically it’s difficult to get the model to consistently return more than 3,000 tokens. This is why it is often recommended to break the task up into smaller pieces.
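
If you do decide to split the request, here is a minimal sketch of what that could look like (assuming the openai Python package with the v1-style client; the model name and batch wording are just illustrative):

from openai import OpenAI

client = OpenAI()

task = "...(your instructions and context)..."  # placeholder for the long prompt

outputs = []
for batch in ("examples 1-5", "examples 6-10"):
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder
        messages=[{"role": "user", "content": f"{task}\n\nGenerate only {batch}."}],
        max_tokens=4096,
    )
    outputs.append(response.choices[0].message.content)

full_output = "\n".join(outputs)

The downside is that the prompt tokens are sent (and billed) once per call, but each call then only has to produce an amount of text the model will reliably return.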

Overall, when it comes to the length of the output, much depends on how you phrase your prompt. If you have not already done so, you could provide an example output as part of your prompt or be more explicit about the expected format. For example, a phrase like the following can help push the model to adhere to the requested number of examples:

Your output should be returned in the form of a numbered list as follows:

1. Example 1: <Description of example>
2. Example 2: <Description of example>
...
10. Example 10: <Description of example>

That said, if what you are asking of the model is highly complex, then even that might fail.
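
To make that concrete, here is a rough sketch of attaching such an instruction to the request (again assuming the openai Python package with the v1-style client; the model name and user content are placeholders):

from openai import OpenAI

client = OpenAI()

format_instruction = (
    "Your output should be returned in the form of a numbered list as follows:\n"
    "1. Example 1: <Description of example>\n"
    "2. Example 2: <Description of example>\n"
    "...\n"
    "10. Example 10: <Description of example>"
)

response = client.chat.completions.create(
    model="gpt-4",  # placeholder
    messages=[
        {"role": "system", "content": format_instruction},
        {"role": "user", "content": "...(your task and context)..."},  # placeholder
    ],
    max_tokens=4096,
)

print(response.choices[0].message.content)

You can then count how many numbered items actually came back and decide whether a follow-up call is needed.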
