Although input context windows have grown to 128K tokens, models still seem limited to producing at most about 4K tokens of output.
A couple questions:
- Why the limitation? An LLM produces each output token from the prompt plus all of its previous output, so as long as there is room left in the input context window, it seems like it could keep being fed its own previous output until the 128K limit is reached. Is there some technical limitation, or have models just not been trained to stay coherent over longer responses?
- What have folks found are the most reliable methods for producing longer responses?
Previously, I have been handling this limitation by breaking my input into smaller chunks whenever I start to hit this limit. However, I currently have a task where this chunking approach won't work well because:
- The output will be higher quality if the model can "understand" all of the input as a whole rather than individual sections of it.
- The output is highly variable: some inputs might yield only a couple of items while others might yield hundreds.
I am thinking of a technique for having the LLM "continue what it was doing": accept the partial output (finish_reason != "stop"), fix it up by discarding the last item, which may have been truncated, and then prompt the model again, telling it what the previous output was and asking it to continue where it left off. I can't just have it continue directly from the raw text because I am using the json_object output mode, so I don't think it can resume from the middle of incomplete JSON.
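Here's a rough sketch of the continuation loop I have in mind, using the OpenAI Python SDK. The model name, system prompt, `{"items": [...]}` schema, and the `"},"` trimming heuristic are all just placeholders for illustration, not a finished implementation:

```python
import json

from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # placeholder; whatever model you're using
SYSTEM = 'Extract items from the document and return JSON: {"items": [...]}'


def salvage_items(truncated: str) -> list:
    """Drop the partial trailing item from a cut-off {"items": [...]} payload.

    Assumes flat item objects, so the last "}," marks the end of the last
    complete item; nested item objects would need a smarter repair.
    """
    end = truncated.rfind("},")
    if end == -1:
        return []
    repaired = truncated[: end + 1] + "]}"  # close the array and the object
    try:
        return json.loads(repaired)["items"]
    except (json.JSONDecodeError, KeyError):
        return []


def extract_all(document: str) -> list:
    items: list = []
    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": document},
    ]
    while True:
        resp = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            response_format={"type": "json_object"},
        )
        choice = resp.choices[0]
        if choice.finish_reason == "stop":
            items.extend(json.loads(choice.message.content)["items"])
            return items
        # Hit the output cap: keep the complete items, then re-prompt,
        # telling the model what it already produced and asking it to continue.
        items.extend(salvage_items(choice.message.content))
        messages = [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": document},
            {
                "role": "user",
                "content": (
                    "You already emitted these items (do not repeat them):\n"
                    + json.dumps(items)
                    + '\nContinue where you left off and return ONLY the '
                    'remaining items as {"items": [...]}.'
                ),
            },
        ]
```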
Alternatively, I may be able to break the operation into multiple smaller steps, where the model first generates global information about the items it is looking for and then finds them one at a time from smaller chunks, but that would be much more complicated than a general-purpose "continue what you were doing" technique.
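Roughly what I imagine that multi-step alternative would look like, reusing `client`, `MODEL`, `SYSTEM`, and the `json` import from the sketch above (the overview prompt and the pre-split `chunks` are made up for illustration, and each per-chunk call can of course still hit the output cap):

```python
def two_pass_extract(document: str, chunks: list[str]) -> list:
    """Pass 1: one whole-document call for global context.
    Pass 2: per-chunk extraction calls that carry that context along."""
    overview_resp = client.chat.completions.create(
        model=MODEL,
        messages=[
            {
                "role": "system",
                "content": (
                    "Describe what kinds of items appear in this document and "
                    "any global context needed to identify them. "
                    'Return JSON: {"overview": "..."}'
                ),
            },
            {"role": "user", "content": document},
        ],
        response_format={"type": "json_object"},
    )
    overview = json.loads(overview_resp.choices[0].message.content)["overview"]

    items = []
    for chunk in chunks:
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[
                {"role": "system", "content": SYSTEM},
                {
                    "role": "user",
                    "content": f"Global context:\n{overview}\n\nChunk:\n{chunk}",
                },
            ],
            response_format={"type": "json_object"},
        )
        items.extend(json.loads(resp.choices[0].message.content)["items"])
    return items
```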
Any tips or a better way to do this?