Although input context windows have grown to 128K tokens, models still seem limited to producing at most about 4K tokens of output.
A couple questions:
- Why the limitation? An LLM produces each output token from the prompt plus all of its previous output, so as long as there is room left in the input context window, it seems like it could keep being fed its own previous output until the 128K limit is reached. Is there some technical limitation, or have models just not been trained to stay coherent over longer responses?
- What have folks found are the most reliable methods for producing longer responses?
Previously, I have been handling this limitation by breaking my input into smaller chunks whenever I start to hit this limit. However, I currently have a task where this chunking approach won't work well because:
- The output will be higher quality if the model can "understand" all of the input as a whole rather than individual sections of it.
- The output is highly variable: some inputs might yield only a couple of items while others might yield hundreds.
I am thinking of a technique for having the LLM "continue what it was doing": accept the partial output (finish_reason != "stop"), fix it up by discarding the last item, which may have been truncated, and then prompt the model again, telling it what the previous output was and asking it to continue where it left off. I can't just have it continue directly from the raw text because I am using the json_object output mode, so I don't think it can resume from the middle of incomplete JSON.
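Here's a rough sketch of the continuation loop I have in mind, using the OpenAI Python SDK. The model name, system prompt, `{"items": [...]}` schema, and the `"},"` trimming heuristic are all just placeholders for illustration, not a finished implementation:

```python
import json

from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # placeholder; whatever model you're using
SYSTEM = 'Extract items from the document and return JSON: {"items": [...]}'


def salvage_items(truncated: str) -> list:
    """Drop the partial trailing item from a cut-off {"items": [...]} payload.

    Assumes flat item objects, so the last "}," marks the end of the last
    complete item; nested item objects would need a smarter repair.
    """
    end = truncated.rfind("},")
    if end == -1:
        return []
    repaired = truncated[: end + 1] + "]}"  # close the array and the object
    try:
        return json.loads(repaired)["items"]
    except (json.JSONDecodeError, KeyError):
        return []


def extract_all(document: str) -> list:
    items: list = []
    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": document},
    ]
    while True:
        resp = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            response_format={"type": "json_object"},
        )
        choice = resp.choices[0]
        if choice.finish_reason == "stop":
            items.extend(json.loads(choice.message.content)["items"])
            return items
        # Hit the output cap: keep the complete items, then re-prompt,
        # telling the model what it already produced and asking it to continue.
        items.extend(salvage_items(choice.message.content))
        messages = [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": document},
            {
                "role": "user",
                "content": (
                    "You already emitted these items (do not repeat them):\n"
                    + json.dumps(items)
                    + '\nContinue where you left off and return ONLY the '
                    'remaining items as {"items": [...]}.'
                ),
            },
        ]
```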
Alternatively, I may be able to break the operation into multiple smaller steps, where the model first generates global information about the items it is looking for and then finds them one at a time from smaller chunks, but that would be much more complicated than a general-purpose "continue what you were doing" technique.
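Roughly what I imagine that multi-step alternative would look like, reusing `client`, `MODEL`, `SYSTEM`, and the `json` import from the sketch above (the overview prompt and the pre-split `chunks` are made up for illustration, and each per-chunk call can of course still hit the output cap):

```python
def two_pass_extract(document: str, chunks: list[str]) -> list:
    """Pass 1: one whole-document call for global context.
    Pass 2: per-chunk extraction calls that carry that context along."""
    overview_resp = client.chat.completions.create(
        model=MODEL,
        messages=[
            {
                "role": "system",
                "content": (
                    "Describe what kinds of items appear in this document and "
                    "any global context needed to identify them. "
                    'Return JSON: {"overview": "..."}'
                ),
            },
            {"role": "user", "content": document},
        ],
        response_format={"type": "json_object"},
    )
    overview = json.loads(overview_resp.choices[0].message.content)["overview"]

    items = []
    for chunk in chunks:
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[
                {"role": "system", "content": SYSTEM},
                {
                    "role": "user",
                    "content": f"Global context:\n{overview}\n\nChunk:\n{chunk}",
                },
            ],
            response_format={"type": "json_object"},
        )
        items.extend(json.loads(resp.choices[0].message.content)["items"])
    return items
```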
Any tips or a better way to do this?