I am using gpt-3.5-turbo-16k to process “long” texts, mainly for summaries.
Recently I realized that the answers I receive are cut off (or otherwise odd). I assume this is because the prompt and the response together cannot exceed 16k tokens, and in some cases the prompt almost completely consumes that limit.
What would be a good strategy to always reserve “enough” tokens for the answer? I am OK with reasonably shortening the original input text. So: should the reservation be a fixed number of tokens, or a certain percentage?
If you are using the max_tokens parameter in your API call, it sets a limit on the size of the response you can receive. The AI might want to write more, but it will be cut off. Set it too big, and you’ll get errors back about your input, because that many tokens are reserved out of the context length before anything is generated.
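Roughly what that constraint looks like, as a sketch with the 0.x openai Python library (the model name is real, but long_prompt, the key, and the numbers are placeholders):

```python
import openai

openai.api_key = "sk-..."  # your API key
long_prompt = "..."        # placeholder for a ~14,000-token input

# The ~16k-token context must hold the prompt *plus* max_tokens.
# With a ~14,000-token prompt, max_tokens=4000 is rejected before any
# generation happens, because 14,000 + 4,000 exceeds the context length.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-16k",
    messages=[{"role": "user", "content": long_prompt}],
    max_tokens=4000,  # hard cap on the reply, reserved out of the context up front
)
```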
You can remove the max_tokens specification completely: there will be no artificial output limit, and you won’t need any special calculations to let the answer use all of the context length remaining after your input.
Then it is simply up to you to not send too much.
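A minimal sketch of that approach, using tiktoken to measure the input and trim it so a fixed budget stays free for the reply (the 2,500-token reservation, the headroom, and the file name are assumptions to adjust):

```python
import openai
import tiktoken

openai.api_key = "sk-..."                   # your API key
enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by the gpt-3.5-turbo models

CONTEXT_LIMIT = 16385        # context length of gpt-3.5-turbo-16k
RESERVED_FOR_ANSWER = 2500   # assumed budget you want left for the summary
HEADROOM = 100               # rough allowance for the system message and chat formatting

def trim_to_budget(text: str) -> str:
    """Cut the input off at whatever token count leaves the reserved space free."""
    budget = CONTEXT_LIMIT - RESERVED_FOR_ANSWER - HEADROOM
    tokens = enc.encode(text)
    return enc.decode(tokens[:budget])

long_text = open("article.txt").read()  # placeholder input

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-16k",
    messages=[
        {"role": "system", "content": "You write detailed summaries."},
        {"role": "user", "content": trim_to_budget(long_text)},
    ],
    # no max_tokens: the reply can use whatever context the input leaves over
)
print(response["choices"][0]["message"]["content"])
```

Trimming on raw token boundaries like this can cut mid-sentence; for summaries it is usually nicer to drop whole paragraphs from the end until the count fits.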
Recent AI models are also trained even more strongly not to give long answers. It is as if OpenAI did tons of fine-tuning just for ChatGPT’s roughly 1,500-token output limit and didn’t alter that behavior for the API or the 16k models.
Thus it will be challenging to get compositions or rewrites that are long enough to take advantage of the output capacity of the large-context model. Tricks such as telling the AI to follow multiple individual instructions, or telling it that it has a 100,000 word output limit and a 20,000 word target, may help.
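As a rough illustration of that kind of prompting (the wording and numbers are only an example, not a guarantee of longer output):

```python
# Messages that combine both tricks: multiple individual instructions
# plus an inflated stated output limit and target.
messages = [
    {
        "role": "system",
        "content": (
            "You are a long-form writer. You have a 100,000 word output limit "
            "and a target length of roughly 20,000 words. Complete every "
            "numbered instruction below in full before stopping."
        ),
    },
    {
        "role": "user",
        "content": (
            "1. Summarize chapter one in exhaustive detail.\n"
            "2. Summarize chapter two in exhaustive detail.\n"
            "3. List and explain every recurring theme."
        ),
    },
]
```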