We have a web application which calls gpt-4-1106-preview with stream: true in the backend. For instance,
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const input = {
  model: "gpt-4-1106-preview",
  messages: [{ role: "user", content: "I have a long text to show you..." }],
  stream: true,
};
const stream = await openai.chat.completions.create(input);
We often receive error messages like: "400 This model's maximum context length is 4097 tokens. However, your messages resulted in 16727 tokens. Please reduce the length of the messages."
But shouldn't gpt-4-1106-preview accept a 128k context length? Does anyone know how we can increase the maximum context length and avoid the 400 error?
If you are indeed specifying the correct AI model, this may be a case of a misleading error message triggered by passing exactly the wrong value for a parameter.
I would look first at the max_tokens value you are using. That parameter is the response length reservation, in tokens. It does NOT tell the AI its own context window.
A good maximum is about 1500 tokens, which is roughly the most you will get out of the model unless you are doing specific data-processing tasks.
The maximum output this AI model can be set to is 4k (4096 tokens).
You can send VERY large (and correspondingly expensive) inputs to the model with no problem.
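As a rough illustration of that budget (the figures below are my own assumptions based on the published 128k context window and the 4096-token output cap, not values returned by the API):

// Rough token budget for gpt-4-1106-preview (illustrative figures only)
const contextWindow = 128000; // total tokens the model can attend to in one request
const outputCap = 4096;       // the most completion tokens this model can produce
const maxTokens = 1500;       // what we choose to reserve for the response

// The reservation comes out of the space left for input; it never raises the context window.
const reservation = Math.min(maxTokens, outputCap);
const availableForInput = contextWindow - reservation; // 126,500 tokens of room for messages
console.log(`Room left for input messages: ${availableForInput} tokens`);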
The issue here was a misunderstanding of two points:
the max_tokens setting only limits the size of the response; it does not correspond to the total context length of the model you want to use, and it relates to what you send only in that it is subtracted from the available space (a way to measure what you actually send is sketched after this list);
the gpt-4-turbo models have an artificial limit of 4k maximum output, despite a large context window that might make you think they could produce longer answers.
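If you want to check how much you are sending before the API rejects it, you can count tokens locally. Below is a minimal sketch; the tiktoken npm package and the gpt-4 encoding it selects are my assumptions, and the count ignores the few formatting tokens the chat format adds per message.

import { encoding_for_model } from "tiktoken"; // assumed tokenizer package

// Approximate the prompt size of a messages array before calling the API.
function countMessageTokens(messages) {
  const enc = encoding_for_model("gpt-4"); // same tokenizer family as gpt-4-1106-preview
  let total = 0;
  for (const m of messages) {
    total += enc.encode(m.content).length;
  }
  enc.free(); // release the WASM-backed encoder
  return total;
}

const messages = [{ role: "user", content: "I have a long text to show you..." }];
console.log(`Approximate prompt tokens: ${countMessageTokens(messages)}`);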
Solution:
Ask for a reasonable max_tokens such as 2000 - that limits billing if the model produces runaway output.
Send up to about 126,000 tokens of input - if you are willing to pay for it - and hope the AI can pay attention to all of it at once (a sketch of such a request follows).
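Put together, a minimal sketch of such a request, assuming the same openai Node SDK the question uses (the 2000-token reservation is just the reasonable cap suggested above):

import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const stream = await openai.chat.completions.create({
  model: "gpt-4-1106-preview",
  messages: [{ role: "user", content: "I have a long text to show you..." }],
  max_tokens: 2000, // reserve only what the reply needs; the rest of the 128k is for input
  stream: true,
});

// Print the streamed reply as the chunks arrive.
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}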