GPT-4o Context Length Issue: Input Tokens Within Limit but Request Exceeds Maximum

I’m encountering an issue with GPT-4o where my requests exceed the maximum context length. The error message states: “This model's maximum context length is 128000 tokens. However, your messages resulted in 249114 tokens.” When I check my input with Tiktoken, it only shows 73,878 tokens, which should leave ample space for the output. I’ve also set the max output token limit to less than 4,000. What could be causing this discrepancy? Unfortunately, I can’t share the code and data as they are private. Any insights would be greatly appreciated!

Input token calculation covers every part placed into context, i.e. everything the API call sends. That can include system messages, tool and function specifications, a bit of extra injection by OpenAI (such as vision prohibitions), and response format schemas.
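To see what all of that adds up to, here is a rough counting sketch with tiktoken. It is only an estimate: the per-message overhead constant and the idea of serializing tool specs with json.dumps are my own approximations, not OpenAI’s exact accounting.

```python
import json
import tiktoken

try:
    enc = tiktoken.encoding_for_model("gpt-4o")
except KeyError:
    enc = tiktoken.get_encoding("o200k_base")  # the encoding gpt-4o uses

def count_request_tokens(messages, tools=None, per_message_overhead=4):
    """Estimate the prompt tokens a Chat Completions request will consume.

    per_message_overhead is an assumed constant for the role/formatting
    tokens wrapped around each message; OpenAI does not publish the exact
    figure for gpt-4o.
    """
    total = 0
    for msg in messages:
        total += per_message_overhead
        content = msg.get("content") or ""
        if isinstance(content, str):
            total += len(enc.encode(content))
        else:
            # content may be a list of typed parts (text, image_url, ...)
            for part in content:
                if part.get("type") == "text":
                    total += len(enc.encode(part["text"]))
                # image parts are billed by size/detail, not by encoding a URL
    if tools:
        # tool/function specs also occupy context; serializing them is a rough proxy
        total += len(enc.encode(json.dumps(tools)))
    return total
```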

Then, of course, there is any additional chat history beyond the most recent message, an amount that might be managed by your software. If you are using “Assistants”, the past chat of a thread should be intelligently truncated for you so you don’t get this kind of error.
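If you manage the history yourself with Chat Completions, even a crude retention policy prevents unbounded growth. A minimal sketch; keep_recent_history and the choice of ten turns are arbitrary names and numbers of mine, not anything from the API:

```python
def keep_recent_history(messages, max_turns=10):
    """Keep any system messages plus only the most recent user/assistant turns.

    A simple count-based policy; ten turns is an arbitrary choice, not an API limit.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns:]
```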

You might also look at images. One common “oopsie” is sending the image file’s base64 data in a text section, which results in massive token consumption instead of around 1,000 tokens per image.
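For reference, the way to pass an image to Chat Completions is as an image_url content part, with the base64 data wrapped in a data URI, not as text. A short sketch, assuming a local photo.jpg:

```python
import base64

with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

# Wrong: dumping base64 into a text part gets tokenized as text and can
# burn hundreds of thousands of tokens.
#   {"type": "text", "text": b64}

# Right: an image_url part with a data URI is billed as image tiles,
# roughly on the order of 1,000 tokens at high detail.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
    ],
}
```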

I see, thank you for your reply! Do you think there is a way in the API to report how many tokens were in the input and in the reply? Like in GPT-4/3.5, where, if we reach the token limit, the error message tells us the input and reply token counts rather than just the total message tokens?

With Chat Completions, you are in charge of sending everything on the “input” side, which is the primary concern here. No output was ever generated, because too much input context was sent. You got what you asked for: the error said “max 128000, you sent 249114”.
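What the API can tell you is the usage breakdown on a successful call: the Chat Completions response carries a usage object with prompt and completion token counts. A minimal sketch with the openai Python SDK, assuming your API key is set in the environment:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
)

# usage is only present on a successful response; an over-length request
# is rejected before any output exists, so all you get is the error text.
print(response.usage.prompt_tokens)      # tokens in everything you sent
print(response.usage.completion_tokens)  # tokens in the reply
print(response.usage.total_tokens)
```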

You send a list of messages. It isn’t one particular message that caused the error, unless one is indeed malformed (like the image example, or trying to “upload” some other file); it is the total. It is up to you to expire, prioritize, or obsolete the information that cannot fit into the model context before you send it, by doing a proper calculation of all messages and other consumption.
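A sketch of that kind of bookkeeping, reusing the count_request_tokens estimate from earlier in the thread; the 120,000 budget is an assumption of mine to leave headroom for the reply and for overhead the estimate cannot see:

```python
def fit_to_budget(messages, tools=None, budget=120_000):
    """Drop the oldest non-system messages until the estimated prompt fits.

    budget is an assumed target below the 128k limit, leaving headroom for
    the reply and for overhead the estimate cannot see.
    """
    trimmed = list(messages)
    while count_request_tokens(trimmed, tools) > budget:
        for i, m in enumerate(trimmed):
            if m["role"] != "system":
                del trimmed[i]  # expire the oldest droppable message
                break
        else:
            break  # only system messages left; nothing more to drop
    return trimmed
```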