Seemingly unable to reach the context limit in my API request

I was trying to see how my app would behave when I exceeded the context limit. I tried this:

    import OpenAI from 'openai';

    const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

    // longText is a long string defined elsewhere.
    const initialMessageContent = `Some text "${longText}".
      Also please consider this "${longText}".
      Also please consider this "${longText}".
      Also please consider this "${longText}".
      Also please consider this "${longText}".
      Also please consider this "${longText}".
      Also please consider this "${longText}".
      Also please consider this "${longText}".
      Also please consider this "${longText}".
      Thank you.`;

    console.log('Initial message length', initialMessageContent.length);

    const response = await openai.chat.completions.create({
      model: 'gpt-4o',
      stream: true,
      messages: [
        {
          role: 'user',
          content: initialMessageContent,
        },
      ],
    });

I got the console log “Initial message length 308153”, but no error; the request just seemed to be processed normally. I would have expected to get an error.

It can become a bit costly to experiment with this :sweat_smile:, so I don’t want to dive deeper. Does anyone have any thoughts on why I am not getting an error?

I can’t speak to why you didn’t get an error, but when I need to restrict tokens I have leveraged the max_tokens parameter for this purpose: https://platform.openai.com/docs/api-reference/chat/create#chat-create-max_tokens
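For example, something along these lines (just a sketch with the Node SDK; the prompt is only a placeholder):

    // Cap the model's output at 50 tokens; the prompt itself is not limited by this.
    const capped = await openai.chat.completions.create({
      model: 'gpt-4o',
      max_tokens: 50,
      messages: [{ role: 'user', content: 'Write a long story about a lighthouse.' }],
    });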

Thanks. Just to confirm my understanding:

max_tokens only limits the output, right?

If I set max_tokens: 10 and my request prompt has 1,000 tokens, the API will still process the full prompt, right?

You can use a Python library like tiktoken to count the tokens yourself, trust that OpenAI’s documented context window size is accurate, and enforce the limit on your end. Or do what I did: take a word count, multiply it by a reasonable average number of tokens per word, and enforce the limit approximately by counting words or characters rather than being totally precise. There’s no real point in trying to force OpenAI to throw an error when you know how to stay within the limits, imo.
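A rough sketch of that kind of client-side guard (the ~4 characters per token figure is only a rule-of-thumb assumption, and 128k is gpt-4o’s documented context window; swap in a real tokenizer such as tiktoken if you need precision):

    // Rough estimate: ~4 characters per token on average (approximation only).
    const MODEL_CONTEXT_LIMIT = 128000; // gpt-4o context window, in tokens

    function estimateTokens(text) {
      return Math.ceil(text.length / 4);
    }

    function assertWithinContext(prompt, maxOutputTokens) {
      const estimated = estimateTokens(prompt) + maxOutputTokens;
      if (estimated > MODEL_CONTEXT_LIMIT) {
        throw new Error(
          `Prompt too large: ~${estimated} tokens estimated, limit is ${MODEL_CONTEXT_LIMIT}`
        );
      }
    }

    // Enforce the limit yourself, before ever calling the API.
    assertWithinContext(initialMessageContent, 1000);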

Your understanding is correct. The max_tokens parameter only applies to output tokens. It works as a hard cut-off: the output is truncated at the specified number of tokens, regardless of whether the response is complete at that point.
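For example (a sketch): with max_tokens: 10 the reply will usually be cut off, and the Chat Completions API reports this by setting finish_reason to 'length':

    const response = await openai.chat.completions.create({
      model: 'gpt-4o',
      max_tokens: 10,
      messages: [{ role: 'user', content: 'Explain tokenization in detail.' }],
    });

    console.log(response.choices[0].finish_reason);    // 'length' when truncated
    console.log(response.choices[0].message.content);  // only the first ~10 tokens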

As for input tokens, there is no equivalent parameter, and I’d echo @wclayf’s recommended approach.

It’s interesting, though, that it doesn’t throw an error in your case. I have not seen this behaviour with other models before.

Hi!
When the context window is exceeded, the first tokens are dropped; only the last tokens that fit into the window remain in context.

You can test this by asking for specific information from the beginning of the context and then increasing the number of input tokens until the model can no longer see or derive what those initial tokens were.
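For instance, something like this (a sketch; the marker and filler are just illustrative, and you would grow the filler step by step):

    // Put a marker at the very start, pad with filler, then ask for the marker back.
    // If the beginning of the prompt no longer fits in the context window,
    // the model will not be able to return the correct code.
    const marker = 'ZEBRA-7431';
    const filler = 'lorem ipsum '.repeat(100000); // increase gradually per test run

    const probe = await openai.chat.completions.create({
      model: 'gpt-4o',
      messages: [
        {
          role: 'user',
          content: `The secret code is ${marker}. ${filler} What was the secret code?`,
        },
      ],
    });

    console.log(probe.choices[0].message.content);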

As @wclayf correctly pointed out, there is not really a use for this that can’t be handled in a better way.
Or is there something else you are looking for besides a warning from the API?

Thanks for the clarification. I understand this can be handled by the developer/user of the AI.

But maybe a warning would be useful to indicate that the request prompt was in fact cut off / effectively “modified” before a response was generated; otherwise this could lead to unexpected behaviour.

I thought about this argument: why not just give the developer a warning that the context length has been exceeded?

However, the behavior that follows is clearly defined (‘drop the first tokens, process whatever is inside the context window’) and not seemingly random in the sense in which ‘undefined behavior’ is often used.

Handling this is a very basic requirement for developers working with current-generation LLMs. In my opinion, it is something you must be aware of when building with these costly tools.

Sorry for the late reply @rene-ka

I see others have jumped in and gotten you up to speed. Token counting can be tricky, as has been noted, and you may want to build a custom solution, but max_tokens can at least serve as a stopgap.

I guess whether a warning would be useful is a different debate. I wrote “unexpected behaviour”… but that is indeed a separate discussion. Thanks for your reply and for answering my question.
