When using Playground, what happens if total system/user/assistant prompts exceed max token length

Hi all. I’m experimenting with gpt-4-1106-preview in the playground.

I am putting a lot of work into the system prompt, and then using the user/assistant prompt to further increase the quality of output.

If I add up all of the tokens used in system/user/assistant, it adds up to 8575 tokens. The max number of tokens for this model is 4095.
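For reference, here is a minimal sketch of roughly how a count like that can be made in Python with tiktoken; the cl100k_base encoding and the small per-message overhead are assumptions on my part, so the exact total may differ slightly:

```python
import tiktoken

# cl100k_base is the encoding I understand the GPT-4 family uses (assumption)
enc = tiktoken.get_encoding("cl100k_base")

# Stand-ins for the real system/user/assistant content
messages = [
    {"role": "system", "content": "...long system prompt..."},
    {"role": "user", "content": "...example user turn..."},
    {"role": "assistant", "content": "...example assistant turn..."},
]

total = 0
for message in messages:
    # A few tokens of per-message formatting overhead; the exact figure is approximate
    total += len(enc.encode(message["content"])) + 4

print(f"Approximate total prompt tokens: {total}")
```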

Two questions:

  • What is happening behind the scenes if my input is twice the number of tokens allowed? Is it forgetting half of the input? If so, which part is it forgetting? Do I need to get the system/user/assistant token count way down so that it both remembers all of the information and has enough left to generate the output?

  • Is my combined system/user/assistant token count of 8575 completely crazy? I understand the cost consequences, but I wanted to know if anyone routinely makes API calls of this size.

Have I completely misunderstood how all of this works? I want to make sure I understand whether a poor-quality output is because of my prompt engineering, or whether the model is just dropping half of my inputs.

Thank you!

You’ve been confused by unclear terminology that has no central, authoritative definition.

The model gpt-4-1106-preview is unusual in that its output is limited, by OpenAI’s choice and enforcement.

I’ll introduce a new term: context length (also called context window, or context window length). This is the AI’s total working memory for tokens: both the input tokens supplied by the user and the response that is formed after that input.

gpt-4-1106-preview total model context length is 125k, so it can accept much more input.

max_tokens is a parameter that can be set via the API to limit the output, so that generation stops after a certain length. It also acts to reserve context length just for forming the output (just by the endpoint math).

So essentially, if you don’t specify max_tokens, a 4k maximum output is set for you, unlike other models where all remaining context length can be used for output (and far less is preset on the vision model).

The output will stop when it hits the max_tokens limit or the context length has no remaining space.

The API request will be refused if your input plus your max_tokens specification exceeds what the model can handle.

(I’d show you some huge single requests from the past, but the ability to demonstrate such has been destroyed in the revised usage page)
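To make the endpoint math concrete, here is a minimal Python sketch using the figures quoted in this thread (treat them as approximate, not authoritative):

```python
# Endpoint math for gpt-4-1106-preview, using the figures from this thread
context_length = 125_000  # shared space for input tokens plus generated output
output_cap = 4_095        # the maximum the playground slider lets you request

prompt_tokens = 8_575     # combined system/user/assistant input
max_tokens = 4_000        # output reservation requested in the call

if max_tokens > output_cap:
    print("Request refused: max_tokens exceeds this model's output limit")
elif prompt_tokens + max_tokens > context_length:
    print("Request refused: not enough context length left for the reservation")
else:
    print(f"Accepted: up to {max_tokens} tokens reserved for the output, "
          f"{context_length - prompt_tokens - max_tokens} tokens of context to spare")
```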

Thanks for the reply. At this stage, the explanation you’ve provided is a little over my head.

You’re saying that the model can accept 125k, but I’m being limited to 4095? Is that a universal thing right now, or does it vary from account to account?

> max_tokens is a parameter that can be set via the API to limit the output, so that generation stops after a certain length. It also acts to reserve context length just for forming the output (just by the endpoint math).

You’re saying that if it needs to use tokens for the output, it’ll reduce the number of tokens available for the context?

> So essentially, if you don’t specify max_tokens, a 4k maximum output is set for you, unlike other models where all remaining context length can be used for output (and far less is preset on the vision model).

I am specifying max_tokens. I’m setting the slider all the way up.

> The output will stop when it hits the max_tokens limit or the context length has no remaining space.

> The API request will be refused if your input plus your max_tokens specification exceeds what the model can handle.

So, if I’m receiving the full output as requested, and I’m not receiving any error message, does that mean everything is working as it should? If that’s the case, how is that possible when my total input is double what the model says is allowed?

My issue wasn’t that the output was getting cut off. I was getting suspicious that the quality wasn’t as high, so I was wondering if it was ignoring my inputs…

Each AI model has a particular context length.

Some models have a limit imposed on their output by OpenAI.

You are confusing the output limit on some models with the amount of input that can be accepted, as long as you still leave room for a response to be written.

Refined just for you:

Model                 Input cost /1M    Output cost /1M    Context Length    Output limit
GPT-3.5-turbo-1106    $1.00             $2.00              16k               4k
GPT-3.5-turbo         $1.50             $2.00              4k                -
GPT-3.5-turbo-16k     $3.00             $4.00              16k               -
GPT-4-1106 (turbo)    $10.00            $30.00             125k              4k
GPT-4                 $30.00            $60.00             8k                -
  • Costs are billed per token but shown per million for clarity

Definitions:

Model: Specific API model or current stable API model alias
Input cost /1M: Cost per one million total tokens sent to the model
Output cost /1M: Cost per one million total tokens generated by the model
Context Length: Shared model area for accepting the input and creating the response
Output limit: Artificial limit imposed on the model’s output by OpenAI


The more you send as context, such as past chat history or documents, the more your instructions will tend to get lost within that large input, which the AI has to consider as a whole when generating output. Larger inputs can thus have the effect of lower-quality instruction following.

Your observation is correct for the Assistants API:

> Threads don’t have a size limit. You can add as many Messages as you want to a Thread. The Assistant will ensure that requests to the model fit within the maximum context window, using relevant optimization techniques such as truncation which we have tested extensively with ChatGPT. When you use the Assistants API, you delegate control over how many input tokens are passed to the model for any given Run, this means you have less control over the cost of running your Assistant in some cases but do not have to deal with the complexity of managing the context window yourself.

So, while you are pampered with not having to think about the context length, you need to keep the downside in mind.
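As a rough Python sketch of what that delegation looks like with the openai v1 client (the assistant ID here is a hypothetical placeholder), you just keep adding messages to a thread and start runs; the API decides what fits into the model’s context window each time:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ASSISTANT_ID = "asst_..."  # hypothetical placeholder for an assistant created beforehand

# Threads have no size limit; add as many messages as you like
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="...a very long message, plus any documents or history...",
)

# For each run, the API itself truncates/fits the thread into the model's context window
run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=ASSISTANT_ID,
)
print(run.id, run.status)
```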


Hi there.

I’m using the Playground as I want to test out how the API will perform. Does that change any of the information you gave me?

Or is the Playground API the same as the Assistants API?

It should be the same behavior in the Playground as when you deploy the Assistants API somewhere else.

Just to note: I identify the usage as being the Chat Completions endpoint, from the language of writing a “system message”, not the Assistants agent.

Calling the Chat Completions endpoint with your own code in non-streaming mode also gives you insight into token consumption. You may want to transition from Playground testing to code-based testing, getting the usage out of the full response.
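For example, a minimal non-streaming sketch with the Python openai v1 client (the message contents and the max_tokens value are placeholders): the usage object on the full response shows exactly how many tokens went in and how many came out.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[
        {"role": "system", "content": "...your system prompt..."},  # placeholder
        {"role": "user", "content": "...your user message..."},     # placeholder
    ],
    max_tokens=4000,  # reserves room in the context for the output
)

# The non-streaming response includes a usage report you can log for every call
print("prompt tokens:    ", response.usage.prompt_tokens)
print("completion tokens:", response.usage.completion_tokens)
print("total tokens:     ", response.usage.total_tokens)
```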


You’re right. I assumed the OP was using the Assistants API since they are not hitting the limit. There is a difference between the Chat Completions API and the Assistants API in how they manage context.

I am talking about the Chat Completions API.

Not hitting the limit is what I’m confused about. I still don’t understand how I’m able to use a prompt/completion that’s over 8k tokens and have it work?