Hi all. I’m experimenting with gpt-4-1106-preview in the playground.
I am putting a lot of work into the system prompt, and then using the user/assistant prompt to further increase the quality of output.
If I add up all of the tokens used in system/user/assistant, it adds up to 8575 tokens. The max number of tokens for this model is 4095.
Two questions:
What is happening behind the scenes if my input is twice the length of the tokens allowed? Is it forgetting half of the input? If so, which part is it forgetting? Do I need to get the system/user/assistant way down so it both remembers all of the information and also has enough left to generate the output?
Is my combined total of 8,575 tokens across system/user/assistant completely crazy? I understand the cost consequences, but I wanted to know if anyone routinely makes API calls of this size.
Have I completely misunderstood how all of this works? I want to make sure I understand whether a poor quality output is because of the prompt engineering, or if the model is just missing out half of my inputs.
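(In case it matters, here is a rough sketch of the kind of count I mean, using tiktoken; the +4 per-message overhead is only an approximation, and the message contents are placeholders.)

```python
# Rough token-count sketch with tiktoken (cl100k_base is the encoding
# used by gpt-4 and gpt-3.5-turbo); the +4 per-message overhead is an
# approximation, not the exact chat-format accounting.
import tiktoken

messages = [
    {"role": "system", "content": "...long system prompt..."},
    {"role": "user", "content": "...user prompt..."},
    {"role": "assistant", "content": "...example assistant output..."},
]

enc = tiktoken.get_encoding("cl100k_base")
total = sum(len(enc.encode(m["content"])) + 4 for m in messages)
print(f"approximate input tokens: {total}")
```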
You’ve been confused by unclear terminology that has no central authority behind it.
The model gpt-4-1106-preview is unusual in that it has a limited output, by OpenAI’s choice and enforcement.
I’ll introduce a new term: context length, also called context window or context window length. This is the AI’s total memory for working with tokens, covering both the input tokens supplied by a user and the response that is formed after that input.
gpt-4-1106-preview has a total model context length of 128k, so it can accept much more input.
max_tokens is a parameter that can be set via the API to limit the output so that generation stops after a certain length. It also acts to reserve context length just for forming the output (just by the endpoint math).
So essentially, if you don’t specify max_tokens, a 4k maximum output is set for you (unlike other models, where all remaining context length can be used for output), and far less is preset on the vision model.
The output will stop when it hits the max_tokens limit or the context length has no remaining space.
The API request will be refused if you send more input + max_tokens specification than the model can handle.
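To make that concrete, here is a minimal sketch of a Chat Completions request with the Python SDK (the prompt text is a placeholder): the request is rejected up front if the prompt tokens plus max_tokens exceed the context length, and otherwise the reply stops at whichever limit it reaches first.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[
        {"role": "system", "content": "...your system prompt..."},
        {"role": "user", "content": "...your user prompt..."},
    ],
    # Reserves up to 4,095 tokens of the context window for the reply.
    # If the input tokens plus this value exceed the context length,
    # the API returns an error instead of silently dropping input.
    max_tokens=4095,
)

print(response.choices[0].message.content)
print(response.choices[0].finish_reason)  # "length" if the cap was hit, else "stop"
```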
(I’d show you some huge single requests from the past, but the ability to demonstrate such has been destroyed in the revised usage page)
Thanks for the reply. At this stage, the explanation you’ve provided is a little over my head.
You’re saying that the model can accept 128k, but I’m being limited to 4095? Is that a universal thing right now, or does it vary from account to account?
> max_tokens is a parameter that can be set via the API to limit the output so that generation stops after a certain length. It also acts to reserve context length just for forming the output (just by the endpoint math).

You’re saying that if it needs to use tokens for the output, it’ll reduce the number of tokens used for the context?
> So essentially, if you don’t specify max_tokens, a 4k maximum output is set for you (unlike other models, where all remaining context length can be used for output), and far less is preset on the vision model.

I am specifying max_tokens. I’m setting the slider all the way up.
> The output will stop when it hits the max_tokens limit or the context length has no remaining space.
> The API request will be refused if you send more input + max_tokens specification than the model can handle.

So, if I’m receiving the full output as requested, and I’m not receiving any error message, does that mean everything is working as it should? If that’s the case, how is that possible when my total token count is double what the model says is allowed?
My issue wasn’t that the output was getting cut off. I was getting suspicious that the quality wasn’t as high, so I was wondering if it was ignoring my inputs…
Some models have a limit imposed on their output by OpenAI.
You are confusing the output limit on some models with the amount of input that can be accepted, as long as you still leave room for a response to be written.
Refined just for you:
| Model | Input cost / 1M | Output cost / 1M | Context length | Output limit |
|---|---|---|---|---|
| GPT-3.5-turbo-1106 | $1.00 | $2.00 | 16k | 4k |
| GPT-3.5-turbo | $1.50 | $2.00 | 4k | – |
| GPT-3.5-turbo-16k | $3.00 | $4.00 | 16k | – |
| GPT-4-1106 (turbo) | $10.00 | $30.00 | 128k | 4k |
| GPT-4 | $30.00 | $60.00 | 8k | – |
Costs are billed per token but shown per million for clarity
Definitions:
- Model: specific API model or current stable API model alias
- Input cost: cost per 1M total tokens sent to the model
- Output cost: cost per 1M total tokens generated by the model
- Context length: shared model area for accepting input and forming the response
- Output limit: artificial limit imposed on the model’s output
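Plugging the numbers from the first post into that table (a rough illustration; the 8,575 figure is the poster’s own count):

```python
# Budget check for gpt-4-1106-preview using the numbers from this thread.
CONTEXT_LENGTH = 128_000   # total window shared by input and output
input_tokens = 8_575       # system + user + assistant messages (poster's count)
max_tokens = 4_095         # playground slider all the way up

remaining = CONTEXT_LENGTH - (input_tokens + max_tokens)
print(f"headroom after input and reserved output: {remaining} tokens")
# Positive headroom: the request fits, nothing is truncated or "forgotten".
```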
The more you send as context, such as past chat history or documents, the more your instructions will tend to get lost within that large input, which the AI has to consider as a whole when generating output. Larger inputs thus tend to produce lower-quality instruction-following.
> Threads don’t have a size limit. You can add as many Messages as you want to a Thread. The Assistant will ensure that requests to the model fit within the maximum context window, using relevant optimization techniques such as truncation which we have tested extensively with ChatGPT. When you use the Assistants API, you delegate control over how many input tokens are passed to the model for any given Run, this means you have less control over the cost of running your Assistant in some cases but do not have to deal with the complexity of managing the context window yourself.
So, while you are pampered by not having to think about the context length, you need to keep the downside in mind.
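For illustration, a minimal Assistants sketch (the assistant_id is a placeholder for one you created earlier); you keep appending messages to the thread, and the API decides how much of that history reaches the model on each run:

```python
from openai import OpenAI

client = OpenAI()

# Threads have no size limit; you keep adding messages and the
# Assistants API manages how much history fits into the model's context.
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="...another message; the thread can grow indefinitely...",
)
run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id="asst_...",  # placeholder: an assistant you created earlier
)
```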
Just to note: I identify the usage as being the Chat Completions endpoint, based on the language about writing a “system message”, not the Assistants agent.
The Chat Completions endpoint, called from your own code in non-streaming mode, also gives you insight into token consumption. You may want to transition from playground testing to code-based testing, reading the usage out of the full API response.
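For example, with the Python SDK in non-streaming mode, something along these lines returns exact token counts with every reply (the model and messages here are just placeholders):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=100,
)

# The usage object reports exactly what you were billed for.
print(response.usage.prompt_tokens)      # tokens in your input
print(response.usage.completion_tokens)  # tokens generated
print(response.usage.total_tokens)       # sum of both
```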
You’re right. I assumed OP is using Assistants API since they are not hitting the limit. There is a difference between Chat Completions API and Assistants API in managing context.
Not hitting the limit is what I’m confused about. I still don’t understand how I’m able to use a prompt/completion that’s over 8k tokens and it works?