I need to use the API to generate lengthy (1000-2000 word) personalized reports. For the longer reports I’m often left with the AI saying “repeat for other categories” instead of actually completing the report.
My prompt needs to be quite long (500-1000 words) to give context + instruction. My hypothesis is that my request takes up too much of the 4096 response token limit. So the solution might be to move the standard context part of the prompt out of the API request/response (which is limited to 4096 tokens), and instead have it in the context window (which has a 128,000-token limit). How can I do this?
Chat: Will it help to move it from the “user prompt” to the “system prompt” in the payload sent to the “Chat API”? (A sketch of what I mean is below.)
Assistants: or should I switch over to the “Assistants API” and then set up the context under “instructions” on the OpenAI website, and only send the instruction (user prompt) in the payload sent to the API each time?
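To make the Chat option concrete, here is roughly what I mean (a minimal sketch with the openai Python SDK; the model name and prompt strings are placeholders, not my real prompts):

```python
# Minimal sketch of the Chat option (openai Python SDK >= 1.0).
# Model name and prompt strings are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

STANDARD_CONTEXT = "...the ~500-1000 word standing context..."        # placeholder
REPORT_REQUEST = "...the per-report instruction and client data..."   # placeholder

response = client.chat.completions.create(
    model="gpt-4-turbo",   # assumed 128k-context model
    max_tokens=4096,       # upper bound on the generated reply only
    messages=[
        {"role": "system", "content": STANDARD_CONTEXT},  # context moved to the system prompt
        {"role": "user", "content": REPORT_REQUEST},      # only the instruction stays here
    ],
)
print(response.choices[0].message.content)
```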
Thanks!
You seem to have a few misconceptions, which I can probably clear up:
The context window length is 128,000 tokens (roughly 125k of it usable for input), and it is shared by all language inference, input and output alike.
The only confusing part is that OpenAI artificially capped how much of that space they will generate and send back to you as a response, and then went further by training the AI itself not to produce more than about 500 words.
The system instructions, past chat, functions, and extra knowledge are all loaded as formatted plain text in one linear space used for the calculation. What the AI outputs, one token at a time, builds on top of that, since the AI always decides what to produce next in a one-directional manner.
So about the only concern is a loss of focus as the extra input gets longer and longer: the model has to attend to all of that input at once to decide whether the “talk like a pirate” part of the input is still relevant, and to reweight its output accordingly.
It’s very common for the output to remain significantly below the 4096 output-token limit; around 800-900 words tends to be the upper end of what the model returns in one API call. None of your proposed changes will affect that.
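You can see this for yourself by checking the usage block the API returns with each completion (a quick sketch; the model name and prompts are placeholders):

```python
# Sketch: inspect how many tokens the model actually generated vs. the 4096 cap.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-turbo",  # placeholder model name
    max_tokens=4096,      # only an upper bound; the model usually stops far earlier
    messages=[
        {"role": "system", "content": "You write long, detailed reports."},
        {"role": "user", "content": "Write a 2000-word report covering every category."},
    ],
)
usage = response.usage
print("prompt tokens:    ", usage.prompt_tokens)      # input side of the shared window
print("completion tokens:", usage.completion_tokens)  # typically well below 4096
```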
@_j recently made a few good posts about the issue, but I’m struggling to find them right now.
Edit: Here is one of the posts that speaks to that:
Thank you guys! So, if I understand correctly, reducing the extent to which my prompt eats up some of the 4096 tokens will not make a difference, since the current constraint on reply length is the model’s internal training, not the 4096-token limit… do I have that right?
Then, just to make sure I understand the concepts (and in the hope that the internal training changes in future models): would it help in theory, even if in practice it currently doesn’t, to move the context out of the request in the two ways described?
I’m confused: did I miss your message here earlier? Sorry if I did; in that case my response wouldn’t have been necessary, as you had already cleared it up.
No you didn’t; your first message is what cleared it up for me. Do you mind giving your opinion on the follow-up: would it help in theory, even if in practice it currently doesn’t, to move the context out of the request in the two ways described?
Since I need to decide whether to switch over from the Chat API to the Assistants API… (your general commentary on whether that is advisable would also be welcome).
Both the system and the user prompt/instruction count towards the 128k context window, so it would not matter. For an Assistant, the instructions are likewise included when a request is made.
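If you want to see where your prompts land in that budget, you can count them with tiktoken (a sketch; cl100k_base is the encoding I’d assume for the current 128k models, and the strings are placeholders):

```python
# Sketch: both the system and the user prompt draw from the same context window.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding for gpt-4-turbo-class models

system_prompt = "...standard context (500-1000 words)..."  # placeholder
user_prompt = "...per-report instruction..."               # placeholder

input_tokens = len(enc.encode(system_prompt)) + len(enc.encode(user_prompt))
print(f"input tokens (approximate, ignoring per-message overhead): {input_tokens}")
# These input tokens plus whatever the model generates all share the 128k window;
# only the generated part is additionally capped at 4096 per call.
```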
I understand, but the Assistant’s instructions are set up once on the OpenAI website, so they are not included in the 4096 response/request limit, as they are not sent as part of the API payload each time. Am I right?
Assistants are currently a lot harder to manage in terms of input and output tokens than a normal API call. There are a lot of nuances around how an Assistant operates, but when you initiate a so-called run, the Assistant’s instructions are still part of the request that gets processed.
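To illustrate, here is a rough sketch of that flow with the beta Python SDK (the endpoints sit under `client.beta` at the time of writing and may change; the model name and strings are placeholders):

```python
# Sketch of the Assistants flow: the instructions set once at creation time
# still travel with every run the server executes, and still consume context.
from openai import OpenAI

client = OpenAI()

assistant = client.beta.assistants.create(
    model="gpt-4-turbo",                    # placeholder model name
    instructions="...standard context...",  # placeholder, set once
)

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="...per-report instruction...",  # placeholder, sent each time
)

run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,
)
# You then poll the run until its status is "completed" and read the thread's
# messages; the instructions plus the thread history all count against the
# same context window as a normal chat call.
```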
In my own use, I’ve never seen a difference between an Assistant and the regular API in the output-token patterns described in the earlier messages.
If you are looking to create reports, then an iterative approach is considered best practice, and you are also likely to get better results that way than by trying to have the model output a lengthy document in one go.
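A minimal sketch of what that iterative approach could look like, generating one section per call and stitching the pieces together (section names, word counts, and the model are placeholders for your own setup):

```python
# Sketch: build the long report one category/section at a time.
from openai import OpenAI

client = OpenAI()

STANDARD_CONTEXT = "...the standing context for every report..."  # placeholder
SECTIONS = ["Category A", "Category B", "Category C"]             # placeholder section list

report_parts = []
for section in SECTIONS:
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder model name
        max_tokens=1500,
        messages=[
            {"role": "system", "content": STANDARD_CONTEXT},
            {
                "role": "user",
                "content": f"Write only the '{section}' section of the report, 300-500 words.",
            },
        ],
    )
    report_parts.append(resp.choices[0].message.content)

full_report = "\n\n".join(report_parts)  # stitch the sections into the final document
print(full_report)
```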