I need to use the API to generate lengthy (1000-2000 word) personalized reports. For the longer reports I’m often left with the AI saying “repeat for other categories” instead of actually completing the report.
My prompt needs to be quite long (500-1000 words) to give context + instruction. My hypothesis is that my request takes up too much of the 4096 response token limit. So the solution might be to move the standard context part of the prompt out of the API request/response (which is limited to 4096 tokens), and instead have it in the context window (which has a 128,000-token limit). How can I do this?
Chat: Will it help to move it from the “user prompt” to the “system prompt” in the payload sent to the “Chat API”? (A sketch of what I mean is below.)
Assistants: or should I switch over to the “Assistants API” and then set up the context under “instructions” on the OpenAI website, and only send the instruction (user prompt) in the payload sent to the API each time?
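To make the Chat option concrete, here is roughly what I mean (a minimal sketch with the openai Python SDK; the model name and prompt strings are placeholders, not my real prompts):

```python
# Minimal sketch of the Chat option (openai Python SDK >= 1.0).
# Model name and prompt strings are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

STANDARD_CONTEXT = "...the ~500-1000 word standing context..."        # placeholder
REPORT_REQUEST = "...the per-report instruction and client data..."   # placeholder

response = client.chat.completions.create(
    model="gpt-4-turbo",   # assumed 128k-context model
    max_tokens=4096,       # upper bound on the generated reply only
    messages=[
        {"role": "system", "content": STANDARD_CONTEXT},  # context moved to the system prompt
        {"role": "user", "content": REPORT_REQUEST},      # only the instruction stays here
    ],
)
print(response.choices[0].message.content)
```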
Thanks!
You seem to have a few misconceptions, which I can probably clear up:
The context window length is 128,000 tokens (roughly 125k of it usable for input), and it is shared by all language inference, input and output alike.
The only confusing part is that OpenAI artificially capped how much of that space they will generate and send back to you as a response, and then went further by training the AI itself not to produce more than about 500 words.
The system instructions, past chat, functions, and extra knowledge are all loaded as formatted plain text in one linear space used for the calculation. What the AI outputs, one token at a time, builds on top of that, since the AI always decides what to produce next in a one-directional manner.
So about the only concern is a loss of focus as the extra input gets longer and longer: the model has to attend to all of that input at once to decide whether the “talk like a pirate” part of the input is still relevant, and to reweight its output accordingly.
It’s very common for the output to remain significantly below the 4096 output-token limit; around 800-900 words tends to be the upper end of what the model returns in one API call. None of your proposed changes will affect that.
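You can see this for yourself by checking the usage block the API returns with each completion (a quick sketch; the model name and prompts are placeholders):

```python
# Sketch: inspect how many tokens the model actually generated vs. the 4096 cap.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-turbo",  # placeholder model name
    max_tokens=4096,      # only an upper bound; the model usually stops far earlier
    messages=[
        {"role": "system", "content": "You write long, detailed reports."},
        {"role": "user", "content": "Write a 2000-word report covering every category."},
    ],
)
usage = response.usage
print("prompt tokens:    ", usage.prompt_tokens)      # input side of the shared window
print("completion tokens:", usage.completion_tokens)  # typically well below 4096
```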
@_j recently made a few good posts about the issue, but I’m struggling to find them right now.
Edit: Here is one of the posts that speaks to that:
Thank you guys! So, if I understand correctly, reducing the extent to which my prompt eats up some of the 4096 tokens will not make a difference, since the current constraint on reply length is the model’s internal training, not the 4096-token limit… do I have that right?
Then, just to make sure I understand the concepts (and in the hope that the internal training changes in future models): would it help in theory, even if in practice it currently doesn’t, to move the context out of the request in the two ways described?
I’m confused: did I miss your message here earlier? Sorry if I did; in that case my response wouldn’t have been necessary, as you had already cleared it up.
No you didn’t; your first message is what cleared it up for me. Do you mind giving your opinion on the follow-up: would it help in theory, even if in practice it currently doesn’t, to move the context out of the request in the two ways described?
Since I need to decide whether to switch over from the Chat API to the Assistants API… (your general commentary on whether that is advisable would also be welcome).
Both the system and the user prompt/instruction count towards the 128k context window, so it would not matter. For an Assistant, the instructions are likewise included when a request is made.
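If you want to see where your prompts land in that budget, you can count them with tiktoken (a sketch; cl100k_base is the encoding I’d assume for the current 128k models, and the strings are placeholders):

```python
# Sketch: both the system and the user prompt draw from the same context window.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding for gpt-4-turbo-class models

system_prompt = "...standard context (500-1000 words)..."  # placeholder
user_prompt = "...per-report instruction..."               # placeholder

input_tokens = len(enc.encode(system_prompt)) + len(enc.encode(user_prompt))
print(f"input tokens (approximate, ignoring per-message overhead): {input_tokens}")
# These input tokens plus whatever the model generates all share the 128k window;
# only the generated part is additionally capped at 4096 per call.
```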
I understand, but the Assistant’s instructions are set up once on the OpenAI website, so they are not included in the 4096 response/request limit, as they are not sent as part of the API payload each time. Am I right?
Assistants are currently a lot harder to manage in terms of input and output tokens than a normal API call. There are a lot of nuances around how an Assistant operates, but when you initiate a so-called run, the Assistant’s instructions are still part of the request that gets processed.
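To illustrate, here is a rough sketch of that flow with the beta Python SDK (the endpoints sit under `client.beta` at the time of writing and may change; the model name and strings are placeholders):

```python
# Sketch of the Assistants flow: the instructions set once at creation time
# still travel with every run the server executes, and still consume context.
from openai import OpenAI

client = OpenAI()

assistant = client.beta.assistants.create(
    model="gpt-4-turbo",                    # placeholder model name
    instructions="...standard context...",  # placeholder, set once
)

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="...per-report instruction...",  # placeholder, sent each time
)

run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,
)
# You then poll the run until its status is "completed" and read the thread's
# messages; the instructions plus the thread history all count against the
# same context window as a normal chat call.
```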
In my own use, I’ve never seen a difference between an Assistant and the regular API in the output-token patterns described in the earlier messages.
If you are looking to create reports, then an iterative approach is considered best practice, and you are also likely to get better results that way than by trying to have the model output a lengthy document in one go.
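A minimal sketch of what that iterative approach could look like, generating one section per call and stitching the pieces together (section names, word counts, and the model are placeholders for your own setup):

```python
# Sketch: build the long report one category/section at a time.
from openai import OpenAI

client = OpenAI()

STANDARD_CONTEXT = "...the standing context for every report..."  # placeholder
SECTIONS = ["Category A", "Category B", "Category C"]             # placeholder section list

report_parts = []
for section in SECTIONS:
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder model name
        max_tokens=1500,
        messages=[
            {"role": "system", "content": STANDARD_CONTEXT},
            {
                "role": "user",
                "content": f"Write only the '{section}' section of the report, 300-500 words.",
            },
        ],
    )
    report_parts.append(resp.choices[0].message.content)

full_report = "\n\n".join(report_parts)  # stitch the sections into the final document
print(full_report)
```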