Assistants API: how to set a context window limit for GPT-4 Turbo

Hi OpenAI team and community!

We are using the Assistants API and GPT-4 Turbo to implement some really advanced features, and I really love the fact that Turbo is 3x cheaper than GPT-4.

But with the Assistants API, when I create a thread and keep adding messages, since the context window is 128k, once the window fills up every turn of the conversation costs me over $1, which is hard for me and my end users to afford.
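To make the cost concrete, here is a rough back-of-the-envelope sketch. The per-token prices below are the launch rates ($0.01 per 1K input tokens for GPT-4 Turbo, $0.03 per 1K for 8k GPT-4) and may have changed since:

```python
# Rough per-turn input cost once a thread has grown to fill the context
# window. Prices are the assumed launch rates (USD per 1K input tokens).
GPT4_TURBO_INPUT_PER_1K = 0.01
GPT4_8K_INPUT_PER_1K = 0.03

def per_turn_input_cost(context_tokens: int, price_per_1k: float) -> float:
    """Cost of resending the whole context as input on a single turn."""
    return context_tokens / 1000 * price_per_1k

full_turbo = per_turn_input_cost(128_000, GPT4_TURBO_INPUT_PER_1K)   # ~$1.28
capped_turbo = per_turn_input_cost(8_000, GPT4_TURBO_INPUT_PER_1K)   # ~$0.08
gpt4_8k = per_turn_input_cost(8_000, GPT4_8K_INPUT_PER_1K)           # ~$0.24
```

This is why an 8k cap matters: a full 128k Turbo context costs more per turn than an entire 8k GPT-4 context, despite the lower per-token price.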

So, at least with the Assistants API, I have not benefited from Turbo's lower pricing, which I had hoped to.

If I cannot set a smaller context window limit, like 8k, Turbo is actually more costly than GPT-4, so I have to switch back to GPT-4 (which is not cheap either).

And we don’t actually need a window as long as 128k for our product to work, pricing concerns aside, but we do need the “3x cheaper” factor.

I really need this feature. Once it ships, I can put GPT-4 Turbo in production, which would be fantastic for me and my 1000+ end users.


Assistants is unshippable.

  • The chat length (threads) is not under your control, and the assistants backend will fill the context with all you have and all that can fit.

  • The retrieval function is not under your control, and the assistants backend will fill the context with all you have and all that can fit.

At least that’s what they say about retrieval, when in fact the AI is given an undocumented “browse the files” function with which it can autonomously make multiple calls, each consuming that maximum context.

  • Other repeated iterations are not under your control, and the assistants backend will encourage the AI to persistently call the code interpreter and functions, even calling with the same text over and over (and you can only see this when it is your own API key being billed).

  • The AI model quality is not under your control, and OpenAI has demonstrated again and again their willingness to apply changes and degrade AI models that are used in production. The preview model already has substantial problems calling functions, as a scan of this forum will reveal, which can result in loops without termination.

  • Actually getting an output after you’ve paid is not under your control. When there is an error and no response, or the model times out with the same occasional non-response API users see on the new models, you still get charged.

  • The time the assistant will spend is out of your control during all of this internal AI language writing, and you do not see a streamed response when it finally answers. Checking back later is not what users want, but it is exactly what you must do to use the assistants: keep making network requests to find out whether a response is ready.

  • Finally, the costs are completely out of control, unaccountable, and OpenAI has done their best to hide them, deploying a useless “usage” page in concert with the DevDay release of assistants and not giving you token counts in responses. Files are double-billed for every assistant and for every inclusion in a message, and the code interpreter costs you for every user. Inferring the costs from nothing but an AI model’s daily totals is unbillable and intolerable.
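The check-back-later pattern described above boils down to a plain polling loop. Here is a minimal sketch of it; the status call is abstracted behind a callable so the example runs offline (in the real client it would be a run-retrieval request against the thread):

```python
import time
from typing import Callable

def poll_run(get_status: Callable[[], str],
             interval_s: float = 1.0,
             timeout_s: float = 60.0) -> str:
    """Poll until the run reaches a terminal status, or give up.

    `get_status` stands in for the real status call; injecting it
    keeps this sketch self-contained and testable offline.
    """
    terminal = {"completed", "failed", "cancelled", "expired"}
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status in terminal:
            return status
        time.sleep(interval_s)
    raise TimeoutError("run did not finish before the timeout")

# Simulated run that completes on the third check:
statuses = iter(["queued", "in_progress", "completed"])
result = poll_run(lambda: next(statuses), interval_s=0.0)
```

Every iteration of that loop is a network request you pay for in latency, which is exactly the user experience complaint: no streaming, just repeated checks.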

Assistants absolutely should be avoided. Nobody “really needs this”. I could add dozens of citations, but that is the whole forum now (when it is not GPT promoters and novices.)

(my opinion)


The Assistants API is still new, and a lot of it is, at the moment, under the API’s control rather than yours. Things like knowledge retrieval from documents or thread length need to be customisable. I would suggest holding off on using assistants in prod for now.

Also, if you want the same behaviour as an assistant but without the new API, it can be done using message and thread manipulation with the older Chat Completions endpoint. A sample of it is in the OpenAI Cookbook.
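As a rough sketch of the kind of message manipulation meant here (the helper names are mine, and the token estimate is deliberately crude; a real implementation would use a proper tokenizer such as tiktoken):

```python
# Naive rolling-window truncation for Chat Completions style messages.
# Tokens are approximated as len(text) // 4 characters per token, which
# is only a ballpark figure for English text.

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Keep the system prompt plus the most recent messages that fit."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(approx_tokens(m["content"]) for m in system)
    kept = []
    for m in reversed(rest):                 # walk newest-first
        cost = approx_tokens(m["content"])
        if used + cost > budget:
            break
        kept.append(m)
        used += cost
    return system + list(reversed(kept))     # restore chronological order
```

Passing the trimmed list to the Chat Completions endpoint on each turn gives you exactly the context cap the Assistants API currently lacks, e.g. an 8k budget regardless of how long the conversation grows.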


Hi folks – I work on the Assistants API team and appreciate all your feedback. I know this is frustrating, but I can assure you that all of these things are on our roadmap to solve in the coming 1-2 months. We will fix most of them before we move from Beta to GA.

To specifically answer your question, @SheldonNiu – there is currently no way to specify a maximum number of tokens the Assistant or tool should use, but support for this is coming soon! Thanks.


Hi nikunj!
Thankful and looking forward to the update!
And thanks for making this amazing Assistants API, without which we could not make our product so intelligent and powerful :slight_smile:

@nikunj If you’ve got time please have a look at this post I made days ago regarding adjustment to the Instructions: Assistants API feature Adjustment| Thread run Optional Instruction
Thank you

thx nikunj! I really like the concept of the Assistants API – before it, I had to build all these out-of-the-box features myself using LangChain or whatnot…

that said, when will the Assistants API reach GA?
In general I agree with the previous assessment… I would like to see the following updates:

  1. get rid of polling to get the status
  2. provide control over what is exchanged between the client and the API
  3. overall pricing model: way too expensive – it makes almost no business case to use the Assistants API