OpenAI Assistant maximum token per Thread

For OpenAI’s new Assistants API, I am looking at an example from the example notebook here:

On the tutorial, it says the following:

“Notice how the Thread we created is not associated with the Assistant we created earlier! Thread objects exist independently from Assistants, which may be different from what you’d expect if you’ve used ChatGPT (where a Thread is tied to a model/GPT).”

Does this mean that within each Thread, we can send up to the model’s maximum number of tokens through Messages?

Basically, we know that gpt-4-turbo has a maximum token limit of 128K. Does that mean that if I create an Assistant and split the work across 5 Threads, each Thread can handle 128K? Or must the 128K limit be distributed among these 5 Threads?

The reason I ask is that we have a tabular dataset (3000 rows) with schema doc_id (str) and text (str). Each document (text) is very long. We want to apply the chatcompletion API to each row with the new GPT-4 Turbo model. Right now we send each row to the chatcompletion API one at a time with @retry. If a Thread is not tied to an Assistant, then I can create multiple Threads (say 10) and run them in parallel against the Assistant, which will make our process faster, correct?


Good catch! I never realized that and yeah you only associate it upon running. So in theory, you can run your thread using different Assistants! Cool!

As for token limits, I think OpenAI manages them. You can probably add 100 messages that together exceed the limit, and under the hood OpenAI truncates or does whatever magic is needed to stay within the token limit and keep context.

From the docs:

There is no limit to the number of Messages you can store in a Thread. Once the size of the Messages exceeds the context window of the model, the Thread will attempt to include as many messages as possible that fit in the context window and drop the oldest messages. Note that this truncation strategy may evolve over time.
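To make the quoted behavior concrete, here is a minimal sketch of a "keep the newest messages that fit, drop the oldest" policy. It is only an illustration of the idea in the docs, not OpenAI’s actual implementation: `fit_to_context` and `count_tokens_naive` are hypothetical names, and the word-count tokenizer is a stand-in for a real one.

```python
from typing import Callable, List


def count_tokens_naive(text: str) -> int:
    # Stand-in tokenizer (assumption): one token per whitespace-separated word.
    return len(text.split())


def fit_to_context(
    messages: List[str],
    context_window: int,
    count_tokens: Callable[[str], int] = count_tokens_naive,
) -> List[str]:
    """Keep the newest messages that fit in the budget, dropping the oldest."""
    kept: List[str] = []
    used = 0
    # Walk from newest to oldest and stop at the first message that does not fit.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > context_window:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["first message here", "second one", "third and newest message"]
print(fit_to_context(history, context_window=6))
```

With a budget of 6 "tokens", the oldest message is dropped and the two newest are kept. The practical consequence for the question above is that each Thread is trimmed to the model’s context window on its own; the window is not shared across Threads.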


Thank you!

So you are basically saying that I can use Threads to “spread out” the data. Even though the overall token limit is unchanged, I can use this as a data engineering trick to increase throughput and shorten the processing time.

Right now, with the dataset I described previously, I have to apply the chatcompletion function to each row one by one. In the future, I can simply create many Threads and send mini-batches to each Thread in parallel, so that the overall processing time is reduced. Am I understanding this correctly?
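For what it’s worth, row-level parallelism like this does not depend on Threads at all; the same fan-out works with plain chat completion calls, and either way throughput is still capped by the account’s rate limits. Below is a rough sketch using only the standard library. `call_model` is a hypothetical placeholder for whatever function sends one document to the API, and `with_retry` and `process_rows` are illustrative names, not part of any SDK.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Row = Tuple[str, str]  # (doc_id, text)


def with_retry(
    fn: Callable[[str], str], attempts: int = 3, base_delay: float = 1.0
) -> Callable[[str], str]:
    """Wrap fn so transient failures are retried with exponential backoff."""

    def wrapped(text: str) -> str:
        for attempt in range(attempts):
            try:
                return fn(text)
            except Exception:
                if attempt == attempts - 1:
                    raise  # out of attempts: surface the last error
                time.sleep(base_delay * 2 ** attempt)
        raise RuntimeError("attempts must be at least 1")

    return wrapped


def process_rows(
    rows: List[Row],
    call_model: Callable[[str], str],
    max_workers: int = 10,
    base_delay: float = 1.0,
) -> Dict[str, str]:
    """Send every row's text through call_model concurrently, keyed by doc_id."""
    safe_call = with_retry(call_model, base_delay=base_delay)
    doc_ids = [doc_id for doc_id, _ in rows]
    texts = [text for _, text in rows]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with doc_ids.
        return dict(zip(doc_ids, pool.map(safe_call, texts)))


# Usage with a fake model; swap in a real API call for call_model.
print(process_rows([("d1", "hello")], str.upper))
```

A thread pool is enough here because the work is network-bound. `max_workers` would need tuning against the requests-per-minute and tokens-per-minute limits of the account.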


Hey all! Steve from the OpenAI dev team here. We’re working on designing usage controls for thread runs in the assistants API, and I want to provide a preview of the proposed change and get your feedback.

What we’re proposing is adding a new parameter to endpoints that create a run:

  • POST /v1/threads/{thread_id}/runs
  • POST /v1/threads/runs

We would add an optional field, token_control, to the payload that looks like this:

  token_control: {
    max_run_prompt_tokens: int;
    max_run_completion_tokens?: int;
  }

The idea is to internally limit the number of tokens used on each step of the run and make a best effort to keep overall token usage within the limits specified.

Let us know what you think of this idea and whether it will work for your use cases!


Hey Steve, it’s great to see some progress being made.

What would happen if retrieval exceeds the allotted token limit?

Personally, I just want to be able to add my own Assistant messages so I can potentially destroy, manipulate, and recreate a thread with my own RAG and token management techniques.

Hi Steve, thanks for this plan! I have some questions regarding this idea:

  1. If we set the parameter max_run_prompt_tokens, what would be dropped to stay within max_run_prompt_tokens? Would it drop the oldest chat history in order, or limit the instruction tokens? For example, suppose we have defined 3 functions in an assistant, and it needs more than 3 rounds of function calls to reach the final response. How does max_run_prompt_tokens apply in this situation? If the 2nd function call exceeds the maximum, will the assistant respond with an error?
  2. Can we have more detailed parameters that let users configure things such as how many rounds of chat history are put into context, and what kind of submit_function_tools from the previous history are put into context?



Hi Steve, that looks like a nice way of limiting tokens, especially since chat history is sometimes not relevant to an input, and including it leads to higher cost.

Is your solution implemented yet?

I’m currently in need of this feature as my AI Assistant responses are way too long. I need to control the response length. When can we expect this feature to be released?

Some additional feedback: this feature is super important because, in my case for example, it sometimes takes up to 30s for my assistant to generate a response. That is unnecessary and makes for a horrible user experience. No user will sit around waiting 30s for a response from a chatbot.


@jon-malave It’s now available with the April 2024 updates. You can set a token limit like in this example:


max_prompt_tokens = 500)
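The snippet above is cut off, so here is one way such a call might look, assuming a version of the official openai Python SDK that supports the `max_prompt_tokens` and `max_completion_tokens` arguments on run creation. The helper names and the IDs shown are placeholders for illustration, not values from this thread.

```python
from typing import Any, Dict, Optional


def build_run_params(
    thread_id: str,
    assistant_id: str,
    max_prompt_tokens: int,
    max_completion_tokens: Optional[int] = None,
) -> Dict[str, Any]:
    """Assemble keyword arguments for creating a token-limited run."""
    params: Dict[str, Any] = {
        "thread_id": thread_id,
        "assistant_id": assistant_id,
        "max_prompt_tokens": max_prompt_tokens,
    }
    # Only send the completion cap when the caller asked for one.
    if max_completion_tokens is not None:
        params["max_completion_tokens"] = max_completion_tokens
    return params


def create_limited_run(client: Any, **kwargs: Any) -> Any:
    # Assumption: client is an openai.OpenAI() instance.
    return client.beta.threads.runs.create(**build_run_params(**kwargs))


# Placeholder IDs; real ones come from creating a Thread and an Assistant.
print(build_run_params("thread_abc", "asst_xyz", max_prompt_tokens=500))
```

With a real client this would be called as `create_limited_run(client, thread_id=..., assistant_id=..., max_prompt_tokens=500)`. If a run hits one of these limits, it is worth checking its final status rather than assuming it completed normally.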


I’m using the Assistant Playground, and I’ve sent a message with 25k tokens (checked with a token counter). Once I do that, I get:

You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs:

I’m on usage tier 1, so the 60k TPM limit for gpt-3.5-turbo should suffice.

UPDATE: I forgot to add credits to my balance. Once I added credits, it worked like a charm.