Impact of Instruction Size and Thread Length on Token Usage in OpenAI Assistant

Hello,

I am seeking clarity regarding the token utilization in the OpenAI Assistant’s responses. Specifically, I am interested in understanding how the size of the assistant’s preset instructions influences the number of tokens used in processing each request. Does a larger set of instructions result in a higher token consumption for each interaction?

Additionally, I am curious about the token economics in relation to the age and length of threads. When we interact with the Assistant in a thread that has accumulated a significant number of messages, does this affect the token usage compared to initiating a conversation in a new, message-free thread? Essentially, I’m trying to discern whether the historical length of a thread has an impact on the token cost of subsequent interactions within that thread.

Any insights or technical explanations regarding these aspects would be greatly appreciated.

1 Like

There isn’t clear information on assistant token usage yet. Not on the documentation and not on this forum I’m afraid.

Still, it would be natural that the pre-set instructions will consume tokens per message, and that the more tools you use (dalle, retrieval, code interpreter) the longer these instructions are. You can try starting a new assistant with gpt-3.5 with all the tools and no additional instructions and ask it to give you the instructions is has verbatim. This can give you an estimate.

Alternatively, or additionally, you can use your usage page to monitor your billable tokens before and after your interaction, to evaluate the final cost and compare that with the estimations you’d have from the other method. This way you can get to a correlation from how many tokens inputed in instructions to how much you’re billed for 1st, 2nd, Nth message within a thread.

You can see other threads where colleagues are trying to get to the same predictability that you and I are also looking for, without success:

2 Likes

Hello Jorge,

Firstly, I’d like to extend my gratitude for your insightful response. It’s always enlightening to engage in these knowledge-sharing exchanges.

To add a bit more context to my query and foster a deeper understanding, I conducted an experiment. I used the OpenAI Assistant with varying instruction sizes and in different thread types. Here’s a quick summary of what I found:

Assistant ID GPT Model Thread Type Instruction Length (Num Characters) Cost for 2 Questions (Answers < 50 Tokens)
1 gpt4-preview new 30,708 $0.15
2 gpt4-preview new 6,125 $0.03
2 gpt4-preview old 6,125 $0.03

These results seem to indicate a clear relationship between the length of instructions and the associated cost. However, the thread’s age and message history appear to have less impact on the cost for similar interactions.

4 Likes

The cost of unregulated context usage is going to be addressed in an upcoming assistants update, no timelines on that as yet, but it should allow developers to control the context size/usage.

I should also point out that if cost is currently an element that is having a significant impact on your project/product, you should understand that AI is moving very rapidly and in 12 months (nothing in project terms) costing will be significantly reduced and will tend to zero over time. Long term planning for AI based projects needs to consider taking market share today as a priority and balancing that against the initial cost.

1 Like

Gracias for sharing your results Luis, that adds up and is in line with what we’ve seen. The instructions are of course counted in any new thread.

What do you mean by old and new threads though?

1 Like

Sure,

With “new” thread I start a new conversation with one of my assistants, without any old messages stored in It.

And with the old one I send the same messages to a thread of that assistant that has 40 more old messages stored in that thread.

Not sure when it was added, but for now I see tokens usage on a particular run.
You’re receiving usage data on retrieve run endpoint (run must be completed): https://platform.openai.com/docs/api-reference/runs/object#runs/object-usage.
It looks pretty straight forward for calculation now.

According to my tests, instructions + attached files counted as input tokens on each run.

1 Like

Hello everyone. Thank you for your responses, but somehow it didn’t become clearer. Still, does the content of threads count towards the incoming tokens or not? The thing is, threads can be quite substantial, and passing a whole thread with a request to the model can significantly affect the cost of that request. The Playground doesn’t provide an answer to this question, it gives the impression that the content of threads is not included in the incoming tokens, whereas there is an alternative opinion on the forum. Please share your experience, perhaps someone has already figured out this question for sure.

just dump the json after completion, it will give you a token count. I think this will give you at least a relative reference for counts plus you can view what is happening in the json object with the model.

import json
print(json.dumps(json.loads(response.model_dump_json()), indent=4))