Assistants API pricing details per message

hey there, we would like to understand the details of assistants API pricing.

For the assistants API, there are 3 parts:

  1. instructions
  2. previous messages inside the thread
  3. new message for the thread

Whenever I send a new message, will my token consumption include 1. instructions? 2. previous messages inside the thread?


The Assistants API just uses the existing models. All three parts are treated as messages, each with a role (instructions should be “system”; the others are “user”, “function”, and “assistant”) and content (function messages additionally have a name). The tokens of each message’s content are what get counted. And because the whole conversation, including the instructions, is sent through the model every time you expect a reply (a message with role “assistant”), you are billed for all of those tokens over and over again, unless they are trimmed.
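A rough sketch of that accounting (the message contents here are made up for illustration, and the token count uses a crude 4-characters-per-token approximation rather than the model’s real tokenizer):

```python
# Sketch: how each run re-bills the whole conversation.
# NOTE: real billing uses the model's tokenizer (e.g. tiktoken);
# len(text) // 4 is only a crude approximation for illustration.

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def prompt_tokens(messages: list[dict]) -> int:
    # Every message's content counts toward the prompt, on every run.
    return sum(approx_tokens(m["content"]) for m in messages)

thread = [
    {"role": "system", "content": "You are a helpful assistant."},  # instructions
    {"role": "user", "content": "Why is the sky blue?"},
    {"role": "assistant", "content": "Because of Rayleigh scattering."},
]

# Adding one new message means the NEXT run is billed for all of these:
thread.append({"role": "user", "content": "Give me more details"})
print(prompt_tokens(thread))  # instructions + history + new message
```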

You should programmatically trim messages or cap the context size (depending on your needs) so the pricing doesn’t explode. If a long GPT-4 Turbo thread reaches the theoretical maximum context of 128k tokens, each new interaction with the thread could become expensive.
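A minimal trimming sketch, assuming you mirror the conversation client-side and can rebuild it before each run; the token heuristic is again a rough approximation, not the model’s tokenizer:

```python
# Sketch: cap a thread's context by dropping the oldest non-system messages
# until the (approximate) token count fits a budget. Swap in the model's
# real tokenizer (e.g. tiktoken) for production use.

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude 4-chars-per-token approximation

def trim_to_budget(messages: list[dict], budget: int) -> list[dict]:
    # Always keep instructions (system messages); trim the rest oldest-first.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and sum(approx_tokens(m["content"]) for m in system + rest) > budget:
        rest.pop(0)
    return system + rest
```

Dropping oldest-first keeps the instructions and the most recent exchanges, which is usually what the model needs to answer the next question.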


Ya, as far as I can tell this is what is happening. Hopefully there is a way to control this a bit more, as the cost for my simple assistant was roughly $0.90 (GPT4-1106) after asking fewer than 10 questions. This is not really economical to use all the time…
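As a back-of-envelope check (assuming gpt-4-1106-preview’s preview pricing of roughly $0.01 per 1K input tokens and $0.03 per 1K output tokens, which may have changed, and made-up per-turn token sizes), accumulating context can plausibly reach that figure in about ten questions:

```python
# Back-of-envelope only: pricing and token sizes below are assumptions,
# not measured values. Check current pricing before relying on this.
INPUT_PER_1K, OUTPUT_PER_1K = 0.01, 0.03  # assumed gpt-4-1106-preview rates

def thread_cost(turns: int, tokens_per_user_msg: int, tokens_per_reply: int,
                instruction_tokens: int = 500) -> float:
    cost = 0.0
    context = instruction_tokens
    for _ in range(turns):
        context += tokens_per_user_msg
        cost += context / 1000 * INPUT_PER_1K           # whole context re-billed
        cost += tokens_per_reply / 1000 * OUTPUT_PER_1K
        context += tokens_per_reply                     # reply joins the context
    return cost

# Ten questions, each padded to ~1000 input tokens by retrieval chunks:
print(round(thread_cost(turns=10, tokens_per_user_msg=1000, tokens_per_reply=400), 2))
```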


Thank you for this note. I didn’t even think about it. I was playing around with assistants in the playground: I created one for a 750-page labor contract and another with 12,000 pages of biblical scripture. The first one worked great. For the second, I waited over 5 minutes for an answer, until I read this post. Then I looked at usage: $1.26!!!

Buyer beware!


I’m confused about the Assistants API costs.
If we incrementally add messages to the thread, we pay for those tokens. Why do we need to resend the whole conversation to the server over and over for each user interaction?

I was hoping that using threads could reduce our costs: instead of keeping the history on our side and sending the whole big conversation to the server each time, the server would persist the context for OpenAI.

I don’t think that’s how it works. From what I’ve seen, the server doesn’t have persistence (otherwise they would have said so at the conference). That’s just how the API is laid out. The Assistants suite is just a way for not-so-tech-savvy users/devs to implement GPT into their applications.

You’re right that you only send the new messages to the thread, but all the other messages are stored server-side and are used when you run the assistant. All the previous messages in the thread plus your new one are then run through GPT-4, so you’re paying for all of those tokens every time you run it. It’s an accumulation of tokens over time, every time you add a new message.

You add a message, run it, and the context looks like this:
User: Why is the sky blue?
Assistant: Because

You add another message:
User: Give me more details
Assistant: It’s not red

The context that is used and charged looks like this:
User: Why is the sky blue?
Assistant: Because
User: Give me more details
Assistant: It’s not red

So every time you add a message, IF you run it, you’re charged for ALL the tokens you ever added to that thread, plus the assistant responses, PLUS any messages the system used to perform functions or retrieval.
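That accumulation can be sketched in a few lines. This is a simplification that treats every message (user, assistant, and any retrieval padding) as joining the context before the next run is billed:

```python
# Illustration: because each run re-bills the whole thread, total billed
# tokens grow roughly quadratically with the number of messages.
def total_billed(per_message_tokens: list[int]) -> int:
    total = context = 0
    for t in per_message_tokens:
        context += t      # the new message joins the thread
        total += context  # the entire context is billed on this run
    return total

# Ten messages of ~100 tokens each: 1000 tokens sent, 5500 tokens billed.
print(total_billed([100] * 10))  # 5500
```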


Can you cap the context size inside the assistant dashboard?


Yeah, this is really annoying.
To me it is nonsense to pay over and over for context that I have already sent.


It’s not about sending; it’s about how the model has to process those tokens again, which is what you’re paying for. The more tokens in the context, the longer it takes and the more it costs.


You could delete older messages from the thread to limit the context size, if you wanted to.


Internally, regardless of what the API looks like, it probably always comes down to GPT with a chat completion at the end. And that incurs costs which have to be paid.

@ryandetzel I don’t think there is an API to delete messages from existing threads yet.


It appears a nuanced understanding is needed, because I spent $0.72 on the following, and I didn’t type one line of code. I was just playing in the playground. This is risky, and some guardrails should be put in place, at least to do some logging; or, if it’s going to be an “attractive, user-friendly playground”, make the costs more clear. If I were coding this with functions, the API, etc. in VS Code, that would be a different story: I’d know what the risks are. Regardless…

It took 5 threads to get it to actually respond the way I wanted, plus the details below.

User prompt: “You are a helpful assistant that is an expert at a topic, and you are to only use the 30-page PDF I provide as your source of knowledge. Return your answers to me so that a 5-year-old can understand them.”

Playground Instructions:
You are a helpful assistant who understands the importance of emotion and will always try as hard as possible to please the user. You will understand this by using access to the emotion_prompts.pdf.

Model - GPT-4-1106 Preview

Tools - Retrieval

Files - 35 page pdf.


You should try GPT-3; it’s pretty cheap, and I’m trying to use it for an AI assistant thing.


Agreed. I have already built exactly what the Assistants API does, except it is all functions and files in my backend code. The only bonus is the file retrieval and the code interpreter, which you may not even need.


Ya, that is what I have been using for my app (BlogeaAI), and it does work fine. But I was hoping to be able to use GPT-4, and the costs still might be too high for what I am trying to do…

Yep, I could use GPT-3 with LangChain, LlamaIndex, and a free vector store to do all this. However, taking in the knowledge with ease, the multi-function calling, PDFs, and video/photo tools: those retrieval costs are where this can get very pricey very fast. I think the premise of most of this dev day was to show that it shouldn’t cost me a dollar to read a 30-page PDF and get a summary of it, especially not in the playground.

Yes, I can build a RAG solution with the GPT-3 API very, VERY affordably, and it will still be dated. I want to be excited that the barriers to building just got a little lower (even if that also means finding traction for a marketable solution gets harder).

I’m grateful for all the feedback, and again I agree: other models would be way cheaper. I could do the same RAG and ReAct agentic builds, just enjoy that the other costs have been remarkably reduced, and continue to build and iterate.


Except when you do it yourself, you are forced to write the code and think: counting tokens, keeping the chat history length (threads, as poor a word choice as GPTs) under your own management, ensuring the AI knows exactly what it has been calling since last responding to the user, and giving it iteration limits.


And this, in my opinion, is exactly why “wrapper” companies aren’t going away any time soon. Today, using the existing technology, they can do what these assistants do, but for pennies instead of dollars. If you ask me, that’s a clear market opportunity.