Assistants API Pricing and Token Usage

I’m starting to go for more of a hybrid thing. To the user it all looks the same but it’s mostly a standard chat with history managed on my end combined with tightly controlled assistant mini sessions. The runs and steps with code interpreter are so nice and so powerful.

To my knowledge NO chunking is done before the 128k (GPT-4) window is exceeded. There is no real proof that the chunkng is even working afterwards. You are paying for all files used and their corresposnding tokens usage. So there is no gain when using the assistant when it comes to the costs. To me its not clear if there is a real benefit right now. Possibly in the future but for the moment I would stick with GPT3-4 / 3.5 and the chat API. Do the message history yourself, include the file content as part of the user message… This should lower the cost and give you finer control over the model.


History management feels the same as chatgpt. Would like to know more about how they are doing it. After about the same length of conversation assistants start to fall apart in the same way.

The documentation does mention that assistants is currently aimed at quality not cost or performance, and that those options will be added in time.

1 Like

was asking to gpt-4-1106-preview
does this right ?

Conversation :

:dizzy: Topic: OpenAI Model Pricing Table

:robot: :gear: Message From ChatGPT (System):

You are an assistant that OpenAI Model Pricing Table

:nerd_face: Message From You:

can you generated price list of models openai based of your knowledge in table ?

:robot: Message From ChatGPT (Assistant):

Certainly, I can provide a hypothetical example of what a price list for OpenAI models might look like. However, as of my last update in April 2023, OpenAI had not publicly released a detailed price list for individual models. Pricing for access to OpenAI’s API, which includes models like GPT-3 and its successors, typically depends on the amount of compute used, measured in tokens processed.

Here’s an illustrative example of what a pricing table could look like, assuming OpenAI would charge based on model complexity and usage:

Model Description Base Rate per 1,000 Tokens Monthly Subscription (Optional)
GPT-3 Standard language model $0.06 $100 (includes 2M tokens)
Codex AI for coding $0.10 $200 (includes 1.5M tokens)
DALL-E Image generation $0.15 per image $300 (includes 200 images)
GPT-4 Advanced language model $0.08 $150 (includes 2.5M tokens)
GPT-4-1106 Specialized version of GPT-4 $0.09 $200 (includes 3M tokens)
ChatGPT Fine-tuned for conversation $0.07 $120 (includes 2.2M tokens)
DALL-E 2 Enhanced image generation $0.20 per image $400 (includes 250 images)


  • The “Base Rate per 1,000 Tokens” is a hypothetical cost for processing 1,000 tokens, which typically equates to about 750 words.
  • The “Monthly Subscription” is a hypothetical package offering a set number of tokens or images at a fixed price, potentially with a discount compared to the base rate.
  • The prices above are purely illustrative and do not reflect actual OpenAI pricing.

For the most accurate and current pricing, you would need to check OpenAI’s official website or contact their sales team directly. Pricing models can also include additional factors such as the type of usage (commercial vs. research), volume discounts, and premium features like faster response times or custom model training.

I suppose retrieval tool in Assistants work like any other RAG solution out there. IE: it gets an embedding of the question, search the index for semantically close chunks of the files and build a prompt where the content of the chunks is added as context.

I can’t include links yet, as this is my first post, but google “langchain Retrieval-augmented generation (RAG)” for a deeper explanation about that.

Now, what we don’t have visibility over is how many chunks does it include in the prompt, the chunk size when indexing (in tokens) and the similarity factor (vector distance between question and chunks).

Based on my prior usage of langchain doing “custom chatbots based on private content”, it can be pretty expensive.

Under the hood, the final prompt submitted to the model will be something like:

Use the following pieces of context to answer the question at the end. 
If you don't know the answer, just say that you don't know, don't try to make up an answer. 
Use three sentences maximum and keep the answer as concise as possible. 
Always say "thanks for asking!" at the end of the answer. 
Question: {question}
Helpful Answer:

Where context will be 5 or 10 or N chunks of text extracted from your files (think something like 3k tokens total), based on the chunk vector semantic similarity with the question (parts of your files that have a great chance of talking about the subject of the question).

Obviously OpenAI will be using some sort of fine tuning to avoid such a lame prompt, but the context part is inevitable.

Costly? Yes, but pretty much as costly as other similar solutions (plus the premium of an OpenAi solution).

Hi and welcome to the forum.
You are right, we need more clarity on how the assistant works under the hood.
We also need a way to estimate cost in a better way.

Currently, an approach over the chat completion API is better suited. Of course, you need to handle the messages and RAG functionality yourself. But at least you have everything under control.

1 Like

For that I suppose one could take a look at Dexter from

I went a bit crazy with assistants api and gpt-4-1106-preview and used more credit in 1 day than i had done in 1 month, i just didn’t realise. But no complaints. Openai emailed me to say i had exceeded my threshold limit that i had set. Just didn’t realise how expensive the combo could be. I’m now mostly using 3.5 turbo with completion api again but imo the assistants api with gpt 4 is really quite alot better (but can be comparitively expensive).


It would be beneficial to have usage metrics per assistant, per thread, per message, and per run request, similar to the call reply in the completion API. My objective is to divide usage per assistant and prevent the sending of messages once a specified usage limit (in tokens) is reached. This is a crucial API feature for many of us, and it should ideally be implemented already.

Are you currently working on this feature?

Thank you.

1 Like

it should generate token usage for each run (prompt + completion token counts) same with the chat completion API. the LLM adds ‘messages’ to the threads during ‘runs’.

1 Like

Don’t see a reason to report this on the run. I am sure we will see this later and that it is hidden for the moment because we would see that the system is consuming too many tokens.

It is not clear at all. At this point the assistant api feels like an alpha release, and called beta and all the while charging us an unknowable amount (until after the fact). Kinda scammy. I don’t think it is a grift but it seems like a pretty bad implementation. I mean honestly, if we can’t see our token use then how do we know when to cut the cord? And lets not forget the performance of the assistant api is pretty bad, like unusable. I don’t think it is good business practice to have us devs paying to kick the tires of their clearly not-ready-for-beta product. Feels like a cash-grab based on hype. “Hey everyone look at this cool thing!” (meanwhile no way to know how much it actually costs in use or if it works well at all)


pretty sure all the existing messages (users, assistants) in the thread are being read by the LLM every time you create a run - this is the only plausible explanation for why the LLM can add contextually relevant assistant messages after each run. And also why it makes sense to add the token counts on the run object (chat history = prompt token count, assistant message = completion token count)

I connected Assistant API to Facebook Messenger for replying messages purpose using gpt-4-1106-preview model. Then I used this token calculator to get the amount of token for both input and output ways.

Then I created a sheet to summarize the total cost. My sheet displayed $0.082 in total while the usage in my API dashboard told me that I was costed $0.82. That’s 10 times compared to the price announced.

After reading all the replies in this topic, I noticed that I had uploaded a file and set the Retrieve ON. That file has total >8000 tokens and when I put every input I made along with the file, the amount increased to exact the number of $0.82.

I don’t know if this true but seems like it’s very expensive when I upload files to Assistants. But without doing that, I cannot get good answer by the GPT.

one question:

while creating run in assistants, does tokens in instructions param count as a input on the top of the instructions provided while creation of assistant.

i’m using lots of tokens with assistants api and in assistants playground. i have been using big prompts and getting big responses but still, it’s getting too expensive for my liking. fantastic capabilities though.

I have a question regarding the submit_tool_outputs. I provide the assistant with custom tools and the agent is using a tool, the run pauses and the status changes to “required_action”. If i now provide the function return value with submit_tool_outputs to the agent, will i be billed the complete context of the thread? Or did openai find some clever way to avoid it?

If your refer GPT as a service, then I suppose thread is here to retain the state and it’s stored at server-site.

It’s possible to check out the token usage of Assistants API in ver.1.9.0, now!