OPENAI ASSISTANTS API PRICING - help pls

Hi guys,

I hope you are well. I am building a mobile app that is using the GPT Assistants API to create the avatar that users will be chatting with.

I am trying to create a financial model, and it is super difficult to pin down how much the Assistants will cost because it obviously depends on tokenisation etc.

How do you think I can go about estimating the cost of the OpenAI Assistants as accurately as possible for the model?

@eddie4 Please let me know if you get an answer to this

@eddie4 @naypat were you guys able to figure it out please?

@DhruvAwasthi Yes I was.

Here’s an example breakdown that may help you.

For 1,000 users asking “What’s the annual leave policy?” (7 tokens) and getting back 100 tokens in response using GPT-4:

  • Input: 7 tokens x 1,000 users = 7,000 tokens. Cost = 7,000 tokens * $0.03/1,000 = $0.21.
  • Output: 100 tokens x 1,000 users = 100,000 tokens. Cost = 100,000 tokens * $0.06/1,000 = $6.

Total token cost = $0.21 + $6 = $6.21.

For retrieval, if you’re storing 1GB of data, it’s $0.20/GB/day. So, for one day, your total would be $6.21 for the tokens plus $0.20 for retrieval storage, making it $6.41. The storage charge recurs daily, and you can scale it based on how large the average file would be in your use case.
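
If it helps, the arithmetic above can be put into a small Python sketch so you can swap in your own numbers. The function name and parameters here are made up for illustration, and the prices are just the ones quoted in this post (they may be out of date), so treat it as a rough model rather than an official calculator.

```python
def estimate_cost(
    users: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
    storage_gb: float = 0.0,
    storage_price_per_gb_day: float = 0.0,
    days: int = 1,
) -> float:
    """Rough cost in dollars: one question per user plus daily file storage."""
    # Token charges are billed per 1,000 tokens, separately for input and output.
    input_cost = users * input_tokens * input_price_per_1k / 1000
    output_cost = users * output_tokens * output_price_per_1k / 1000
    # Storage is billed per GB per day, independent of token usage.
    storage_cost = storage_gb * storage_price_per_gb_day * days
    return input_cost + output_cost + storage_cost


if __name__ == "__main__":
    # Numbers from the example: 1,000 users, 7 tokens in, 100 tokens out,
    # GPT-4 at $0.03 / $0.06 per 1K tokens, 1GB stored for one day at $0.20.
    total = estimate_cost(1000, 7, 100, 0.03, 0.06, 1.0, 0.20, 1)
    print(f"${total:.2f}")
```

With the example numbers this comes out to the same $6.41. Note that it only counts the tokens you type and the tokens you read, not any extra context the Assistant adds on its own.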

Prices can change, so check OpenAI’s pricing page for the latest. Hope that helps clear things up! Let me know if you need more clarifications.

You can calculate the rough estimate based on system prompt length and length of conversations you are looking to store as context. Here is the pricing list for different models.
https://openai.com/api/pricing/

Are you sure it works this way?

From what I have been reading and exploring, it seems like the document tokens are also added in each user message which sometimes can be up to 16,000 tokens.

@_j Can you please comment on this scenario?

@DhruvAwasthi

Yes. You can set limits on the Assistant’s response by customising how many tokens it can respond with. And of course you give the Assistant the message, so you decide the input tokens.

All information is on the OpenAI pricing website, though not as easily understandable.

Oh okay @naypat. Thanks for your help man!

When the AI within Assistants emits a tool call and receives a language response, both are stored as messages within the thread with their own roles, but they are not presented to you as messages you can retrieve. The AI can then continue with successive calls to a tool, or finally output to the user.

These tool responses continue to be part of the thread and are re-sent on every run until they drop out, either because of the model’s limited context length or because you use truncation_strategy to limit the number of past turns.
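
To see why this matters for a cost model, here is a rough Python sketch of how billed input tokens grow when the whole thread is re-sent on every turn. It is a simplification with made-up names: it assumes every turn has the same user and reply length, ignores tool messages, and models truncation as "keep the last N turns", which is only an approximation of what truncation_strategy does.

```python
from typing import Optional


def cumulative_input_tokens(
    turns: int,
    system_tokens: int,
    user_tokens: int,
    reply_tokens: int,
    max_past_turns: Optional[int] = None,
) -> int:
    """Total input tokens billed over a conversation that re-sends its history."""
    total = 0
    for turn in range(turns):
        # Number of earlier turns still included in the context for this run.
        past = turn if max_past_turns is None else min(turn, max_past_turns)
        history = past * (user_tokens + reply_tokens)
        # Each run pays for instructions + kept history + the new user message.
        total += system_tokens + history + user_tokens
    return total


if __name__ == "__main__":
    # Hypothetical 10-turn chat: 50-token instructions, 7 tokens in, 100 out.
    print(cumulative_input_tokens(10, 50, 7, 100))
    # Same chat, but only the last 2 turns are kept as context.
    print(cumulative_input_tokens(10, 50, 7, 100, max_past_turns=2))
```

Without a limit, the input total grows roughly with the square of the number of turns, which is why a per-message estimate tends to undershoot for long chats.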

The token figure for “document tokens” is from a private message that was solicited. The file search that can be done on uploaded documents is another type of tool the AI can use internally, with the results added to the thread as context. The token count comes from the documents being chunked into 800-token pieces (with overlap), and the search returning 20 chunks (with no relevancy cutoff employed), for 800 x 20 = 16,000 tokens. The actual figure varies depending on how many partial chunks, or chunks from small documents, are in the results. (gpt-3.5 gets 5 chunks back.)
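
As a rough illustration of what that does to the input side of the bill, here is a small Python sketch. The function names are hypothetical, the defaults are just the 20 chunks of 800 tokens described above, and the $0.03 per 1K price is the GPT-4 figure quoted earlier in the thread. It treats every chunk as full-size, so it is an upper-bound estimate.

```python
def file_search_context_tokens(chunks_returned: int = 20, chunk_tokens: int = 800) -> int:
    """Upper-bound tokens injected by one file search, assuming full-size chunks."""
    return chunks_returned * chunk_tokens


def input_cost_per_message(
    message_tokens: int,
    price_per_1k: float,
    chunks_returned: int = 20,
    chunk_tokens: int = 800,
) -> float:
    """Input cost in dollars for one user message that triggers one file search."""
    context = file_search_context_tokens(chunks_returned, chunk_tokens)
    return (message_tokens + context) * price_per_1k / 1000


if __name__ == "__main__":
    print(file_search_context_tokens())        # 20 chunks of 800 tokens
    print(file_search_context_tokens(5, 800))  # the gpt-3.5 case: 5 chunks
    # A 7-token question with file search vs. the same question without it.
    print(f"{input_cost_per_message(7, 0.03):.4f}")
    print(f"{input_cost_per_message(7, 0.03, chunks_returned=0):.4f}")
```

So a 7-token question can be billed as roughly 16,000 input tokens once the search results are attached, which is far more than the message itself.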

There are new parameters where you can specify the number of returned chunks, along with the file chunking size when you create a vector store.

Assistants does not have a high-quality way to limit the length of its response; that must instead be done by guidance to the AI. The current parameters only terminate output after the cost has already been incurred, especially if you set unrealistically low values that do not allow tools to operate.

Sorry for the late reply.
Due to the limited flexibility, I moved to LangChain to build the RAG. Now I have more flexibility to change things like retrieval and message trimming as we want. And this really seems to work.

Thank you for your detailed response!