I’m trying to get a clear understanding of the Assistants API pricing, specifically with my current setup. Here’s what I have:
- A single assistant
- Model: gpt-4-1106-preview
- 4k tokens for input instruction only
- Outputs average around 200 tokens
- 20 files, each slightly under 0.01 GB
After running about 10 messages in a single thread with the same assistant, I noticed I was charged $3. I’m trying to break down the costs and would appreciate some insights.
Here’s my understanding of the pricing model:
dailyCost = fileCosts + modelCosts
fileCosts = totalFileGB * $0.20 // per GB, per assistant, per day
modelCosts = (inputKTokens * $0.01) + (outputKTokens * $0.03) // per 1k input + per 1k output tokens
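To make that model concrete, here is a minimal Python sketch of it, using the rates assumed above (gpt-4-1106-preview at $0.01 per 1k input tokens and $0.03 per 1k output tokens, plus the quoted $0.20/GB/day retrieval rate); the function name and parameters are hypothetical, not part of any OpenAI API:

```python
# Rough sketch of the poster's cost model. Rates are assumptions taken
# from the quoted pricing page and may change.
INPUT_RATE = 0.01       # $ per 1k input tokens (gpt-4-1106-preview)
OUTPUT_RATE = 0.03      # $ per 1k output tokens
RETRIEVAL_RATE = 0.20   # $ per GB per assistant per day

def estimate_daily_cost(messages, input_tokens, output_tokens, attached_gb):
    """Estimated minimum daily cost: per-message model usage plus file storage."""
    model_cost = messages * (input_tokens / 1000 * INPUT_RATE
                             + output_tokens / 1000 * OUTPUT_RATE)
    file_cost = attached_gb * RETRIEVAL_RATE
    return model_cost + file_cost

# The setup above: 10 messages, ~4k input and ~200 output tokens each,
# 20 files at ~0.01 GB.
print(round(estimate_daily_cost(10, 4000, 200, 20 * 0.01), 2))  # 0.5
```

Under these assumptions the model predicts roughly $0.50, well short of the observed $3, which is part of why the numbers are confusing.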
Additionally, I came across two points regarding the retrieval tool pricing that I’m struggling to comprehend:
- The Pricing page states: “Retrieval $0.20 / GB / assistant / day (free until 01/12/2024).” Does this mean my current costs are entirely dependent on model costs?
- More intriguingly, there’s a statement: “In addition, files attached to messages are charged on a per-assistant basis if the messages are part of a run where the retrieval tool is enabled. For example, running an assistant with retrieval enabled on a thread with 10 messages, each with 1 unique file (10 total unique files), will incur a per-GB per-day charge on all 10 files (in addition to any files attached to the assistant itself).”
Regarding point number 2, I’m not sure what this implies and how it could affect me given my setup.
Could anyone provide some clarity on this? Thanks in advance!
You are way off base with your “model costs” calculation, which simply cannot be made in advance: the internals are autonomous, iterative, and undocumented.
gpt-4-1106-preview: 128k-token context length, max 4k output tokens (the AI resists producing output over ~1k)
Threads don’t have a size limit. You can add as many Messages as you want to a Thread. The Assistant will ensure that requests to the model fit within the maximum context window. As the run progresses, the Assistant appends Messages to the thread with role="assistant". The Assistant will also automatically decide which previous Messages to include in the context window for the model. This has an impact on both pricing and model performance.
Retrieval currently optimizes for quality by adding all relevant content to the context of model calls.
Note that you are not charged based on the size of the files you upload via the Files API but rather based on which files you attach to a specific Assistant or Message that get indexed.
Retrieval is also driven by iterative, browsing-style function tool calls: part of the documentation obscures this, while another part points out that it requires models capable of emitting parallel tool calls.
Essentially, there is no predicting and no controlling the usage and costs. The AI might internally get stuck in a loop of calling with errors while loaded with 100k of context tokens. You can at best estimate the minimum you will spend, but not the maximum.
Understood. So, regarding minimum costs, is the calculation I gave right?
If you remove the sources of unpredictability (disable code interpreter, disable retrieval, use no tools), then…
the input for each run should be the tokens of the current user message, PLUS all previous user messages and assistant responses in the thread, up to the model’s context length limit. (edit: plus the instructions, including unseen instructions, which are also included in each run)
Assistants does not have a parameter for specifying the maximum number of chat turns or chat tokens from thread conversation history to pass to a new model call. The maximum is the context length of the model used.
So then input cost per question depends on how long you let a user engage in the same thread before they are forced to start anew.
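A minimal sketch of how input tokens can accumulate over a thread under these rules, assuming each run resends the instructions plus the full prior history (actual truncation behavior is undocumented, and the function name and parameters here are hypothetical):

```python
# Cumulative input-token estimate for a thread where every run resends
# the instructions plus the entire prior conversation. This is an
# assumption: the Assistants API's real truncation logic is undocumented.
def thread_input_tokens(turns, user_tokens, assistant_tokens,
                        instruction_tokens, context_limit=128_000):
    total = 0
    history = 0  # tokens of prior user + assistant messages
    for _ in range(turns):
        run_input = instruction_tokens + history + user_tokens
        total += min(run_input, context_limit)  # capped at the model's context
        history += user_tokens + assistant_tokens
    return total

# 10 turns with 100-token questions, 200-token answers, 4k instructions:
print(thread_input_tokens(10, 100, 200, 4000))  # 54500
```

Note how the instructions alone account for 40k of the 54.5k input tokens billed, which is why long unseen instructions dominate per-thread cost.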
The file costs are per GB per day, multiplied by the number of assistants the files are attached to. So if you make five assistants with different instructions, each one gets a daily bill for the same attached file. OpenAI hasn’t clarified how partial GBs of storage are billed, but it seems the charge might start at $0.20.
Code interpreter is billed by how many threads have code interpreter sessions open. Again, the pricing page offers no further clarification. One can find that an unengaged chat session’s code interpreter should stop counting after an hour, so a thread left unused for a day (unless it is close to the date change) should not keep being billed.
All in all, with reports of “I tried it again, got hit with $70” or “we were billed $200 in a day”… it is something to watch and wait for improvement in documentation and function.
Clear enough! Thank you very much for your time!
Has there been any further clarification on these costs since the last post on this thread?
I’m trying to come up with a commercial pricing model for around 100 clients, each requiring their own custom-built Assistant, but I have no idea how to ensure our ongoing operational costs are covered. We can’t price this on a per-usage basis to our clients, so that is not an option.
Any advice or feedback on this would be greatly appreciated.
Someone who wants to complain about their limit of 40 messages within three hours could easily rack up more than $40 in billing when the assistant has multiple features enabled, such as retrieval over documents that fill the 128k gpt-4-1106 context length, with the agent then making many internal iterative calls per input.
If you do not charge per usage, you will be creating an application that is the best value for those who exploit your pricing model and the lack of budget configurability or transparency in Assistants.