OpenAI team, thanks for the work you’re doing. I’m thrilled with today’s news.
I have a few questions about the Assistants API pricing structure that don’t seem to be documented anywhere:
When are we charged? Is it upon initiating a run or when adding a message to the thread?
How are tokens calculated? Are we charged for the entire thread on each conversation turn (i.e., run)?
What about token calculation for a long thread that you guys might've truncated in the background?
How does token calculation work with knowledge retrieval?
How can I estimate the number of tokens before each run?
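For now I'm estimating on the client side with tiktoken, though I have no idea whether this matches what the Assistants API actually sends, since retrieval chunks and system prompts are invisible to us:

```python
import tiktoken

def estimate_thread_tokens(messages, model="gpt-4-1106-preview"):
    """Rough client-side token estimate for a list of thread messages.
    This only counts the visible message text -- it cannot see system
    prompts, tool outputs, or retrieval chunks the API adds on its own."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        enc = tiktoken.get_encoding("cl100k_base")  # fallback for newer models
    # ~4 tokens of per-message overhead is the usual chat-format estimate
    return sum(len(enc.encode(m["content"])) + 4 for m in messages)

messages = [
    {"role": "user", "content": "Summarize the attached report."},
    {"role": "assistant", "content": "Here is a summary..."},
]
print(estimate_thread_tokens(messages))
```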
The Assistants API takes on a lot of the backend work, but the pricing benefits are not clear. Developers could lose some control, which might even lead to extra costs.
I’m also confused. Do threads reduce input token usage for the context? That would be a game changer.
I’d also like to know how to access the “usage” parameter through the Assistants/Threads API like you can with Completions. Maybe that hasn’t been released in the beta yet.
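For reference, this is the field I mean in Chat Completions; as far as I can tell, there's no equivalent on the run object in the beta (the IDs below are just placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Chat Completions reports usage directly on the response:
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.usage)  # prompt_tokens, completion_tokens, total_tokens

# With Assistants, retrieving a run returns status, timestamps, etc.,
# but I couldn't find any usage/token breakdown on it in the beta:
run = client.beta.threads.runs.retrieve(
    thread_id="thread_abc123",  # placeholder ID
    run_id="run_abc123",        # placeholder ID
)
print(run)  # inspect the object yourself for any usage field
```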
Second this! I seem to recall Roman mentioning in today's presentation that Threads/Assistants is a stateful API, without the need to re-send the prior conversation each time. Is it right to assume that only the latest message then contributes to the final billing amount?
Exactly the same question here.
The cost of resending the entire conversation history every time a user sends a message was a blocker for taking our virtual assistant to production.
But with threads, if the pricing applies only to the last message, it would definitely encourage us to go live in production.
From what I can tell, you pay for:
- the overall conversation history and the answer, each time you press run;
- the actual token usage of any intermediate tasks created by a run using tools.
So there are no special gifts; you just pay for everything happening under the hood, which is understandable. In terms of control, however, it is difficult to manage. If your user uploads a very long document and the Retrieval method does not produce results, the model may fall back to reading the entire document (which counts as input tokens), and you can quickly end up at $1 for a single request.
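Back-of-the-envelope, assuming GPT-4 Turbo's $0.01 per 1K input tokens and a document that gets read in full:

```python
# Rough cost sketch (assumed gpt-4-1106-preview pricing: $0.01 / 1K input tokens).
# If retrieval misses and the model ingests the whole document as context:
doc_tokens = 100_000          # hypothetical: a very long uploaded document
input_price_per_1k = 0.01     # USD per 1K input tokens
cost = doc_tokens / 1000 * input_price_per_1k
print(f"${cost:.2f} of input tokens for a single request")  # -> $1.00
```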
One question I have: if you ask another question that requires reading the whole document, will it read it all over again, or has it extracted and saved some kind of info to avoid re-reading everything? (Which I honestly doubt…)
This was my assumption too, which is a huge bummer as I was hoping to stop paying for the entire conversation on each turn (user->assistant->user).
If what you’re saying turns out to be true, then it basically means the Assistants API only does a bit of a lift on the backend side (which many of us already had covered) without really offering any pricing benefits. Worse yet, developers lose some control, which could lead to paying extra.
That appears to be the case in my tests: you pay for the entire context every time you run it, and that context grows with every message. In addition, and this is kind of hidden unless you look at the logs, you're charged for any "internal" work it does for functions, code execution, and retrieval.
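To put rough numbers on it (hypothetical token counts, assuming the full thread is re-billed on every run):

```python
# Hypothetical illustration: if each turn adds ~500 tokens and every run
# re-bills the whole thread, billed input grows quadratically with turns.
turn_tokens = 500
context = 0
total_billed = 0
for turn in range(1, 11):
    context += turn_tokens    # thread keeps growing
    total_billed += context   # each run pays for the entire context again
print(total_billed)  # 27,500 tokens billed for only 5,000 tokens of content
```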
The API doesn't really tell you much about the token usage either, which makes costs hard to track.
I just ran a test and it cost $0.28 for one question, because the assistant kept failing when it tried to execute code. It retried over and over (behind the scenes), and each failure triggered another code run. Normally, without the code execution, this would cost a fraction of a penny, but with the assistant it was much more because of all the hidden work it did.
I also want to know. Surely they implemented something like LlamaIndex with a vector DB on the backend? Otherwise, I'm not sure what value this really adds, apart from not having to keep your own conversation thread in your app and re-submit it in its entirety every time… which isn't hard.
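For comparison, the "keep your own thread" version is just a list you append to and re-send with Chat Completions, something like:

```python
from openai import OpenAI

client = OpenAI()
history = [{"role": "system", "content": "You are a helpful assistant."}]

def ask(user_text):
    """Append the user message, re-send the whole history, store the reply.
    This is exactly the re-submission the Assistants API hides from you."""
    history.append({"role": "user", "content": user_text})
    resp = client.chat.completions.create(model="gpt-3.5-turbo", messages=history)
    reply = resp.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

print(ask("What is a vector DB?"))
```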
Imagine it starts a loop, iterating over and over like ChatGPT sometimes does. That would be crazy. If it already costs this much for one task, imagine what that would cost at scale.
Based on some quick tests with my documents (only a handful of messages, nearly $10 in charges), it seems to be referencing the entire thread. It may be my prompts that are causing this; however, I agree a breakdown of the tokens used would be extremely helpful. In its current state, it would let users rack up a tab very quickly if used improperly.
We absolutely need to understand the costs of calling the API, both in token counts and in Code Interpreter (CI) sessions, since those are $0.03 each and the API does not tell you when one is created.
Also, since the files being uploaded may be dynamic in size (i.e. related to user inputs), and the price is per GB, it would be great if the API returned that information as well so that we don’t have to open user data files.
I think file retrieval is billed differently: it is not token-based, but $0.20/GB/assistant/day. The file size is the storage size, not the retrieval size. That means, for a given day and a given assistant (with the same files attached; file changes count as a different assistant, I think), you are only charged $0.20/GB no matter how many times you use the files.
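To make that storage math concrete (this is just my reading of the pricing page, so take it with a grain of salt):

```python
# Assumed retrieval storage pricing: $0.20 per GB per assistant per day.
file_gb = 0.5       # hypothetical: 500 MB of files attached
assistants = 2      # same files attached to two assistants
days = 30
storage_cost = 0.20 * file_gb * assistants * days
print(f"${storage_cost:.2f}/month for storage alone")  # -> $6.00
```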
But I'm speculating based on how RAG works: they use these docs to pull out text that they pass to the model to answer the question, which means whatever text they pull out will be billed as tokens. This can be dynamic even with the same files.
That isn't at all clear to me. I understand they charge for storage, but when the content from a doc is brought into context, it probably also counts as tokens. There's just no way to know given the current shape of the API.
So, it appears that we aren’t just talking input/output tokens, but also context (history) as well as document retrieval/search. Whew! No wonder several folks (myself included) are seeing usage costs of $1 or more for just two or three questions.