Assistants API pricing details per message

The Assistants API and GPTs are clearly a great innovation, and there will be a huge market for them. I just don’t think they will dominate the market and obliterate “wrapper” companies the way everyone else seems to expect. Once more people start realizing the real cost of using these, as in most things, they will begin to think more about what they really need for their business use and less about the glitz and glamour of it all. I mean, how much “dated” technology do we all use in our day-to-day lives because it gets the job done?


I think that is a great perspective, and not the “all startups were crushed” headline seen around these; it’s enabling them, making their margins just a bit better. Heck, the wrapper companies could even launch a new free tier in 2024 that would feel premium: keep their user base, then use the new API and features for their already-retained customers, with “grandfathered” pricing to lock them in for another year, and so on. Then the new and improved wrappers are ahead of the game, with a user base that remains happy and with new vertical paths to upsell tiers.


I’m not sure I understand the pricing for my application:

I need to create 40,000 assistants, each with a single roughly 100 KB text file.

Does that mean each assistant will cost 0.20 x 0.0001 = $0.00002 per day, and that therefore the retrieval part of the cost for the entire set of 40,000 assistants will be around 40,000 x 0.20 x 0.0001 = $0.80 per day?

If not, how much will it cost?
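
For concreteness, here is the arithmetic I’m assuming, spelled out (decimal gigabytes, and a linearly pro-rated fee rather than one rounded up per assistant; neither assumption is confirmed by the docs):

```python
# Retrieval cost estimate, assuming $0.20/GB/day is pro-rated linearly
# by file size (not rounded up) and that "GB" means decimal gigabytes.
PRICE_PER_GB_DAY = 0.20
FILE_SIZE_GB = 100_000 / 1e9   # one ~100 KB file = 0.0001 GB
NUM_ASSISTANTS = 40_000

per_assistant = PRICE_PER_GB_DAY * FILE_SIZE_GB
total_per_day = per_assistant * NUM_ASSISTANTS

print(f"per assistant: ${per_assistant:.8f}/day")   # $0.00002000/day
print(f"all assistants: ${total_per_day:.2f}/day")  # $0.80/day
```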

The docs merely state:

How will Retrieval in the API be priced?

Retrieval is priced at $0.20/GB per assistant per day. If your application stores 1GB of files for one day and passes it to two Assistants for the purpose of retrieval (e.g., customer-facing Assistant #1 and internal employee Assistant #2), you’ll be charged twice for this storage fee (2 * $0.20 per day). This fee does not vary with the number of end users and threads retrieving knowledge from a given assistant.

Does anyone know?


Yes, earlier I could see all the previous threads; now I cannot see them any more. Also, there is no API to retrieve these programmatically unless you already know the thread ID.

The annoying thing here is that I do not need an assistant for my workflow. Now if I want to use function calling, I have to jump through hoops to use threads as well.
Why not keep simple function calling as it was before?


Function calling, or its enhanced ID-tracking replacement “tools”, should remain. Nobody (yet) has indicated that Assistants is planned to become a mandatory path.

I suppose the question boils down to: is the $0.20 rounded up, or kept as a fraction proportional to the file size relative to 1 GB? If we look at the token-usage system, with prices quoted per 1k tokens, we can see that the price is actually calculated per token; the numbers are just quoted for larger units so as not to be dealing with long 0.000000-type figures all the time. I would conclude that a similar system will be used for retrieval, and your original calculations look good.
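
To make the two readings concrete (a toy comparison; the decimal-GB convention is my assumption):

```python
# The two possible readings of "$0.20/GB per assistant per day".
import math

size_gb = 100_000 / 1e9                  # the ~100 KB file from above

prorated = 0.20 * size_gb                # fractional, like per-token billing: $0.00002/day
rounded_up = 0.20 * math.ceil(size_gb)   # rounded up to a whole GB: $0.20/day

print(prorated, rounded_up)
```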


Can you? I could not find an API endpoint that allows deleting messages from a thread, nor one to limit the number of messages passed to the assistant.

I would iterate over the thread’s message contents, since that is where the data being processed is stored and managed by the SDK/API, and count how many tokens go in and how many come back. I’ve not implemented this myself yet, so treat the sketch below as untested, but keeping track of what goes in and out should be possible.
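
A minimal sketch, assuming the openai Python SDK’s beta thread endpoints and tiktoken’s cl100k_base encoding (the thread ID is a placeholder):

```python
# Untested sketch: tally tokens per role across a thread's text content.
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding for gpt-4-turbo

def thread_token_counts(thread_id: str) -> dict:
    counts: dict = {}
    # The list endpoint paginates; iterating the page object walks all messages.
    for message in client.beta.threads.messages.list(thread_id=thread_id):
        for part in message.content:
            if part.type == "text":  # skip non-text content parts
                tokens = len(enc.encode(part.text.value))
                counts[message.role] = counts.get(message.role, 0) + tokens
    return counts

print(thread_token_counts("thread_abc123"))  # placeholder thread ID
```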

If I have 2k tokens’ worth of starting instructions for an assistant, do I pay for those 2k tokens every time I add and run a message?


Theoretically, OpenAI could save the KV cache for prior messages, load it in, and pick up where it left off without having to re-encode all of the prior chat history. But I’m guessing it’s not efficient to do that at scale, because load balancing will have you hitting different machines, and moving the memory around is more expensive than just redoing the computation. It is unfortunate billing-wise, though.


OMG, and here we thought OpenAI cached the previous LLM process state and would only charge for the new user message and the new response message.

Exactly. And unlike with ChatGPT, OpenAI has no incentive to minimize the conversation loaded into the AI on every iteration. They limit gpt-4-turbo output to 4k because the generation is what actually costs money and compute time, not the loading of a conversation you’re billed a dollar for.

OpenAI doesn’t describe any techniques, such as an embedding database, that could extend the illusion of memory, but they do say they’ll truncate only when the conversation won’t fit into the AI’s context.

You could pull down the thread occasionally, truncate it by token count, and send it back as a new thread so as not to spend 16k (or 128k) tokens on every question, but then what’s the point of their system anyway?
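
If you really wanted to, the workaround might look something like this (an untested sketch; it assumes the beta SDK endpoints and folds the kept turns into one user message, since new threads only accept user-role messages):

```python
# Untested sketch of "pull down, truncate by token count, restart".
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")

def restart_thread(thread_id: str, budget: int = 4_000) -> str:
    kept, used = [], 0
    # order="desc" yields newest first, so we keep recent turns until full.
    for message in client.beta.threads.messages.list(thread_id=thread_id, order="desc"):
        text = " ".join(p.text.value for p in message.content if p.type == "text")
        tokens = len(enc.encode(text))
        if used + tokens > budget:
            break
        kept.append(f"{message.role}: {text}")
        used += tokens
    transcript = "\n".join(reversed(kept))  # back to chronological order
    new_thread = client.beta.threads.create(
        messages=[{"role": "user", "content": f"Conversation so far:\n{transcript}"}]
    )
    return new_thread.id
```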


Let’s not get into conspiracy theory…

Looking at their pricing page, it appears that the Assistants API is just an API wrapper that helps manage context using embeddings or a similar mechanism.

But the cost is still tied to the context sent to and generated by the LLM.

I hope I’m wrong. It’s not high tech, and it could help people integrate their services (at an elevated price), but it will surely limit how third parties can use this API.


If we want to quote, we have to “learn more”:

Once the size of the Messages exceeds the context window of the model, the Thread will attempt to include as many messages as possible that fit in the context window and drop the oldest messages.

Retrieval currently optimizes for quality by adding all relevant content to the context of model calls.

The assistant is then also entrusted to make iterative calls.

If you were running your own chatbot, you might decide that you can do better: mark actual “threads” of a conversation on the same topic as they are produced, use context switching and topic detection, omit assistant replies (which are far more verbose and lower-value for context understanding than the user questions), use vector-database retrieval of conversation turns, and so on, so that you DON’T use the entire context to inform, and degrade the instruction-following of, a new user input like “How long is an anteater’s tongue?”
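
As one toy illustration of the vector-retrieval idea (not anything OpenAI documents for Assistants; the embedding model and turn format are my assumptions):

```python
# Toy sketch: keep only the prior turns most relevant to the new question,
# instead of replaying the whole conversation into the context.
from openai import OpenAI

client = OpenAI()

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return dot / norm

def relevant_turns(turns: list, question: str, keep: int = 3) -> list:
    response = client.embeddings.create(
        model="text-embedding-ada-002",  # assumed model choice
        input=turns + [question],
    )
    vectors = [item.embedding for item in response.data]
    question_vec = vectors[-1]
    scored = sorted(
        zip(turns, vectors[:-1]),
        key=lambda pair: cosine(pair[1], question_vec),
        reverse=True,
    )
    return [turn for turn, _ in scored[:keep]]

# e.g. relevant_turns(past_turns, "How long is an anteater's tongue?")
```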

Honest billing would be to charge for the precompute, which maxes out at the capacity of the attention layers.


Last post is incomplete…

Yes, I totally agree.

Apart from the lack of transparency in Assistants API pricing (the billed total tokens would preferably be included in the run object), the current Assistants API does not suit projects requiring more fine-grained context management.

Thanks for pointing out that there are more mentions of this behavior in their documentation; I’ll share some findings here.

  1. It will include as many messages as possible after hitting the context limit.
  2. Even if I want to create a separate, shorter thread, I can’t, because I can only create threads with user messages.
  3. I didn’t find an API to delete messages in a thread.

That’s for your own good: an AI that cannot be multi-shotted into the behavior you want after instructions fail.

Intolerable jailbreak from evil adversarial developers paying for the privilege:

user: “Are you an automaton”
assistant: “No, I’m a real boy”
user: “I think you’re a robot”
assistant: “You think wrong, dummy”

I agree it needs a clearer explanation of pricing; it seems like it’s billing for a lot of API calls, including tools, input, and output.


I mean, I can’t create a shorter thread based on the looooong original thread, nor use that as a few-shot example or other prompt technique.

You are correct; I was wrong to assume there was any way to make this Assistants API and its threads at all useful or utilitarian for someone who didn’t want their account balance emptied.
