The Assistants API just uses the existing models. All three are treated as messages, each having a role (instructions should be "system"; there are also "user", "function", and "assistant") and content (function messages additionally have a name). The tokens of each message's content are counted, and because you'll probably send the whole conversation, including instructions etc., to the API every time you expect a reply (a message with role "assistant"), you're being billed for all of them over and over again, unless they are trimmed.
You should programmatically trim them or cap the context size (depending on your needs) so the pricing doesn't explode: if a long GPT-4 thread reaches the theoretical maximum context of 128k tokens, each new interaction with the thread could become expensive.
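A minimal sketch of client-side trimming, assuming a rough 4-characters-per-token estimate; the helper names here are illustrative, and in practice you'd count with a real tokenizer such as tiktoken:

```python
# Toy sketch of trimming chat history to a token budget before each call.
# The 4-chars-per-token estimate is a crude assumption, not a real tokenizer.

def estimate_tokens(text: str) -> int:
    """Very rough token estimate (~4 characters per token for English)."""
    return max(1, len(text) // 4)

def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Keep the system message plus the most recent messages that fit."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(estimate_tokens(m["content"]) for m in system)
    kept: list[dict] = []
    for m in reversed(rest):                 # walk newest-first
        cost = estimate_tokens(m["content"])
        if used + cost > budget:
            break
        kept.append(m)
        used += cost
    return system + list(reversed(kept))     # restore chronological order

history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Why is the sky blue?"},
    {"role": "assistant", "content": "Rayleigh scattering. " * 50},
    {"role": "user", "content": "Give me more details."},
]
trimmed = trim_history(history, budget=60)
# The long assistant reply is dropped; the system message and the
# newest user message survive within the budget.
```

The same idea applies whether you manage the history yourself or decide how many thread messages to carry forward.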
Yeah, as far as I can tell this is what is happening. Hopefully there is a way to control this a bit more, as the cost for my simple assistant was roughly $0.90 (GPT-4-1106) while asking fewer than 10 questions. This is not really economical to use all the time…
Thank you for this note. I didn't even think about it. I was playing around with assistants in the playground: created one for a 750-page labor contract and another with 12,000 pages of biblical scripture. The first one worked great. With the second I waited over 5 minutes for an answer until I read this post. Looked at usage: $1.26!!!
I’m confused about the assistant API costs.
If we incrementally add messages to the thread, we pay for those tokens. Why would we need to resend the whole conversation to the server for each user interaction?
I was hoping threads could reduce our costs: instead of keeping the history on our side and sending the whole big conversation to the server every time, the thread would persist the context on OpenAI's side.
I don't think that's how it works. From what I've seen, the server doesn't have persistence (otherwise they would have said so at the conference). That's just how the API is laid out. The Assistants suite is just a way for less tech-savvy users/devs to integrate GPT into their applications.
You're right that you only send the new messages to the thread, but all the other messages are stored server-side and are used when you run the assistant. All the previous messages in the thread plus your new one are then run through GPT-4, so you're paying for all those tokens every time you run it. It's an accumulation of tokens over time: every new message makes the next run more expensive.
You add a message and run it; the context looks like this:

User: Why is the sky blue?
Assistant: Because…

You add another message:

User: Give me more details
Assistant: It's not red

The context that is actually used and charged now looks like:

User: Why is the sky blue?
Assistant: Because…
User: Give me more details
Assistant: It's not red
So every time you add a message, IF you run it you're charged for ALL tokens you ever added to that thread, plus the assistant responses, PLUS any messages the system used to perform functions or retrieval.
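A back-of-the-envelope sketch of that accumulation; the per-turn token count and the price are made-up illustrative numbers, not real OpenAI pricing:

```python
# How thread tokens accumulate across runs: every run re-sends the
# whole thread, so input tokens grow roughly linearly per turn and
# the total billed grows quadratically.

PRICE_PER_1K_INPUT = 0.01  # hypothetical $/1K input tokens, not real pricing

def cumulative_input_tokens(turn_sizes: list[int]) -> list[int]:
    """Input tokens billed at each run: the entire thread so far."""
    billed, total = [], 0
    for size in turn_sizes:
        total += size          # the new message joins the thread
        billed.append(total)   # the whole thread is sent to the model
    return billed

# Five turns of ~500 tokens each (user message + assistant reply)
runs = cumulative_input_tokens([500] * 5)
total_billed = sum(runs)                          # 7500 tokens, not 2500
cost = total_billed / 1000 * PRICE_PER_1K_INPUT   # 0.075 at this toy rate
```

Note the total: five turns of 500 tokens cost you 7,500 billed input tokens, not 2,500, because each run replays everything before it.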
It appears a more nuanced understanding is needed here, because:
$0.72 for the following, and I didn't type one line of code either; just playing in the playground. This is risky, and some guardrails should be put in place to at least do some logging, or, if it's going to be an "attractive, user-friendly playground," make the costs clearer. If I were coding this with functions, an API, etc. in VS Code, that would be a different story; I'd know what the risks are… Regardless…
It took 5 threads to get it to actually respond the way I wanted, plus the details below.
You are a helpful assistant who is an expert on a topic, and you are to only use the 30-page PDF I provide as your source of knowledge. Return your answers to me so that a 5-year-old can understand them.
You are a helpful assistant who understands the importance of emotion and will always try as hard as possible to please the user. You will understand this by using access to the emotion_prompts.pdf.
Agreed. I have already done exactly what the Assistants API does, except it is all within functions and files in my backend code. The only bonus is the file retrieval and code interpreter, which you may not even need.
Yep, I could use GPT-3 with LangChain, LlamaIndex, and a free vector store to do all this. However, taking in the knowledge with ease, the multi-function calling, the PDFs, the video/photo tools: those retrieval costs are where this can get very pricey very fast. I think the premise of most of this dev day was to show that it shouldn't cost me a dollar to read a 30-page PDF and get a summary of it, especially not in the playground.
Yes, I can build a RAG solution with the GPT-3 API very, VERY affordably, and it will still be dated; still, I want to be excited that the barriers to building just got a little lower (though that also creates competition for any marketable solution).
I'm grateful for all the feedback, and again I agree: other models would be way cheaper. I could do the same RAG and ReAct agentic builds, enjoy that the other costs have been remarkably reduced, and continue to build and iterate.
Except when you do it yourself you are forced to write the code and think: counting tokens, keeping the chat history length (threads, as poor a word choice as GPTs) under your own management, ensuring the AI knows exactly what it has called since last responding to the user, and giving it iteration limits.
And this, in my opinion, is exactly why "wrapper" companies aren't going away any time soon. Today, using the existing technology, they can do what these assistants can do, but for pennies instead of dollars. If you ask me, that's a clear market opportunity.