Correct. Earlier it was costly but could at least be constrained; now the cheaper one is emptying our pockets.
Maybe GPT-5 created this API.
It’s on purpose. They have limited bandwidth, and making it so expensive reduces their traffic. Fair enough.
Wow. These $35 disappeared quickly.
I thought it would be cheaper because I uploaded files with content and asked it to perform tasks using that content. It turns out I was paying for the entire file with each message.
Actually, that's exactly what happens, and it's painfully expensive.
See my post from a few minutes ago:
Yes, I would say the whole chunking and context-awareness piece is not available yet; it seems to be only a future feature. Otherwise I don't understand why we have to pay for it on every request. It's just unusable.
Thanks for your comment and advice.
I myself am working on a project that will mainly use OpenAI's API and the GPT models (other tools maybe later). Could you explain how you capture the output and make ChatGPT reuse it (i.e., how to constrain it so it avoids unnecessary extra research/token use)? Additionally, the same question for having the local program do the work instead of an API call?
Hi, thank you for your comment and deep explanation.
Could you explain, or would it be possible to contact you, to understand how you did what you described, and whether you had to use techniques such as an internal double thread to run the sub-actions, or something else?
I can outline the concept, hope this helps:
Every time you call the normal chat API you send a conversation history. At first it is just a system message and a user message - something like “Hi! let’s do [thing]”
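In code, that baseline looks roughly like this (assuming the openai v1 Python SDK; the model name is just illustrative):

```python
# Baseline: the full running history is re-sent on every call.
from openai import OpenAI

client = OpenAI()

history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi! Let's do [thing]."},
]

response = client.chat.completions.create(model="gpt-4", messages=history)
history.append({"role": "assistant", "content": response.choices[0].message.content})
```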
However, from there this system proceeds differently.
As each message is sent or received it is added to a chromadb (vector database).
Each message is also added to a class that stores the ‘normal’ messages, each prompt and response in order. This is what you see as a user.
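The storage step, as a simplified sketch (names and structure are placeholders, not the actual implementation):

```python
# Every message goes into a chromadb collection for similarity search,
# and into a plain ordered list that mirrors what the user sees.
import uuid
import chromadb

chroma = chromadb.Client()
messages_col = chroma.get_or_create_collection("conversation_messages")

class ConversationLog:
    """Keeps prompts and responses in their original order (the user-facing view)."""
    def __init__(self):
        self.messages = []

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        messages_col.add(
            ids=[str(uuid.uuid4())],
            documents=[content],
            metadatas=[{"role": role, "position": len(self.messages) - 1}],
        )
```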
Each API call with the conversation is also sent to gpt3, which is asked to identify any problems with the direction the conversation is heading, confusion, etc. These are used to adjust messages so they are more useful to gpt4 and won't drag things off-course, and also to create condensed versions and summaries of long messages. If a message is an aside it also gets tagged with a different subject.
These are all then stored in chromadb as what I think of as ‘shadow’ messages. Not what you see as a user, but tuned, compact, and effective for assembling very coherent conversations to send to gpt4.
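The shadow step, again very roughly (the prompt wording and metadata keys here are only illustrative):

```python
# A cheaper model condenses each message and flags drift; the condensed
# version is stored in chromadb alongside the originals.
import uuid
import chromadb
from openai import OpenAI

client = OpenAI()
chroma = chromadb.Client()
shadow_col = chroma.get_or_create_collection("shadow_messages")

def make_shadow_message(original: str) -> str:
    analysis = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": ("Condense this conversation message. "
                                           "Note any confusion or off-topic drift.")},
            {"role": "user", "content": original},
        ],
    )
    return analysis.choices[0].message.content

def store_shadow(original: str, role: str, subject: str = "main"):
    shadow_col.add(
        ids=[str(uuid.uuid4())],
        documents=[make_shadow_message(original)],
        metadatas=[{"role": role, "subject": subject}],
    )
```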
Once the conversation is fully underway and things are getting complicated every time the user sends a message the conversation history (list of messages starting with system message for gpt4) is built from the ground up using everything on hand. Chromadb has great features to make word, sentence, and summary similarity easy. Throw an llm in the mix and it’s powerful. When the conversation is very long there will also be summaries of entire sections, for example the first 1/3 of the original messages might get condensed into a single message.
So, deep into a conversation what is being sent to the api is not what you see on your screen. When you send your message it is analysed by gpt4 which makes a plan for what should be sent. What’s sent to the api is an assemblage of some complete messages from user and gpt4 in their original state (mostly recent ones), some reworked versions of some of the messages, and summaries of some entire sections. All optimised to maintain very clear intent… and minimise token use!
Sometimes the next question in a conversation doesn’t actually need the whole history to work well, sometimes that huge history muddies the waters. You can be 6k tokens into a conversation and when you hit send and it assembles the best history for a good response it’s actually just 3 or 4 messages. It’s only sending a few hundred tokens - and the response is top notch.
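The assembly step, as a simplified sketch (collection names and the recency/top-k split are just for illustration):

```python
# Build the history sent to gpt4 from the most relevant shadow messages
# plus a handful of recent originals, instead of the full transcript.
import chromadb

chroma = chromadb.Client()
shadow_col = chroma.get_or_create_collection("shadow_messages")

def build_optimized_history(system_msg, recent_originals, new_user_msg, k=4):
    # Pull the k shadow messages/summaries most similar to the new message.
    relevant = shadow_col.query(query_texts=[new_user_msg], n_results=k)
    context = [{"role": "system", "content": doc} for doc in relevant["documents"][0]]
    return (
        [{"role": "system", "content": system_msg}]
        + context
        + recent_originals
        + [{"role": "user", "content": new_user_msg}]
    )
```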
In one sentence: The system balances the amount of tokens used with the accuracy of intent and sends a custom optimised message list each time.
There’s a ton more technical detail I left out so it isn’t more confusing, hope that gets the idea across!
Thanks for all of this, and the original question.
Further, I gather it’s confirmed that we’re incurring the cost for the entire conversation just as if we’d sent it up ourselves.
I’m curious also about the Assistant instructions. Do we incur token costs for the Assistant instructions on every run? Had been thinking not likely, but now not so sure.
Instruction max length is 32768 characters, which could be used to provide meaningful context. But say you use all of it, are you incurring token charges for all 5k+ words every run?
I would say yes. They are part of the input tokens. The messages generated by the run are the output tokens. You get billed for both.
Yes, the Assistants setup is kind of just a way of managing system messages, plus conversation management done on their end. Things still work exactly the same as far as presenting info to the LLM: it sends the whole conversation with the system message every time.
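If you want to sanity-check what those instructions cost per run, a rough estimate with tiktoken (the exact tokenizer the Assistants backend uses isn't documented, so treat this as approximate):

```python
import tiktoken

instructions = "..."  # the instruction text attached to your Assistant
enc = tiktoken.encoding_for_model("gpt-4")
per_run = len(enc.encode(instructions))
print(f"~{per_run} input tokens billed on every run, just for the instructions")
```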
I built a working prototype based on the Assistant API.
I just remembered from reading the messages that I spent 10 USD building it. I agree with everyone that there should be some sort of token management feature inside the threads, e.g. a param like chat_history_n = 10, where you can specify the last n user/assistant exchanges that will be sent to the LLM inside the thread. You don't want to be charged for 128k tokens just by saying 'bye' in an ongoing thread.
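Something like this on the client side is all that's really being asked for (chat_history_n is a hypothetical parameter, not something the API offers today):

```python
def truncate_history(messages, chat_history_n=10):
    # Keep the system message plus only the last n user/assistant exchanges.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-2 * chat_history_n:]  # one exchange = user msg + assistant reply
```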
Thanks for your responses on the use of Assistant instructions and all prior responses being submitted with each new message.
I suppose I can see why this would be the most straightforward way to implement a first version of Assistants. It simply (I know, nothing's simple!) moves the responses and instructions to persistent storage on OAI's servers, then submits it all the same as if it had been sent up by the client.
At first I just assumed there would be a large benefit, some kind of persistent summarization or relief from input token usage in this model, and went deep on implementing it as an evolution of my approach, which is quite Assistant-like but with state on my server. The tools are a differentiator, but I and others have found that file retrieval is erratic: myfiles_browser seems to fail regularly, or at best works intermittently, via the API. Some report it works OK in the Playground.
Also, it’s unclear what costs are being incurred by file retrieval/RAG, with no token consumption reporting by the API.
So, net net, while it’s great to have this first cut and I really dig the way it was implemented and the Assistant architecture, it looks like a wait and see. Hopefully the team at OAI will take it to the next step, getting some cost benefits and token utilization optimization up and running, opportunities that appear to be inherent in the conception and implementation of the Assistant model.
Ron
I would even argue that in its current state the Assistant API is only a wrapper around the chat API with some additional “context keeping” functionality, but no file support.
I also don't understand this function calling. Why not give us real programming language support, where we can implement reliable functions and be able to invoke them? It's really a prompt-and-pray (prompt-and-pay) approach OpenAI is giving us here.
The Assistant API doesn't seem to significantly affect instruction token costs. In my case, the thread's context isn't essential. Is it better to rely solely on the API? With each request I will send the prompt with the instructions and input; I'm not seeing the actual benefits of the Assistant API.
What's your idea for this? What's the 'signal' you'll take from the user/assistant 'content' (which is inherently non-deterministic) that tells you your function needs to be invoked?
I liked the idea behind function calling. The LLM doesn’t run the function, you do - which means you can build intermediate checks/guardrails to ensure that the function/arguments ‘suggested’ by the LLM are correct - it increases reliability by adding more ‘human’ in the loop.
Not only that, you can inject more arguments as your function might require user details that the LLM should not even be aware of.
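A hedged sketch of that pattern with the chat completions tools interface (the tool name, schema, and model here are made up for illustration):

```python
# The model only *suggests* a call; your code validates the arguments and
# injects trusted values (like the user id) that the model never sees.
import json
from openai import OpenAI

client = OpenAI()
current_user_id = "user-42"  # known to your app, never sent to the model

def get_order_status(order_id: str, user_id: str) -> str:
    # Your real implementation goes here.
    return f"Order {order_id} for {user_id}: shipped"

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the status of an order",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[{"role": "user", "content": "Where is my order 1234?"}],
    tools=tools,
)

msg = resp.choices[0].message
if msg.tool_calls:
    call = msg.tool_calls[0]
    args = json.loads(call.function.arguments)
    # Guardrail: check the name and arguments before running anything.
    if call.function.name == "get_order_status" and args.get("order_id", "").isdigit():
        result = get_order_status(order_id=args["order_id"], user_id=current_user_id)
```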
If you upload files to the Assistant (not to the Thread), do they also incur additional token costs when the Thread is processing messages?
I would say only when you reference the file in your prompt or via the thread file_ids… But it's all very vague… Not 100% sure about it, and it's hard to debug and track since you can't see how many tokens were generated/used during a “run”. Guess that is intentional.
Hi, the Assistant is an excellent tool with much potential…
Can you explain how the tokens are counted when a file is uploaded to the Assistant? Are they counted once when the chunks are tokenized, or each time there is a Run for the Assistant? After tokenization and embeddings are generated per chunk, I presume that behind the scenes only the nearest “Top K” chunks (is ‘K’ settable?) are selected to answer the query? Thanks! -Andy
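For what it's worth, the generic flow being presumed here looks like this (not OpenAI's documented retrieval internals, just a standard RAG pattern sketched with chromadb):

```python
import chromadb

chroma = chromadb.Client()
chunks_col = chroma.get_or_create_collection("file_chunks")

def index_file(text: str, chunk_size: int = 1000):
    # Chunk and embed once, at upload time.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks_col.add(documents=chunks, ids=[f"chunk-{i}" for i in range(len(chunks))])

def retrieve(query: str, k: int = 3):
    # Per run: embed the query and pull only the top-k nearest chunks into the prompt.
    hits = chunks_col.query(query_texts=[query], n_results=k)
    return hits["documents"][0]
```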